Re: Phobos Proposal: replace std.xml with kxml.

Michel Fortin Tue, 04 May 2010 19:00:21 -0700

On 2010-05-04 19:41:39 -0400, Andrei Alexandrescu<seewebsiteforem...@erdani.org> said:

Anyway, just in case, would you be interested in an XML tokenizer andsimple DOM following this model?
    http://michelf.com/docs/d/mfr/xmltok.html
    http://michelf.com/docs/d/mfr/xml.html
At the base is a pull parser and an event parser mixed in the samefunction template: "tokenize", allowing you to alternate betweeneven-based and pull-parsing at will. I'm using it, but its developmentis on hold at this time, I'm just maintaining it so it compiles on thenewest versions of DMD.
Sounds great, but I need to defer XML expertise to others.

If someone else wants to use it, I offer it. Otherwise I'll surelycontinue working on it eventually.

Anyway, the problem above is probably the one reason we might want towrite the parser from scratch: it needs to bind to specializablehigher-level parsing functions to take advantage of the performancecharacteristics of certain ranges, such as those you can slice.
There are a number of issues. One is that you should allow wchar anddchar in addition to char as basic character types (probably ubyte toofor exotic encodings). In essence the char type should be a templateparameter.

I totally agree about wchar and dchar... and you also need ubyte withan encoding detection system (checking the encoding in the xml prolog)to correctly parse XML files in any encoding. I think I have a ubyteparser, but it currently just accepts UTF-8 and then branch to the"string" version. (UTF-16 would be needed to implement correctly theXML spec.)

The other is that perhaps you could be able to use zero-based slices,i.e. s[0 .. i] as opposed to arbitrary slices s[i .. j]. A zero-basedslice can be supported better than an arbitrary one.

Yeah, well, it doesn't really work so well. You have to parse the inputbefore slicing. XML doesn't tell you in advance how many characters orcode units a string will take.


What you need is more like this:

        // Example XML Parser

        bool isAtEndOfAttributeContent(dchar char) {
                if (char == '"') return true;
                if (char == '<') throw new ParseError();
                return false;
        }

        void parseXML(T)(T input) if (IsInputRange!T) {
                [...]
                case '"':
                        input.popFront(); // remove leading quote
                        string content = 
readUntil!(isAtEndOfAttributeContent)(input);
                        assert(input.front == '"');
                        input.popFront(); // remove tailing quote
                [...]
        }


        // Example parsing primitive

        // String version: can slice
        string readUntil(isAtEndPredicate)(ref string input) {
                string savedInput;
                while (!input.empty && isAtEndPredicate(input.front)) {
                        input.popFront();
                }
                return savedInput[0..$-input.length];
        }

        // Generic input range version: can't slice, must copy

immutable(ElementType!T)[] readUntil(isAtEndPredicate, T)(T input) if(IsInputRange!T) {

                immutable(ElementType!T)[] copy; // should use appender here
                while (!input.empty) {
                        dchar frontChar = input.front;
                        if (isAtEndPredicate(frontChar))
                                break;
                        else
                        copy ~= frontChar;
                        input.popFront();
                }
                return copy;
        }

It's easy to appreciate the difference in performance between thestring version and the generic version of readUntil just by looking atthe code.



--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/

Re: Phobos Proposal: replace std.xml with kxml.

Reply via email to