On 2010-05-04 19:41:39 -0400, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:

Anyway, just in case, would you be interested in an XML tokenizer and simple DOM following this model?

    http://michelf.com/docs/d/mfr/xmltok.html
    http://michelf.com/docs/d/mfr/xml.html

At the base is a pull parser and an event parser mixed in the same function template: "tokenize", allowing you to alternate between even-based and pull-parsing at will. I'm using it, but its development is on hold at this time, I'm just maintaining it so it compiles on the newest versions of DMD.

Sounds great, but I need to defer XML expertise to others.

If someone else wants to use it, I offer it. Otherwise I'll surely continue working on it eventually.


Anyway, the problem above is probably the one reason we might want to write the parser from scratch: it needs to bind to specializable higher-level parsing functions to take advantage of the performance characteristics of certain ranges, such as those you can slice.

There are a number of issues. One is that you should allow wchar and dchar in addition to char as basic character types (probably ubyte too for exotic encodings). In essence the char type should be a template parameter.

I totally agree about wchar and dchar... and you also need ubyte with an encoding detection system (checking the encoding in the xml prolog) to correctly parse XML files in any encoding. I think I have a ubyte parser, but it currently just accepts UTF-8 and then branch to the "string" version. (UTF-16 would be needed to implement correctly the XML spec.)


The other is that perhaps you could be able to use zero-based slices, i.e. s[0 .. i] as opposed to arbitrary slices s[i .. j]. A zero-based slice can be supported better than an arbitrary one.

Yeah, well, it doesn't really work so well. You have to parse the input before slicing. XML doesn't tell you in advance how many characters or code units a string will take.

What you need is more like this:

        // Example XML Parser

        bool isAtEndOfAttributeContent(dchar char) {
                if (char == '"') return true;
                if (char == '<') throw new ParseError();
                return false;
        }

        void parseXML(T)(T input) if (IsInputRange!T) {
                [...]
                case '"':
                        input.popFront(); // remove leading quote
                        string content = 
readUntil!(isAtEndOfAttributeContent)(input);
                        assert(input.front == '"');
                        input.popFront(); // remove tailing quote
                [...]
        }


        // Example parsing primitive

        // String version: can slice
        string readUntil(isAtEndPredicate)(ref string input) {
                string savedInput;
                while (!input.empty && isAtEndPredicate(input.front)) {
                        input.popFront();
                }
                return savedInput[0..$-input.length];
        }

        // Generic input range version: can't slice, must copy
immutable(ElementType!T)[] readUntil(isAtEndPredicate, T)(T input) if (IsInputRange!T) {
                immutable(ElementType!T)[] copy; // should use appender here
                while (!input.empty) {
                        dchar frontChar = input.front;
                        if (isAtEndPredicate(frontChar))
                                break;
                        else
                        copy ~= frontChar;
                        input.popFront();
                }
                return copy;
        }

It's easy to appreciate the difference in performance between the string version and the generic version of readUntil just by looking at the code.


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/

Reply via email to