On 2010-06-28 14:27:13 -0400, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:

>> Here's the generated documentation:
>> 
>> http://michelf.com/docs/d/mfr/xmltok.html
>> http://michelf.com/docs/d/mfr/xml.html
>> 
>> I'm slowly revamping it to use ranges instead of strings.
> 
> I think a tokenizer should be a higher-order range that is fed an input range of ubyte, char, wchar, or dchar (so that would be a type parameter) and is itself a range of Tokens that include the token type, token value etc.

And I've implemented a tokenizer range just like you describe on top of my tokenizer function. Look at the documentation for mfr.xmltok.XMLForwardRange. (I should probably rename it to XMLTokenRange.)
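To give an idea of the shape, such a range is just a struct whose front returns a Token value. Here is a toy sketch, not the actual mfr.xmltok API: the token kinds and fields are made up, and it only separates tags from text.

import std.stdio : writeln;
import std.string : indexOf;

// Toy token kinds; the real tokens carry more information
// (attributes, empty-element flag, etc.).
enum TokenType { tag, text }

struct Token
{
    TokenType type;
    string value;   // tag contents without the angle brackets, or a text run
}

// A minimal range of Tokens over a string, illustrating the
// "tokenizer as a range of Tokens" shape; not the actual XMLForwardRange.
struct TokenRange
{
    private string input;
    private Token current;
    private bool done;

    this(string s) { input = s; popFront(); }

    @property bool empty() { return done; }
    @property Token front() { return current; }

    void popFront()
    {
        if (input.length == 0) { done = true; return; }
        if (input[0] == '<')
        {
            // Everything up to the closing '>' is the tag token's value.
            auto close = input.indexOf('>');
            auto end = close < 0 ? input.length : close;
            current = Token(TokenType.tag, input[1 .. end]);
            input = close < 0 ? null : input[close + 1 .. $];
        }
        else
        {
            // Everything up to the next '<' is a text token.
            auto open = input.indexOf('<');
            auto end = open < 0 ? input.length : open;
            current = Token(TokenType.text, input[0 .. end]);
            input = input[end .. $];
        }
    }
}

void main()
{
    foreach (tok; TokenRange("<root>hello<br/></root>"))
        writeln(tok.type, ": ", tok.value);
}

Templating Token and the range on the character type would give the char/wchar/dchar parameterization you describe.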

Personally, I prefer the callback approach, which automatically calls the right function according to the token type. But what's nice about my tokenizer is that you can do both callback-style and pull-style tokenization (the latter can be wrapped in a range), and mix these approaches together as needed.
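To illustrate the two styles with stand-in types (none of this is the real mfr.xmltok signature): in the callback style you hand the tokenizer a delegate per token type you care about and it dispatches for you; in the pull style you drive the loop yourself.

import std.stdio : writeln;

// Illustrative token kinds and a made-up tokenize() driver, only to show
// the two calling styles.
enum TokenType { openTag, closeTag, text }

struct Token
{
    TokenType type;
    string value;
}

// Callback style: the tokenizer drives iteration and calls the handler
// matching each token's type; handlers you don't care about can be omitted.
void tokenize(Token[] tokens,
              void delegate(Token) onOpenTag = null,
              void delegate(Token) onCloseTag = null,
              void delegate(Token) onText = null)
{
    foreach (tok; tokens)
    {
        final switch (tok.type)
        {
            case TokenType.openTag:  if (onOpenTag !is null)  onOpenTag(tok);  break;
            case TokenType.closeTag: if (onCloseTag !is null) onCloseTag(tok); break;
            case TokenType.text:     if (onText !is null)     onText(tok);     break;
        }
    }
}

void main()
{
    auto tokens = [Token(TokenType.openTag, "root"),
                   Token(TokenType.text, "hello"),
                   Token(TokenType.closeTag, "root")];

    // Push style: pass handlers, dispatch happens inside the tokenizer.
    tokenize(tokens,
             delegate(Token t) { writeln("open: ", t.value); },
             null,
             delegate(Token t) { writeln("text: ", t.value); });

    // Pull style: iterate the tokens yourself and branch as needed; this is
    // the loop a range wrapper like XMLForwardRange packages up for you.
    foreach (tok; tokens)
        if (tok.type == TokenType.text)
            writeln("pulled text: ", tok.value);
}

Both styles can sit on top of the same underlying tokenizer function, which is what makes mixing them cheap.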

What is missing is support for arbitrary ranges as input (it currently deals only with strings). Strings are the optimized case for tokenization because you don't have to dynamically allocate anything: slicing the original string is enough to produce substrings. With arbitrary ranges you have to copy the text and tag names into a string one character at a time, which is less efficient. I don't want to write two separate parsers for this, so I'm trying to abstract things at the right level to maximize code reuse while keeping performance optimal for the string-as-input case, but how to do that is not so obvious.
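For example, the substring step could be abstracted on the input type: slice when the input is a string, copy otherwise. A rough sketch, where readName and CharStream are made-up names for illustration:

import std.ascii : isAlphaNum;
import std.range.primitives : ElementType, isInputRange;
import std.stdio : writeln;
import std.traits : isSomeString;

// CharStream is a deliberately minimal input range over a string, standing
// in for "arbitrary range" input such as a decoded file stream.
struct CharStream
{
    string data;
    @property bool empty() { return data.length == 0; }
    @property char front() { return data[0]; }
    void popFront() { data = data[1 .. $]; }
}

// readName is a made-up helper: it reads a (crudely defined) XML name from
// the front of the input and returns it. For string input the result is a
// zero-copy slice of the original; for any other input range the characters
// are copied into a newly allocated buffer.
auto readName(R)(ref R input)
    if (isInputRange!R)
{
    static if (isSomeString!R)
    {
        // Fast path: no allocation, just slice the caller's string.
        size_t i = 0;
        while (i < input.length && isAlphaNum(input[i])) ++i;
        auto name = input[0 .. i];
        input = input[i .. $];
        return name;
    }
    else
    {
        // Generic path: copy one character at a time into a new buffer.
        immutable(ElementType!R)[] buf;
        while (!input.empty && isAlphaNum(input.front))
        {
            buf ~= input.front;
            input.popFront();
        }
        return buf;
    }
}

void main()
{
    string s = "root attr='1'>";
    writeln(readName(s));                   // slice of s: "root"

    auto cs = CharStream("root attr='1'>");
    writeln(readName(cs));                  // copied: "root"
}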

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
