On 2012-08-02 12:28:03 +0000, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:

> Regarding the problem at hand, it's becoming painfully obvious to me that the lexer MUST do its own decoding internally.

That's not a great surprise to me. I hit the same issues when writing my XML parser, which is why I invented functions called frontUnit and popFrontUnit. I'm glad you're realizing this.
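
Roughly, the idea is something like the sketch below. The bodies are illustrative only (my actual signatures may differ); the point is raw code-unit access with no auto-decoding:

// Sketch: raw code-unit access for a char slice, bypassing the
// auto-decoding that std.range's front/popFront would perform.
char frontUnit(const(char)[] s)
{
    return s[0]; // one UTF-8 code unit, not a decoded dchar
}

void popFrontUnit(ref const(char)[] s)
{
    s = s[1 .. $]; // advance exactly one code unit
}

You iterate code units by default and only decode when a token can actually contain non-ASCII characters.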

> Hence, a very simple thing to do is have the entire lexer only deal with ranges of ubyte. If someone passes a char[], the lexer's front end can simply call s.representation and obtain the underlying ubyte[].

That's ugly, but it could work (assuming s.representation returns the array reinterpreted in place, not a copy). I still prefer my frontUnit and popFrontUnit approach though.
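
For concreteness, here is roughly what that front end would look like; tokenize is just a placeholder name:

import std.string : representation;

void tokenize(const(char)[] source)
{
    // representation reinterprets the same memory as raw code units;
    // no copy is made, so slices of bytes still alias the input text.
    const(ubyte)[] bytes = source.representation;
    // ... run the ubyte-based lexer core over bytes ...
}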

In fact, any parser for which speed is important will have to bypass std.range's clever handling of UTF characters. Dealing simply with ubytes isn't enough, since in some cases you'll want to fire up the UTF decoder.
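
Concretely, that means an ASCII fast path over raw code units, firing up std.utf.decode only when a multi-byte sequence shows up. A sketch (scan is a placeholder name):

import std.utf : decode;

void scan(const(char)[] src)
{
    size_t i = 0;
    while (i < src.length)
    {
        if (src[i] < 0x80)
        {
            ++i; // ASCII: one code unit, no decoding needed
        }
        else
        {
            dchar c = decode(src, i); // decodes one code point, advances i
            // ... handle the non-ASCII character c ...
        }
    }
}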

The next issue, which I haven't seen discussed here, is that an efficient parser should operate on buffers. You can make it work with arbitrary ranges, but if you don't have a buffer you can slice when you need to preserve a string, you have to build the string character by character, which is not efficient at all. And you can only return slices when the underlying representation matches the output representation, so unless your API has a templated output type, you end up special-casing a lot of things.
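
To illustrate the gap, here is the same token extractor written both ways (a sketch with illustrative names, not code from my parser):

import std.array : appender;
import std.ascii : isAlphaNum;
import std.range.primitives;

bool isIdentChar(dchar c) { return isAlphaNum(c) || c == '_'; }

// Sliceable buffer: the token is a zero-copy slice of the input.
const(char)[] takeIdentifier(const(char)[] buf, ref size_t pos)
{
    immutable start = pos;
    while (pos < buf.length && isIdentChar(buf[pos]))
        ++pos;
    return buf[start .. pos];
}

// Arbitrary character range: the token must be built element by
// element, allocating as it grows.
string takeIdentifier(R)(ref R input)
    if (isInputRange!R)
{
    auto result = appender!string();
    while (!input.empty && isIdentChar(input.front))
    {
        result.put(input.front);
        input.popFront();
    }
    return result.data;
}

The second version is what you're forced into when the input isn't sliceable, and it's also where the output-representation mismatch bites.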

After having attempted an XML parser with ranges, I'm not sure parsing with generic ranges can be made very efficient. Automatic conversion to UTF-32 is a nuisance for performance, and if the output needs to return parts of the input, you end up writing an inefficient special case that allocates many new strings just to get them into the right format.

I wonder how your call with Walter will turn out.

--
Michel Fortin
michel.for...@michelf.ca
http://michelf.ca/
