On 8/1/2012 11:56 PM, Jonathan M Davis wrote:
Another thing that I should point out is that a range of UTF-8 or UTF-16
wouldn't work with many range-based functions at all. Most of std.algorithm
and its ilk would be completely useless. Range-based functions operate on a
range's elements, so operating on a range of code units would mean operating on
code units, which is going to be _wrong_ almost all the time. Think about what
would happen if you used a function like map or filter on a range of code
units. The resultant range would be completely invalid as far as Unicode goes.
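
For instance (a minimal illustrative sketch, not code from the thread; a ubyte[] stands in for a non-decoding range of UTF-8 code units, since a D string used as a range decodes to dchar):

    import std.algorithm : filter;
    import std.array : array;

    void main()
    {
        // "aé" as raw UTF-8 code units; 'é' is the two-unit sequence 0xC3 0xA9.
        ubyte[] units = [0x61, 0xC3, 0xA9];

        // filter sees individual code units, not characters, so it can drop
        // half of a multibyte sequence.
        auto broken = units.filter!(u => u != 0xC3).array;

        // broken is now [0x61, 0xA9]; the lone 0xA9 is a continuation byte
        // with no lead byte, so the result is not valid UTF-8.
        assert(broken.length == 2 && broken[1] == 0xA9);
    }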

My experience writing fast string-based code that works on UTF-8 and correctly handles multibyte characters is that such code is entirely possible and practical, and it is faster.
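
Here is a rough sketch of that style of code (illustrative only, not DMD's or std.d.lexer's actual implementation; skipIdentifier is a made-up helper, and std.uni.isAlpha merely stands in for D's real identifier rules):

    import std.uni : isAlpha;
    import std.utf : decode;

    // Every token-significant character in D source is ASCII, so the scanner
    // works on code units and only decodes when a lead byte >= 0x80 shows up
    // (here, inside an identifier).
    size_t skipIdentifier(string src, size_t i)
    {
        while (i < src.length)
        {
            immutable char c = src[i];
            if (c < 0x80)
            {
                if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                    (c >= '0' && c <= '9') || c == '_')
                {
                    ++i;          // fast path: single ASCII code unit
                    continue;
                }
                break;            // ASCII non-identifier character ends the token
            }
            size_t next = i;
            immutable dchar d = decode(src, next);  // slow path: multibyte sequence
            if (!isAlpha(d))
                break;
            i = next;
        }
        return i;
    }

UTF-8 continuation bytes (0x80-0xBF) can never collide with ASCII token characters, so the fast path stays correct for multibyte identifiers without decoding anything it doesn't have to.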

The lexer MUST MUST MUST be FAST FAST FAST, or it will not be useful. If it isn't fast, serious users will eschew it and cook up their own. You'll have a nice, pretty, useless toy of a std.d.lexer.

I think there's some serious underestimation of how critical this is.



Range-based functions need to be operating on _characters_. Technically, not
even code points get us there, so it's _still_ buggy. It's just a _lot_
closer to being correct and works 99.99+% of the time.

Multi-code-point characters (i.e. graphemes) are quite irrelevant to the correctness of a D lexer.
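
For example (an illustrative sketch, not actual lexer code, with escape sequences left out to keep it short): the contents of a string literal can be sliced through as raw bytes, so a grapheme built from several code points, such as 'e' followed by a combining U+0301, needs no special handling.

    // The contents of a string literal are sliced as raw UTF-8 and passed
    // through byte-for-byte; the lexer never needs to know about graphemes.
    string lexStringLiteral(string src, ref size_t i)
    {
        assert(i < src.length && src[i] == '"');
        immutable start = ++i;              // skip the opening quote
        while (i < src.length && src[i] != '"')
            ++i;                            // code units >= 0x80 never equal '"'
        string contents = src[start .. i];
        if (i < src.length)
            ++i;                            // skip the closing quote
        return contents;
    }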


If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add
a concept of variable-length encoded ranges so that it's possible to treat
them as both their encoding and whatever they represent (e.g. code point or
grapheme in the case of ranges of code units).

No, this is not necessary.
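
A D string already offers both views without any new range concept (illustrative sketch only): std.string.representation exposes the code units, and a foreach with a dchar loop variable decodes code points on the fly.

    import std.string : representation;

    void main()
    {
        string s = "étage";

        // Code-unit view: the raw UTF-8 bytes, handy for a lexer's fast path.
        foreach (ubyte u; s.representation)
        {
            // ... byte-level work ...
        }

        // Code-point view: foreach with a dchar loop variable decodes as it goes.
        foreach (dchar c; s)
        {
            // ... code-point-level work ...
        }
    }

Both views look at the same immutable(char)[] buffer; no wrapper range type is involved.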
