On 25.08.2014 at 23:53, "Ola Fosheim Grøstad"
<ola.fosheim.grostad+dl...@gmail.com> wrote:
> On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote:
>> But why should UTF validation be the job of the lexer in the first
>> place?
>
> Because you want to save time: it is faster to integrate validation. The
> most likely usage scenario is receiving REST data over HTTP that needs
> validation.
Well, then I agree with Andrei… array of bytes it is. ;-)
>> added as a separate proxy range. But if we end up going for validating
>> in the lexer, it would indeed be enough to validate inside strings,
>> because the rest of the grammar assumes a subset of ASCII.
>
> Not assumes, but defines! :-)
I guess it depends on whether you look at the grammar as productions or
comprehensions (right term?). ;)
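
Either way, the practical consequence is the same: only the string
scanner ever sees bytes >0x7F, so it is the only place that needs to
run UTF-8 sequence checks. A minimal sketch (hypothetical code, not the
proposal's actual implementation):

import std.utf : decode;

// Scan a JSON string literal; `input` starts just after the opening
// quote. ASCII bytes are handled directly, and only a byte with the
// MSB set triggers UTF-8 validation. Returns the index of the closing
// quote. Escape handling is simplified to keep the sketch short.
size_t scanString(const(char)[] input)
{
    size_t i = 0;
    while (i < input.length)
    {
        const c = input[i];
        if (c == '"') return i;     // end of the string literal
        else if (c == '\\') i += 2; // skip escape (simplified)
        else if (c < 0x80) i++;     // plain ASCII, nothing to validate
        else decode(input, i);      // validates the UTF-8 sequence and
                                    // throws UTFException if it is bad
    }
    throw new Exception("unterminated string literal");
}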
> If you have to validate UTF before lexing, then you will end up
> needlessly scanning lots of ASCII if the file contains lots of
> non-strings or comes from an encoder that only emits pure ASCII.
That's true. So the ideal solution would be to *assume* UTF-8 when the
input is char based and to *validate* when the input is "numeric", i.e.
ubyte based.
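
In D terms, the distinction can be made statically on the element type
of the input range. A rough sketch (the name `needsValidation` is made
up for illustration):

import std.range.primitives : isInputRange, ElementEncodingType;

// Decide at compile time whether the lexer must validate UTF-8:
// char-based input is assumed to already hold valid UTF-8, while raw
// ubyte input gets validated inside strings.
template needsValidation(R)
    if (isInputRange!R)
{
    alias E = ElementEncodingType!R;
    static if (is(immutable E == immutable char))
        enum needsValidation = false; // assume valid UTF-8
    else static if (is(immutable E == immutable ubyte))
        enum needsValidation = true;  // validate while lexing
    else
        static assert(false, "expected a char or ubyte input range");
}

unittest
{
    static assert(!needsValidation!string);
    static assert( needsValidation!(immutable(ubyte)[]));
}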
> If you want to have "plugin" validation of strings, then you also need
> to differentiate strings so that the user can select which data should
> be just ASCII, UTF-8, numbers, IDs etc. Otherwise the user will end up
> doing double validation (you have to bypass >0x7F followed by
> string-end anyway).
>
> The advantage of integrated validation is that you can use 16-byte SIMD
> registers on the buffer.
>
> I presume you can load 16 bytes and do a bitwise AND on the MSBs, then
> match against string-end, and carefully use this to boost the
> performance of simultaneous UTF validation, escape scanning, and
> string-end scanning. A bit tricky, of course.
Well, that's definitely out of the scope of this proposal. An
interesting direction to pursue, though.
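
For the record, here is a scalar stand-in for that idea (a real
implementation would compute these masks with SSE2 byte compares and a
movemask; none of the names below come from the proposal):

// Classify 16 input bytes in one pass, producing the per-byte bitmasks
// a SIMD movemask would yield: bit i describes chunk[i].
struct ChunkMasks
{
    ushort nonAscii; // MSB set: byte is part of a UTF-8 sequence
    ushort quote;    // '"': candidate string-end
    ushort escape;   // '\\': may neutralize a following quote
}

ChunkMasks classifyChunk(const(ubyte)[16] chunk)
{
    ChunkMasks m;
    foreach (i, b; chunk)
    {
        if (b & 0x80)  m.nonAscii |= cast(ushort)(1 << i);
        if (b == '"')  m.quote    |= cast(ushort)(1 << i);
        if (b == '\\') m.escape   |= cast(ushort)(1 << i);
    }
    return m;
}

With the three masks in hand, the actual string end is the first quote
bit that is not preceded by an unpaired escape bit, and the UTF-8
validator only has to look at positions where nonAscii bits are set.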
>> At least no UTF validation is needed. Since all non-ASCII characters
>> will always be composed of bytes >0x7F, a sequence \uXXXX can be
>> assumed to be valid wherever in the string it occurs, and all other
>> bytes that don't belong to an escape sequence are just passed through
>> as-is.
>
> You cannot assume \u… to be valid if you convert it.
I meant "X" to stand for a hex digit. The point was just that you don't
have to worry about interacting in a bad way with UTF sequences when you
find "\uXXXX".