> > Once you have a regex library that handles codepoints, the code that uses
> > it doesn't have to care about them in particular.
>
> It's not so simple. Suppose you have the byte sequence (decimal) 65 195 129
> 66. (This is the beginning of the Hungarian alphabet, AÁB..., encoded in
> UTF-8.)

Why is it not so simple? I just want to know two basic things: does it
match or not, and what range of bytes in the string was matched.

I don't care what the regex library does under the covers, and I
shouldn't have to care...
If it did its job right, I can safely extract substrings on those boundaries.

If it knows how to match "Á" against ".", then I don't have to know how
it goes about doing so.
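As a sketch of the point above: decode the bytes once at the boundary, let the regex engine work in codepoints, and map the match offsets back to byte offsets only when you need them. This uses Python's stdlib `re` rather than any particular Perl setup; the offset-mapping step is my own illustration, not a library feature.

```python
import re

# The bytes from the example: "AÁB", the start of the Hungarian alphabet,
# encoded in UTF-8 as (decimal) 65 195 129 66.
data = bytes([65, 195, 129, 66])

# Decode once; from here on the regex engine sees codepoints, not bytes.
text = data.decode("utf-8")          # 3 codepoints: A, Á, B

# "." matches exactly one codepoint, so it matches the two-byte "Á".
m = re.search("A(.)B", text)
assert m is not None
assert m.group(1) == "\u00c1"        # Á

# Map codepoint offsets back to byte offsets for safe byte-level slicing.
cp_start, cp_end = m.start(1), m.end(1)
byte_start = len(text[:cp_start].encode("utf-8"))
byte_end = len(text[:cp_end].encode("utf-8"))

# The matched byte range is [1, 3): the two bytes 195 129.
assert (byte_start, byte_end) == (1, 3)
assert data[byte_start:byte_end].decode("utf-8") == "Á"
```

The calling code never inspects individual bytes; the decode/encode pair at the edges is the only place the encoding is visible.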
Even better if the regex engine handles both normalization forms
transparently. My code should never have to care. I shouldn't have to
jump through hoops, calling all sorts of fancy "binmode" settings or
performing "Encode::decode" incantations everywhere just to turn my
scalars back into plain old strings.
