Yeah, so this page shows that C++11 regex is still mostly unsupported in gcc:
http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1
(see section 7)
And I don't think the old-school GNU regex we use otherwise knows
anything about wide chars. It simply compares bytes and doesn't have a
clue whether some of them should be considered part of the same
character. I suspect that's because nowhere do we tell it that we're
giving it UTF-8.
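To make the byte-versus-character point concrete, here is a minimal sketch. It is not SWORD code; the helper names byteCount and codePointCount are made up for illustration, and it assumes the input is well-formed UTF-8. It shows that an en dash is one character to a reader but three units to anything that only compares bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the number of bytes, which is all a
// byte-oriented matcher ever sees.
std::size_t byteCount(const std::string &s) {
    return s.size();
}

// Hypothetical helper: the number of UTF-8 code points. Every byte
// that is not a continuation byte (bit pattern 10xxxxxx) starts a new
// code point. Assumes the input is well-formed UTF-8.
std::size_t codePointCount(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the en dash, "\xE2\x80\x93", byteCount gives 3 and codePointCount gives 1, so a pattern element that means "one character" to us lines up with only the first byte as far as a byte matcher is concerned.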
Ultimately my hope is that gcc will improve eventually and solve our
problem for us. We could add an option to use ICU RegexMatcher, but
I'm still holding out for our compiler.
Troy
On 03/06/2017 05:52 PM, Karl Kleinpaste wrote:
On 03/06/2017 05:25 PM, Greg Hellings wrote:
being off by 2 would seem strange to me
I don't understand this question at all.
0xE2 = 226 = 0342
0x80 = 128 = 0200
0x93 = 147 = 0223
There's no off-by error at all.
"od" is the "octal dump" tool; given -c, it tries to dump characters,
but outside 7-bit ASCII, it still dumps octal.
For those familiar with dc(1), this will make sense:
$ dc
8o
226p
342
128p
200
147p
223
16i
0XE2p
342
0X80p
200
0X93p
223
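For anyone who would rather see the same conversion in C++ than in dc, here is a small sketch. The function name toOctal is made up for illustration; it only reproduces the three-digit octal form that od prints for bytes outside 7-bit ASCII.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format one byte as three octal digits, the
// form od uses for bytes outside 7-bit ASCII.
std::string toOctal(unsigned char byte) {
    char buf[4];  // three octal digits plus the terminating NUL
    std::snprintf(buf, sizeof buf, "%03o", static_cast<unsigned>(byte));
    return buf;
}
```

Feeding it 0xE2, 0x80 and 0x93 gives "342", "200" and "223", matching the dc output above.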
The interesting questions are why C++11 regex can't find /en dash/,
and why non-C++11 regex doesn't understand multibyte.
_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page