Yeah, so this page shows that C++11 regex is still mostly unsupported in gcc:

http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1

(see section 7)

And I don't think the old-school GNU regex we use otherwise knows anything about wide chars. It simply compares bytes and doesn't have a clue whether some of them should be considered part of the same character. I suspect that's because nowhere do we tell it that we're giving it UTF-8.
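To illustrate the kind of behavior I mean, here is a rough sketch in Python. It uses Python's own re module, not GNU regex or std::regex, so it only shows the general difference between matching on bytes and matching on characters; it says nothing definite about what our code actually does.

```python
import re

EN_DASH = "\u2013"                  # U+2013 EN DASH
utf8 = EN_DASH.encode("utf-8")      # the three bytes E2 80 93

# Character-aware matching: "." consumes the en dash as one code point.
char_hits = re.findall(r".", EN_DASH)

# Byte-oriented matching: "." consumes one byte at a time,
# so the same en dash is seen as three separate "characters".
byte_hits = re.findall(rb".", utf8)

# A bracket expression has the same problem on bytes: [<en dash>] becomes
# a set of three unrelated bytes, each of which matches on its own.
text = "a\u2013b"
char_class_hits = re.findall("[\u2013]", text)
byte_class_hits = re.findall(b"[" + utf8 + b"]", text.encode("utf-8"))

print(len(char_hits), len(byte_hits))
print(char_class_hits, byte_class_hits)
```

An engine that is never told the input is UTF-8 can only behave like the byte-oriented half of this sketch.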

Ultimately my hope is that gcc will improve eventually and solve our problem for us.

We could add an option to use ICU RegexMatcher, but I'm still holding out for our compiler.

Troy


On 03/06/2017 05:52 PM, Karl Kleinpaste wrote:
On 03/06/2017 05:25 PM, Greg Hellings wrote:
being off by 2 would seem strange to me
I don't understand this question at all.

0xE2 = 226 = 0342
0x80 = 128 = 0200
0x93 = 147 = 0223

There's no off-by error at all.

"od" is the "octal dump" tool; given -c, it tries to dump characters, but outside 7-bit ASCII, it still dumps octal.

For those familiar with dc(1), this will make sense:
$ dc
8o
226p
342
128p
200
147p
223
16i
0XE2p
342
0X80p
200
0X93p
223
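For anyone who would rather not use dc, the same check can be sketched in Python. It assumes the character in question is U+2013 (en dash), which is what the three bytes above decode to as UTF-8.

```python
EN_DASH = "\u2013"  # assumed to be the character under discussion

# Show each UTF-8 byte in hex, decimal and octal, in the same
# "0xE2 = 226 = 0342" layout used above.
lines = [
    f"0x{byte:02X} = {byte} = 0{byte:o}"
    for byte in EN_DASH.encode("utf-8")
]

for line in lines:
    print(line)
```

The octal column is what od -c prints for bytes outside 7-bit ASCII.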

The interesting questions are why C++11 regex can't find /en dash/, and why non-C++11 regex doesn't understand multibyte.


_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
