On Sun, Mar 31, 2002 at 03:53:52PM -0600, David Starner wrote:
> The dict standard dictates that all data crossing the wire shall be in
> UTF-8. Unfortunately, the reference implementation doesn't even try to
> get it right. I was discussing the issue with a maintainer of a Russian
> dictionary for dict, and part of the problem was that there was no UTF-8
> regex engine. Does anyone know of a UTF-8 regex engine, preferably one
> that can be plugged into a GPL'ed C program easily?

I know GNU grep (at least the alpha versions) implements generic multibyte
support. That's not an easy drop-in, of course. It was also orders of
magnitude slower; I don't know whether it was simply unoptimized.

pcre(7) mentions experimental UTF-8 support. I haven't tried it. By the
description, it looks extremely limited. In particular:

> 5. A class is matched against a UTF-8 character instead of just a
> single byte, but it can match only characters whose values are less
> than 256. Characters with greater values always fail to match a class.

-- 
Glenn Maynard
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
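
To make the class limitation quoted above concrete, here is a minimal sketch in C of matching a character class against decoded code points instead of bytes. It is not taken from dict, grep, or PCRE; the names utf8_decode, cp_range, class_match, and class_span are hypothetical, and a class is assumed to be a plain list of inclusive code point ranges. Because the comparison happens on the decoded value, nothing restricts a class to values below 256.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is empty, truncated, or malformed. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                      /* ASCII, one byte */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07;
    } else {
        return 0;                           /* stray continuation or invalid lead byte */
    }
    if (n < len)
        return 0;                           /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min_cp[len])
        return 0;                           /* overlong encoding */
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                           /* out of range or surrogate */
    *cp = c;
    return len;
}

/* Assumed class representation: inclusive code point ranges. */
struct cp_range { uint32_t lo, hi; };

/* Returns 1 if cp falls inside any of the nr ranges, else 0. */
static int class_match(const struct cp_range *r, size_t nr, uint32_t cp)
{
    size_t i;
    for (i = 0; i < nr; i++)
        if (cp >= r[i].lo && cp <= r[i].hi)
            return 1;
    return 0;
}

/* Roughly "[class]*" anchored at the start of s: returns how many bytes of s
 * are covered by leading characters in the class. Scanning stops at the first
 * non-member or at malformed input. */
static size_t class_span(const unsigned char *s, size_t n,
                         const struct cp_range *r, size_t nr)
{
    size_t pos = 0;
    while (pos < n) {
        uint32_t cp;
        size_t len = utf8_decode(s + pos, n - pos, &cp);
        if (len == 0 || !class_match(r, nr, cp))
            break;
        pos += len;
    }
    return pos;
}
```

With a single range covering the Cyrillic block (U+0400 to U+04FF), class_span applied to the UTF-8 bytes of a Russian word followed by ASCII text returns the byte length of the Russian word. A real engine would also need case folding, negated classes, and a faster class lookup than a linear scan; none of that is attempted here.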