Re: Is there a UTF-8 regex library?

Glenn Maynard Sun, 31 Mar 2002 17:43:19 -0800

On Sun, Mar 31, 2002 at 03:53:52PM -0600, David Starner wrote:
> The dict standard dictates that all data crossing the wire shall be in
> UTF-8. Unfortunately, the reference implementation doesn't even try to
> get it right. I was discussing the issue with a maintainer of a Russian
> dictionary for dict, and part of the problem was that there was no UTF-8
> regex engine. Does anyone know of a UTF-8 regex engine, preferably one
> that can be plugged into a GPL'ed C program easily?


I know GNU grep (at least the alpha versions) implements generic multibyte
support.  That's not an easy drop-in, of course.  It was also orders of
magnitude slower; I don't know if it was simply unoptimized.

pcre(7) mentions experimental UTF-8 support.  I haven't tried it.  Going by
the description, it looks extremely limited.  In particular:

> 5. A class is matched against a UTF-8 character instead of just a
> single byte, but it can match only characters whose values are less
> than 256. Characters with greater values always fail to match a class.
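
To make the problem concrete, here is a rough sketch of what goes wrong when
a byte-oriented engine is pointed at UTF-8 data.  It is not taken from dict,
grep, or PCRE; Python's re module is used only as a stand-in engine, and the
helper names are made up for the illustration.

```python
import re

# Hypothetical helper names; Python's re is only a stand-in engine here.
CYRILLIC_CLASS = "[а-я]"   # U+0430..U+044F

def byte_level_search(pattern: str, text: str) -> bool:
    """Match the way a byte-oriented engine would: on raw UTF-8 bytes."""
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None

def char_level_search(pattern: str, text: str) -> bool:
    """Match on decoded characters, as a UTF-8-aware engine should."""
    return re.search(pattern, text) is not None

if __name__ == "__main__":
    # "я" is two bytes in UTF-8 (D1 8F), so a byte-level "." is not one character.
    print(byte_level_search("^.$", "я"))            # False
    print(char_level_search("^.$", "я"))            # True

    # Encoded, the class becomes {D0, B0..D1, 8F}: a set of bytes, not letters.
    # "é" is C3 A9, and C3 falls inside B0..D1, so the byte-level class
    # reports a Cyrillic letter where there is none.
    print(byte_level_search(CYRILLIC_CLASS, "é"))   # True (false positive)
    print(char_level_search(CYRILLIC_CLASS, "é"))   # False
```

Both failures come from treating one byte as one character, which is why a
dictionary full of Russian headwords cannot be searched correctly that way.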

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
