Re: [lucy-dev] Unicode integration

Marvin Humphrey Tue, 15 Nov 2011 19:50:11 -0800

On Tue, Nov 15, 2011 at 09:51:49PM +0100, Nick Wellnhofer wrote:
> One alternative I could find is utf8proc [2]. It's 20K of C code,  
> MIT-licensed and used for Postgres extensions and a Ruby gem. It  
> supports Unicode normalization, case folding and stripping of accents.


utf8proc also supplies some low-level routines which we might be able to use
to replace stuff in Lucy/Util/StringHelper.c.
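
To make "low-level routines" concrete, here is a minimal sketch of the kind of
helper in question: decoding a single code point from a UTF-8 buffer.  The
function name decode_utf8 and its signature are hypothetical, chosen for
illustration only.  This is not utf8proc's code and not what is in
StringHelper.c today.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not taken from
 * utf8proc or Lucy): decode one code point from a UTF-8 buffer.
 * Returns the number of bytes consumed, or -1 if the sequence is
 * malformed, truncated, overlong, a surrogate, or above U+10FFFF. */
static int
decode_utf8(const uint8_t *buf, size_t len, uint32_t *code_point) {
    /* Smallest code point that legitimately needs N bytes. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    int      count;
    int      i;

    if (len == 0) { return -1; }

    /* The lead byte determines the length of the sequence. */
    if      (buf[0] < 0x80)           { cp = buf[0];        count = 1; }
    else if ((buf[0] & 0xE0) == 0xC0) { cp = buf[0] & 0x1F; count = 2; }
    else if ((buf[0] & 0xF0) == 0xE0) { cp = buf[0] & 0x0F; count = 3; }
    else if ((buf[0] & 0xF8) == 0xF0) { cp = buf[0] & 0x07; count = 4; }
    else                              { return -1; }

    if ((size_t)count > len) { return -1; }  /* truncated input */

    /* Each continuation byte must look like 10xxxxxx. */
    for (i = 1; i < count; i++) {
        if ((buf[i] & 0xC0) != 0x80) { return -1; }
        cp = (cp << 6) | (uint32_t)(buf[i] & 0x3F);
    }

    if (cp < min_for_len[count])      { return -1; }  /* overlong form */
    if (cp >= 0xD800 && cp <= 0xDFFF) { return -1; }  /* surrogate */
    if (cp > 0x10FFFF)                { return -1; }  /* out of range */

    *code_point = cp;
    return count;
}
```

A caller would walk a string by advancing the pointer by the return value
until the buffer is exhausted or -1 comes back.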

It compiles plenty fast: it's just two .c files (one of which includes the
other) and one .h file.  The Makefile isn't portable, but I see that they've
taken pains to accommodate MSVC in the C files, so they'll probably compile
everywhere that Lucy does.

I see no problems with bundling utf8proc as a dependency.

Looks like a great find, Nick!

> One downside of bundling a Unicode library is that they all need some  
> rather large tables. utf8proc comes with a 1.2 MB .c file containing the  
> tables. The whole library compiles to about 500 KB. Unicode::Normalize  
> builds its tables from the Unicode database files that come with Perl  
> and compiles to about 300 KB. All this on i386, 32 bit.

With the current state of things, adding 500 KB is unlikely to make a
difference.

On my Mac running Snow Leopard, adding utf8proc pushes the compiled size of
Lucy.bundle from 2.8 MB to 3.3 MB.  The largest compiled objects contributing
to that tally are Lucy.o at 1.2 MB, compiled from Lucy.xs, and parcel.o at
1.2 MB, compiled from autogen/parcel.c, which contains Clownfish OO support
such as vtables.  The Snowball stemmers add around 200 KB, and the Snowball
stoplists around 100 KB.

If we put our minds to it, we could slim down Lucy.o and parcel.o -- maybe by
a lot, since nobody's ever bothered to work on optimizing for space.  The same
wouldn't hold true for utf8proc (or the Snowball materials).  But that just
doesn't seem important given all the other reasons utf8proc looks like a good
fit.

> On the positive side, we'd have things like case folding, normalization  
> and accent stripping directly in core.

It would be great to support accent stripping in Lucy -- that's something a
lot of people need.  Normalization would also be a nice feature to offer.
Maybe we should make it the first step of PolyAnalyzer or PolyAnalyzer's
replacement?

It would also be great to migrate Lucy::Analysis::CaseFolder code away from
its dependency on the Perl C API.

> We'd also get Unicode features for new host languages out of the box and
> it's the only way to make sure Unicode is handled consistently across
> different host languages and client platforms. The latter might be a rather
> academic concern, though.

Personally, I don't see cross-host index compatibility as so important that we
ought to make big sacrifices to achieve it.

Regardless, integrating utf8proc seems worthwhile for lots of other reasons. 

+1 from me!

Marvin Humphrey
