(Moving the "Custom analyzers" thread from lucy-user to lucy-dev)
On 15/11/11 05:22, Marvin Humphrey wrote:
> On Tue, Nov 15, 2011 at 02:41:22AM +0100, Nick Wellnhofer wrote:
>> Would it make sense to have all the Unicode functionality in the Lucy
>> core using a third party Unicode library? Or should we rely on the
>> Unicode support of the host language like we do for case folding?
>
> That hinges on the dependability, portability, licensing terms and
> ease-of-integration for this theoretical third party Unicode library.
> Dependencies are cool so long as we can bundle them, they don't take a
> million years to compile, they don't sabotage all the hard work we've
> done to make Lucy portable, etc. (For a longer take on dependencies,
> see <http://markmail.org/message/2zsunkfleqocix67>.)
If all dependencies must be bundled, we can rule out something like ICU
[1] because it's simply too big.
One alternative I did find is utf8proc [2]. It's 20K of C code (not
counting the generated Unicode tables; see below), MIT-licensed, and
already used for Postgres extensions and a Ruby gem. It supports
Unicode normalization, case folding and accent stripping.
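To give an idea of the interface, here's a rough, untested sketch of
how a single utf8proc_map() call could cover all three operations at
once. I'm going by the utf8proc documentation, so take the exact flag
and type names with a grain of salt:

    #include <stdio.h>
    #include <stdlib.h>
    #include "utf8proc.h"

    int main(void) {
        const char *input = "Éléphant";   /* UTF-8 input */
        uint8_t *folded = NULL;

        /* Normalize to NFKC, case-fold, and strip accents in one
         * pass.  utf8proc_map() malloc()s a NUL-terminated result
         * buffer; a negative return value is an error code.  With
         * UTF8PROC_NULLTERM set, the length argument is ignored. */
        ssize_t len = utf8proc_map(
            (const uint8_t*)input, 0, &folded,
            UTF8PROC_NULLTERM | UTF8PROC_STABLE | UTF8PROC_COMPOSE
            | UTF8PROC_COMPAT | UTF8PROC_CASEFOLD | UTF8PROC_STRIPMARK
        );
        if (len < 0) {
            fprintf(stderr, "utf8proc: %s\n", utf8proc_errmsg(len));
            return 1;
        }
        printf("%s\n", folded);           /* prints "elephant" */
        free(folded);
        return 0;
    }

If that's accurate, wrapping it behind a core-level API wouldn't be
much work.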
Then there's the Perl module Unicode::Normalize, which offers very
similar functionality. But I'm not sure whether the Perl license is
compatible with the Apache License.
One downside of bundling a Unicode library is that they all need some
rather large tables. utf8proc comes with a 1.2 MB .c file containing
the tables, and the whole library compiles to about 500 KB.
Unicode::Normalize builds its tables from the Unicode database files
that ship with Perl and compiles to about 300 KB. All figures are for
32-bit i386.
On the positive side, we'd have things like case folding, normalization
and accent stripping directly in core. We'd also get Unicode features
for new host languages out of the box, and bundling is the only way to
make sure Unicode is handled consistently across different host
languages and client platforms. The latter might be a rather academic
concern, though.
Nick
[1] http://icu-project.org/
[2] http://www.public-software-group.org/utf8proc