(Moving the "Custom analyzers" thread from lucy-user to lucy-dev)

On 15/11/11 05:22, Marvin Humphrey wrote:
> On Tue, Nov 15, 2011 at 02:41:22AM +0100, Nick Wellnhofer wrote:
>> Would it make sense to have all the Unicode functionality in the Lucy
>> core using a third party Unicode library? Or should we rely on the
>> Unicode support of the host language like we do for case folding?
>
> That hinges on the dependability, portability, licensing terms and
> ease-of-integration for this theoretical third party Unicode library.
> Dependencies are cool so long as we can bundle them, they don't take a million
> years to compile, they don't sabotage all the hard work we've done to make
> Lucy portable, etc.  (For a longer take on dependencies, see
> <http://markmail.org/message/2zsunkfleqocix67>.)

If all dependencies must be bundled, we can rule out something like ICU [1] because it's simply too big.

One alternative I found is utf8proc [2]. It's about 20K of C code (not counting its data tables), MIT-licensed, and already used for Postgres extensions and a Ruby gem. It supports Unicode normalization, case folding and stripping of accents.
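To make that concrete, here's a minimal sketch of what a single call to utf8proc's one-shot utf8proc_map() API looks like for those three operations. The flag names are from utf8proc.h; exact spellings may differ between versions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include "utf8proc.h"

    int main(void) {
        const char *input = "R\xC3\xA9sum\xC3\xA9";  /* "Résumé" in UTF-8 */
        uint8_t *output = NULL;

        /* Normalize, case-fold, and strip accents in one pass.
         * UTF8PROC_STRIPMARK drops combining marks and must be combined
         * with UTF8PROC_COMPOSE or UTF8PROC_DECOMPOSE. */
        ssize_t len = utf8proc_map((const uint8_t *)input, 0, &output,
                                   UTF8PROC_NULLTERM | UTF8PROC_STABLE
                                   | UTF8PROC_COMPOSE | UTF8PROC_CASEFOLD
                                   | UTF8PROC_STRIPMARK);
        if (len < 0) {
            fprintf(stderr, "utf8proc error: %s\n", utf8proc_errmsg(len));
            return 1;
        }
        printf("%s\n", output);  /* prints "resume" */
        free(output);            /* buffer is malloc'd by utf8proc */
        return 0;
    }

One bitmask covers everything an analyzer chain would need, which is appealing for a bundled dependency.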

Then there's the Perl module Unicode::Normalize, which offers very similar functionality. But I'm not sure whether Perl's license (it's dual-licensed under the Artistic License and the GPL) is compatible with the Apache License.

One downside of bundling a Unicode library is that such libraries all need rather large data tables. utf8proc comes with a 1.2 MB .c file containing the tables, and the whole library compiles to about 500 KB. Unicode::Normalize builds its tables from the Unicode database files that ship with Perl and compiles to about 300 KB. All of these figures are for 32-bit i386.

On the positive side, we'd have case folding, normalization and accent stripping directly in core. We'd also get Unicode features for new host languages out of the box, and it's the only way to make sure Unicode is handled consistently across different host languages and client platforms. The latter might be a rather academic concern, though.
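To illustrate the "directly in core" point: a single C entry point wrapping utf8proc would be all the host bindings need to share in order to get identical results everywhere. This is purely a hypothetical sketch; the function name and signature are invented for illustration, not actual Lucy API:

    #include <stdint.h>
    #include "utf8proc.h"

    /* Hypothetical core-level helper: returns a freshly malloc'd,
     * NFKC-normalized, case-folded copy of `text`, or NULL on malformed
     * input.  Every host binding would call this instead of rolling its
     * own normalization, so Perl, Ruby, etc. would behave identically. */
    uint8_t*
    Lucy_Normalize_Utf8(const uint8_t *text, ssize_t len, ssize_t *out_len) {
        uint8_t *result = NULL;
        *out_len = utf8proc_map(text, len, &result,
                                UTF8PROC_STABLE | UTF8PROC_COMPOSE
                                | UTF8PROC_COMPAT | UTF8PROC_CASEFOLD);
        if (*out_len < 0) {
            *out_len = 0;
            return NULL;   /* invalid UTF-8 or out of memory */
        }
        return result;     /* caller frees */
    }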

Nick

[1] http://icu-project.org/
[2] http://www.public-software-group.org/utf8proc
