(Moving the "Custom analyzers" thread from lucy-user to lucy-dev)
On 15/11/11 05:22, Marvin Humphrey wrote:
> On Tue, Nov 15, 2011 at 02:41:22AM +0100, Nick Wellnhofer wrote:
>> Would it make sense to have all the Unicode functionality in the Lucy
>> core using a third party Unicode library? Or should we rely on the
>> Unicode support of the host language like we do for case folding?
>
> That hinges on the dependability, portability, licensing terms and
> ease-of-integration for this theoretical third party Unicode library.
> Dependencies are cool so long as we can bundle them, they don't take a
> million years to compile, they don't sabotage all the hard work we've
> done to make Lucy portable, etc. (For a longer take on dependencies,
> see <http://markmail.org/message/2zsunkfleqocix67>.)
If all dependencies must be bundled, we can rule out something like ICU
[1] because it's simply too big.
One alternative I did find is utf8proc [2]. It's 20K of C code (not
counting the generated Unicode tables; see below), MIT-licensed, and
already used for Postgres extensions and a Ruby gem. It supports
Unicode normalization, case folding and accent stripping.
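To give an idea of the interface, here's a rough, untested sketch of
how a single utf8proc_map() call could cover all three operations at
once. I'm going by the utf8proc documentation, so take the exact flag
and type names with a grain of salt:

    #include <stdio.h>
    #include <stdlib.h>
    #include "utf8proc.h"

    int main(void) {
        const char *input = "Éléphant";   /* UTF-8 input */
        uint8_t *folded = NULL;

        /* Normalize to NFKC, case-fold, and strip accents in one
         * pass.  utf8proc_map() malloc()s a NUL-terminated result
         * buffer; a negative return value is an error code.  With
         * UTF8PROC_NULLTERM set, the length argument is ignored. */
        ssize_t len = utf8proc_map(
            (const uint8_t*)input, 0, &folded,
            UTF8PROC_NULLTERM | UTF8PROC_STABLE | UTF8PROC_COMPOSE
            | UTF8PROC_COMPAT | UTF8PROC_CASEFOLD | UTF8PROC_STRIPMARK
        );
        if (len < 0) {
            fprintf(stderr, "utf8proc: %s\n", utf8proc_errmsg(len));
            return 1;
        }
        printf("%s\n", folded);           /* prints "elephant" */
        free(folded);
        return 0;
    }

If that's accurate, wrapping it behind a core-level API wouldn't be
much work.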
Then there's the Perl module Unicode::Normalize, which offers very
similar functionality. But I'm not sure whether the Perl license is
compatible with the Apache License.
One downside of bundling a Unicode library is that they all need some
rather large tables. utf8proc comes with a 1.2 MB .c file containing
the tables, and the whole library compiles to about 500 KB.
Unicode::Normalize builds its tables from the Unicode database files
that ship with Perl and compiles to about 300 KB. All figures are for
32-bit i386.
On the positive side, we'd have things like case folding, normalization
and accent stripping directly in core. We'd also get Unicode features
for new host languages out of the box, and bundling is the only way to
make sure Unicode is handled consistently across different host
languages and client platforms. The latter might be a rather academic
concern, though.
Nick
[1] http://icu-project.org/
[2] http://www.public-software-group.org/utf8proc