Re: [lucy-dev] Unicode integration

Nick Wellnhofer Wed, 16 Nov 2011 14:25:01 -0800

On 16/11/11 04:49, Marvin Humphrey wrote:

> It would be great to support accent stripping in Lucy -- that's something a
> lot of people need.  Normalization would also be a nice feature to offer
> (Maybe we should make it the first step of PolyAnalyzer or PolyAnalyzer's
> replacement?).

Thinking about the implications of Unicode in the analyzer chain, I've
come to the conclusion that the first step should always be
tokenization. In the current implementation the CaseFolder comes first
in the chain by default. But case folding (or lowercasing) can add or
remove Unicode codepoints and mess with the character offsets for the
highlighter. See the attached script for a demonstration.
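
To see the effect in isolation, here is a minimal sketch that uses only
Perl's built-in lc, not Lucy's CaseFolder. It relies on Perl's full case
mapping turning U+0130 into two code points; the variable names are just
for illustration.

```perl
use strict;
use warnings;

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE lowercases to two code
# points: U+0069 followed by U+0307 COMBINING DOT ABOVE.
my $text    = chr(0x0130) x 5 . ' look where the highlight is';
my $lowered = lc $text;

# Offsets are counted in code points, as the highlighter expects.
my $before = index($text,    'highlight');
my $after  = index($lowered, 'highlight');

printf "length: %d -> %d\n", length($text), length($lowered);
printf "offset of 'highlight': %d -> %d\n", $before, $after;
```

An offset computed on the lowercased text no longer points at the same
word in the stored text, which is what goes wrong in the highlighter.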

> It would also be great to migrate Lucy::Analysis::CaseFolder code away from
> its dependency on the Perl C API.

Yes, we could even do proper Unicode case folding, normalization and
accent stripping in one pass with utf8proc. This should be the next step
after tokenization. The stopalizer and stemmers should be safe when
using NFC or NFKC. I think we can leave the choice between these
normalization forms to the user.

If we go with utf8proc, I would propose a new analyzer
Lucy::Analysis::Normalizer with the following interface:


my $normalizer = Lucy::Analysis::Normalizer->new(
    normalization_form => $string,
    case_fold          => $bool,
    strip_accents      => $bool,
);

normalization_form can be one of 'NFC', 'NFKC', 'NFD', 'NFKD'. The
decomposed forms won't play well with other analyzers but could be
easily added for completeness. I'm not sure whether we should default to
NFC or NFKC.

case_fold and strip_accents are simple on/off switches. By default
case_fold is enabled and strip_accents disabled.
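
As a rough illustration of what the three options would mean, here is a
pure-Perl approximation built on the core Unicode::Normalize module.
normalize_token is a hypothetical helper, not the proposed
utf8proc-based implementation: lc stands in for real case folding,
accent stripping is approximated by decomposing and dropping nonspacing
marks, and the NFKC default is an arbitrary pick for the sketch.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC NFD NFKD);

my %FORM = (NFC => \&NFC, NFKC => \&NFKC, NFD => \&NFD, NFKD => \&NFKD);

# Hypothetical helper mirroring the proposed options; not part of Lucy.
sub normalize_token {
    my ($text, %opt) = @_;
    my $form      = $opt{normalization_form} // 'NFKC';  # arbitrary default
    my $normalize = $FORM{$form}
        or die "Unknown normalization form: $form";
    my $case_fold = $opt{case_fold} // 1;                 # on by default

    if ($opt{strip_accents}) {
        # Decompose, then drop nonspacing marks (general category Mn).
        $text = NFD($text);
        $text =~ s/\p{Mn}//g;
    }

    # lc only approximates Unicode case folding.
    $text = lc $text if $case_fold;

    return $normalize->($text);
}

# U+FB01 is the "fi" ligature; NFKC maps it to plain ASCII letters.
print normalize_token("\x{FB01}nd", normalization_form => 'NFKC'), "\n";
```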


The default analyzer chain would be tokenize, normalize, stem.
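
A small sketch of why this order keeps the highlighter happy, in plain
Perl rather than Lucy's analyzer classes. The \w+ tokenizer, the hash
layout for tokens, and the use of lc as the normalization step are all
assumptions made for the example; stemming is left out.

```perl
use strict;
use warnings;

my $text = chr(0x0130) x 5 . ' look where the highlight is';

# Step 1: tokenize the untouched text, recording code point offsets.
my @tokens;
while ($text =~ /(\w+)/g) {
    push @tokens, { text => $1, start => $-[1], end => $+[1] };
}

# Step 2: normalize each token's text; the recorded offsets stay as-is.
$_->{text} = lc $_->{text} for @tokens;

# The offsets still refer to the original text, so the match can be
# cut out of it directly.
my ($hit) = grep { $_->{text} eq 'highlight' } @tokens;
print substr($text, $hit->{start}, $hit->{end} - $hit->{start}), "\n";
```

Because the offsets are captured before any code points are added or
removed, later steps can change token text freely.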

Lucy::Analysis::CaseFolder could then be implemented as a subclass of
Lucy::Analysis::Normalizer for compatibility.

Further idea: implement a simple and fast tokenizer in core based on the
Unicode character class table provided with utf8proc.


Nick

#!perl
use strict;
use warnings;

use Lucy;

my $schema   = Lucy::Plan::Schema->new;
my $analyzer = Lucy::Analysis::PolyAnalyzer->new(language => 'en');

$schema->spec_field(
    name => 'text',
    type => Lucy::Plan::FullTextType->new(
        analyzer      => $analyzer,
        highlightable => 1,
    ),
);

my $indexer = Lucy::Index::Indexer->new(
    index    => 'lucy_unicode_bug_index',
    schema   => $schema,
    create   => 1,
    truncate => 1,
);

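$indexer->add_doc({
    # U+0130 (capital I with dot above) lowercases to two code points,
    # so each one shifts the offsets after case folding by one.
    text => chr(0x0130) x 5 . ' look where the highlight is',
});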

$indexer->commit;

my $searcher = Lucy::Search::IndexSearcher->new(
    index => 'lucy_unicode_bug_index',
);

my $query = Lucy::Search::TermQuery->new(
    field => 'text',
    term  => 'highlight',
);

my $highlighter = Lucy::Highlight::Highlighter->new(
    searcher => $searcher,
    query    => $query,
    field    => 'text'
);

my $hits = $searcher->hits(query => $query);

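# The excerpt may contain non-ASCII characters, so encode the output.
binmode(STDOUT, ':encoding(UTF-8)');

while (my $hit = $hits->next) {
    my $excerpt = $highlighter->create_excerpt($hit);

    print("$excerpt\n");
}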
