On 16/11/11 04:49, Marvin Humphrey wrote:
> It would be great to support accent stripping in Lucy -- that's something a
> lot of people need.  Normalization would also be a nice feature to offer
> (Maybe we should make it the first step of PolyAnalyzer or PolyAnalyzer's
> replacement?).

Thinking about the implications of Unicode in the analyzer chain, I've come to the conclusion that the first step should always be tokenization. In the current implementation, the CaseFolder comes first in the chain by default. But case folding (or lowercasing) can add or remove Unicode code points, which shifts the character offsets that the highlighter relies on. See the attached script for a demonstration.
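
To show the effect in isolation, without Lucy, here is a minimal sketch in plain Perl. It uses the same text as the attached script; the helper name offset_of is only for illustration, and Perl's built-in lc stands in for whatever the CaseFolder does internally.

```perl
use strict;
use warnings;

# Illustrative helper: character offset of $needle in $text, or -1 if absent.
sub offset_of {
    my ($text, $needle) = @_;
    return index($text, $needle);
}

# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE.
my $original = chr(0x0130) x 5 . ' look where the highlight is';

# Lowercasing before tokenizing: each U+0130 becomes U+0069 U+0307.
my $folded = lc $original;

printf "length: %d -> %d\n", length($original), length($folded);
printf "offset of 'highlight': %d -> %d\n",
    offset_of($original, 'highlight'),
    offset_of($folded,   'highlight');
```

Offsets computed on the lowercased text no longer line up with the stored original, so an excerpt built from them marks the wrong span.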

> It would also be great to migrate Lucy::Analysis::CaseFolder code away from
> its dependency on the Perl C API.

Yes, we could even do proper Unicode case folding, normalization and accent stripping in one pass with utf8proc. This should be the next step after tokenization. The stopalizer and stemmers should be safe when using NFC or NFKC. I think we can leave the choice between these normalization forms to the user.

If we go with utf8proc, I would propose a new analyzer Lucy::Analysis::Normalizer with the following interface:

my $normalizer = Lucy::Analysis::Normalizer->new(
    normalization_form => $string,
    case_fold          => $bool,
    strip_accents      => $bool,
);

normalization_form can be one of 'NFC', 'NFKC', 'NFD', 'NFKD'. The decomposed forms won't play well with other analyzers but could be easily added for completeness. I'm not sure whether we should default to NFC or NFKC.

case_fold and strip_accents are simple on/off switches. By default case_fold is enabled and strip_accents disabled.
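
To make the intended semantics of the three options concrete, here is a rough pure-Perl sketch of what such a normalizer might do to a single token. It is not the proposed implementation: it uses the core Unicode::Normalize module instead of utf8proc, the function name normalize_token is hypothetical, lc only approximates real Unicode case folding, and the NFKC default is an arbitrary pick, since the default is still undecided.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC NFKD);

# Map the option value to the matching Unicode::Normalize function.
my %FORMS = (NFC => \&NFC, NFD => \&NFD, NFKC => \&NFKC, NFKD => \&NFKD);

# Hypothetical helper mirroring the proposed constructor options.
sub normalize_token {
    my ($text, %opt) = @_;
    my $form      = $opt{normalization_form} // 'NFKC';  # arbitrary default
    my $case_fold = $opt{case_fold}          // 1;       # on by default
    my $strip     = $opt{strip_accents}      // 0;       # off by default
    my $normalize = $FORMS{$form}
        or die "Unknown normalization form: $form";

    $text = $normalize->($text);
    $text = lc $text if $case_fold;   # approximation of case folding
    if ($strip) {
        # Decompose, drop nonspacing marks, then renormalize below.
        $text = NFD($text);
        $text =~ s/\p{Mn}+//g;
    }
    return $normalize->($text);
}

# Example calls: an accented word, and a ligature under NFKC.
my $plain    = normalize_token("Caf\x{e9}", strip_accents => 1);
my $ligature = normalize_token("\x{FB01}", normalization_form => 'NFKC');
```

Running this per token after tokenization leaves the token offsets untouched, which is the point of the proposed ordering.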

The default analyzer chain would be tokenize, normalize, stem.

Lucy::Analysis::CaseFolder could then be implemented as a subclass of Lucy::Analysis::Normalizer for compatibility.

Further idea: implement a simple and fast tokenizer in core based on the Unicode character class table provided with utf8proc.
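
As a rough illustration of the tokenizer idea, here is a sketch in plain Perl. It relies on Perl's own Unicode properties (\p{L}, \p{M}, \p{N}) rather than the utf8proc tables, and the function name tokenize_with_offsets and the token hash layout are made up for this example. The offsets are taken from the original text before any case folding happens.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into runs of letters, marks and digits,
# recording each token's character offsets in the ORIGINAL text.
sub tokenize_with_offsets {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /([\p{L}\p{M}\p{N}]+)/g) {
        push @tokens, {
            text  => $1,
            start => $-[1],   # offset of the first character of the match
            end   => $+[1],   # offset just past the last character
        };
    }
    return \@tokens;
}

my $text   = chr(0x0130) x 5 . ' look where the highlight is';
my $tokens = tokenize_with_offsets($text);

# Normalize the token text only; start/end still refer to $text.
$_->{text} = lc $_->{text} for @$tokens;

binmode(STDOUT, ':encoding(UTF-8)');
printf "%-12s %2d-%2d\n", $_->{text}, $_->{start}, $_->{end} for @$tokens;
```

Even though the first token grows from five to ten code points when lowercased, the recorded offsets of every token still point at the right span of the original text.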

Nick

#!perl
# Demonstration: case folding before tokenization shifts highlight offsets.
use strict;
use warnings;

use Lucy;

my $schema   = Lucy::Plan::Schema->new;
my $analyzer = Lucy::Analysis::PolyAnalyzer->new(language => 'en');

$schema->spec_field(
    name => 'text',
    type => Lucy::Plan::FullTextType->new(
        analyzer      => $analyzer,
        highlightable => 1,
    ),
);

my $indexer = Lucy::Index::Indexer->new(
    index    => 'lucy_unicode_bug_index',
    schema   => $schema,
    create   => 1,
    truncate => 1,
);

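# U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) lowercases to two code
# points (U+0069 U+0307), so the five leading characters grow to ten.
$indexer->add_doc({
    text => chr(0x0130) x 5 . ' look where the highlight is',
});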

$indexer->commit;

my $searcher = Lucy::Search::IndexSearcher->new(
    index => 'lucy_unicode_bug_index',
);

my $query = Lucy::Search::TermQuery->new(
    field => 'text',
    term  => 'highlight',
);

my $highlighter = Lucy::Highlight::Highlighter->new(
    searcher => $searcher,
    query    => $query,
    field    => 'text'
);

my $hits = $searcher->hits(query => $query);

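# The excerpt contains non-ASCII characters, so print it as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

while (my $hit = $hits->next) {
    my $excerpt = $highlighter->create_excerpt($hit);

    print("$excerpt\n");
}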
