On 16/11/11 04:49, Marvin Humphrey wrote:
> It would be great to support accent stripping in Lucy -- that's something a
> lot of people need. Normalization would also be a nice feature to offer
> (Maybe we should make it the first step of PolyAnalyzer or PolyAnalyzer's
> replacement?).
Thinking about the implications of Unicode in the analyzer chain, I've
come to the conclusion that the first step should always be
tokenization. In the current implementation the CaseFolder comes first
in the chain by default. But case folding (or lowercasing) can change
the number of Unicode code points in the text, which throws off the
character offsets the highlighter relies on. See the attached script
for a demonstration.
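The effect can also be reproduced without Lucy. The following is a
minimal sketch, assuming Perl's built-in lc() as a stand-in for the
CaseFolder; the variable names are only for illustration. U+0130
lowercases to two code points (U+0069 U+0307), so everything after it
moves.

```perl
use strict;
use warnings;

# Same text as in the attached script: five U+0130 characters, then ASCII.
my $original = chr(0x0130) x 5 . ' look where the highlight is';

# Stand-in for case folding applied before tokenization.
my $folded = lc $original;

# Offsets of the search term, counted in code points.
my $offset_in_original = index($original, 'highlight');
my $offset_in_folded   = index($folded,   'highlight');

printf "length: %d -> %d\n", length($original), length($folded);
printf "offset of 'highlight': %d -> %d\n",
    $offset_in_original, $offset_in_folded;
```

Offsets computed on the folded text no longer point at the same
characters in the stored original, which is why tokenizing first seems
safer.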
> It would also be great to migrate Lucy::Analysis::CaseFolder code away from
> its dependency on the Perl C API.
Yes, we could even do proper Unicode case folding, normalization and
accent stripping in one pass with utf8proc. This should be the next step
after tokenization. The stopalizer and stemmers should be safe when
using NFC or NFKC. I think we can leave the choice between these
normalization forms to the user.
If we go with utf8proc, I would propose a new analyzer
Lucy::Analysis::Normalizer with the following interface:
    my $normalizer = Lucy::Analysis::Normalizer->new(
        normalization_form => $string,
        case_fold          => $bool,
        strip_accents      => $bool,
    );
normalization_form can be one of 'NFC', 'NFKC', 'NFD', 'NFKD'. The
decomposed forms won't play well with other analyzers but could be
easily added for completeness. I'm not sure whether we should default to
NFC or NFKC.
case_fold and strip_accents are simple on/off switches. By default
case_fold is enabled and strip_accents disabled.
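To make the intended semantics concrete, here is a rough pure-Perl
sketch of what such a normalizer might do to a single token. It uses the
core Unicode::Normalize module instead of utf8proc, lc() instead of
proper Unicode case folding, and a hypothetical function name
normalize_token; the NFKC default is an arbitrary choice for the
example, since the default is still open.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC NFKD);

my %FORMS = (NFC => \&NFC, NFKC => \&NFKC, NFD => \&NFD, NFKD => \&NFKD);

# Hypothetical helper mirroring the three proposed options.
sub normalize_token {
    my ($text, %args) = @_;
    my $form          = $args{normalization_form} // 'NFKC';  # assumed default
    my $case_fold     = $args{case_fold}          // 1;
    my $strip_accents = $args{strip_accents}      // 0;
    my $normalize = $FORMS{$form}
        or die "Unknown normalization form: $form";

    # Ensure Unicode semantics for lc() on characters below U+0100.
    utf8::upgrade($text);

    # Approximation: lowercasing rather than full case folding.
    $text = lc $text if $case_fold;

    if ($strip_accents) {
        # Decompose, then drop nonspacing combining marks.
        $text = NFD($text);
        $text =~ s/\p{Mn}//g;
    }

    # Apply the requested normalization form last.
    return $normalize->($text);
}

my $plain    = normalize_token("\x{C9}COLE");
my $stripped = normalize_token("\x{C9}cole", strip_accents => 1);
```

utf8proc could do all of this in a single pass; the separate steps here
are only meant to show the effect of each option.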
The default analyzer chain would be tokenize, normalize, stem.
Lucy::Analysis::CaseFolder could then be implemented as a subclass of
Lucy::Analysis::Normalizer for compatibility.
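Until such a Normalizer exists, the tokenize-first ordering can be
approximated with the existing analyzers. This is only a sketch: it
assumes the class names of recent Lucy releases (RegexTokenizer,
SnowballStemmer) and uses CaseFolder where the proposed Normalizer
would go.

```perl
use strict;
use warnings;
use Lucy;

# Tokenize first, so token offsets refer to the unmodified text.
my $tokenizer   = Lucy::Analysis::RegexTokenizer->new;
# Stand-in for the proposed Normalizer step.
my $case_folder = Lucy::Analysis::CaseFolder->new;
my $stemmer     = Lucy::Analysis::SnowballStemmer->new(language => 'en');

my $analyzer = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [ $tokenizer, $case_folder, $stemmer ],
);
```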
Further idea: implement a simple and fast tokenizer in core based on the
Unicode character class table provided with utf8proc.
Nick
#!perl
use strict;
use warnings;
use Lucy;

# The excerpt may contain non-ASCII characters, so emit UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

my $schema = Lucy::Plan::Schema->new;

# The default English PolyAnalyzer runs the CaseFolder first.
my $analyzer = Lucy::Analysis::PolyAnalyzer->new(language => 'en');
$schema->spec_field(
    name => 'text',
    type => Lucy::Plan::FullTextType->new(
        analyzer      => $analyzer,
        highlightable => 1,
    ),
);

my $indexer = Lucy::Index::Indexer->new(
    index    => 'lucy_unicode_bug_index',
    schema   => $schema,
    create   => 1,
    truncate => 1,
);

# U+0130 lowercases to two code points, so the five leading characters
# shift every later offset in the case-folded text.
$indexer->add_doc({
    text => chr(0x0130) x 5 . ' look where the highlight is',
});
$indexer->commit;

my $searcher = Lucy::Search::IndexSearcher->new(
    index => 'lucy_unicode_bug_index',
);
my $query = Lucy::Search::TermQuery->new(
    field => 'text',
    term  => 'highlight',
);
my $highlighter = Lucy::Highlight::Highlighter->new(
    searcher => $searcher,
    query    => $query,
    field    => 'text',
);

# Print the excerpt for each hit; the highlight lands in the wrong place.
my $hits = $searcher->hits(query => $query);
while (my $hit = $hits->next) {
    my $excerpt = $highlighter->create_excerpt($hit);
    print("$excerpt\n");
}