On Tue, Jul 12, 2011 at 05:54:35PM +0200, Jens Krämer wrote:
> On 12.07.2011, at 09:39, arjan wrote:
> > What you could do to match words with and without accents is adding an
> > extra field for the content without accents. There are perl modules
> > available to replace accented characters. This is called "normalization
> > form d".
>
> Wouldn't doing so break the highlighting of matching terms because the hit
> for 'cafe' then would occur in the normalized field, but not in the 'main'
> field that most probably would be used for showing the excerpt?
Yes, that's right.
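
For reference, the shadow-field approach arjan describes would look something
like this at index time (a sketch with invented field names, where $indexer
and $doc stand in for your Lucy indexer and document hash; note that after
NFD you also have to strip the combining marks, or 'café' still won't equal
'cafe'):

    use Unicode::Normalize qw( NFD );

    # Decompose to NFD, then drop the combining marks, so that
    # 'café' goes into the accent-free shadow field as 'cafe'.
    sub strip_accents {
        my $text = NFD(shift);
        $text =~ s/\pM//g;    # \pM matches Unicode combining marks
        return $text;
    }

    $indexer->add_doc({
        content       => $doc->{content},
        content_plain => strip_accents( $doc->{content} ),
    });
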
> I don't know Lucy (yet ;-) but I've done lots of work with Lucene and
> Ferret, and there I usually normalize accented characters (and german
> umlauts) with a special token filter that's part of a custom analyzer.
This is technically possible, though Lucy's Analyzer subclassing API has been
temporarily redacted in anticipation of refactoring, and so is not officially
supported or documented. With that caveat, a filter along these lines ought to
do it:
    package NormDAnalyzer;
    use base qw( Lucy::Analysis::Analyzer );
    use Unicode::Normalize qw( normalize );

    sub transform {
        my ($self, $inversion) = @_;
        # Rewrite each token's text as Unicode Normalization Form D,
        # decomposing accented characters into base char + combining mark.
        while (my $token = $inversion->next) {
            $token->set_text(normalize('D', $token->get_text));
        }
        $inversion->reset;
        return $inversion;
    }

    1;
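
Wired in after case folding, as Jens suggests below, usage might look
something like this (an untested sketch; it assumes the inherited Analyzer
constructor still works despite the redaction):

    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Analysis::PolyAnalyzer;
    use Lucy::Analysis::RegexTokenizer;
    use Lucy::Analysis::CaseFolder;

    my $analyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [
            Lucy::Analysis::RegexTokenizer->new,
            Lucy::Analysis::CaseFolder->new,
            NormDAnalyzer->new,
        ],
    );

    my $schema = Lucy::Plan::Schema->new;
    $schema->spec_field(
        name => 'content',
        type => Lucy::Plan::FullTextType->new( analyzer => $analyzer ),
    );

Note that NFD by itself only decomposes 'é' into 'e' plus a combining acute;
to make 'cafe' actually match, transform() would also need to strip the
marks, as in the strip_accents sketch above.
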
> Imho treating 'é' like 'e' should be no harder than treating 'E' like 'e',
> so I'd say going where your tokens are being downcased and hooking in there
> to additionally perform more normalizations should be the way to go. But as
> I said, I have no idea if and how this is possible in Lucy...
I agree. I'd just been putting off the Big Discussion about the Analysis
chain ;) ... and I'd somehow missed that highlighting was part of Grant's
requirements.
Good catch, Jens.
Marvin Humphrey