On Tue, Jul 12, 2011 at 05:54:35PM +0200, Jens Krämer wrote:
> On 12.07.2011, at 09:39, arjan wrote:
> > What you could do to match words with and without accents is adding an
> > extra field for the content without accents. There are perl modules
> > available to replace accented characters. This is called "normalization
> > form d".
>
> Wouldn't doing so break the highlighting of matching terms because the hit
> for 'cafe' then would occur in the normalized field, but not in the 'main'
> field that most probably would be used for showing the excerpt?
Yes, that's right.
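
For reference, the shadow-field approach arjan describes would look something
like this at index time (a sketch with invented field names, where $indexer
and $doc stand in for your Lucy indexer and document hash; note that after
NFD you also have to strip the combining marks, or 'café' still won't equal
'cafe'):

    use Unicode::Normalize qw( NFD );

    # Decompose to NFD, then drop the combining marks, so that
    # 'café' goes into the accent-free shadow field as 'cafe'.
    sub strip_accents {
        my $text = NFD(shift);
        $text =~ s/\pM//g;    # \pM matches Unicode combining marks
        return $text;
    }

    $indexer->add_doc({
        content       => $doc->{content},
        content_plain => strip_accents( $doc->{content} ),
    });
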
> I don't know Lucy (yet ;-) but I've done lots of work with Lucene and
> Ferret, and there I usually normalize accented characters (and german
> umlauts) with a special token filter that's part of a custom analyzer.
This is technically possible, though Lucy's Analyzer subclassing API has been
temporarily redacted in anticipation of refactoring, and so is not officially
supported or documented. With that caveat, a filter along these lines ought to
do it:
    package NormDAnalyzer;
    use base qw( Lucy::Analysis::Analyzer );
    use Unicode::Normalize qw( normalize );

    sub transform {
        my ($self, $inversion) = @_;
        # Rewrite each token's text as Unicode Normalization Form D,
        # decomposing accented characters into base char + combining mark.
        while (my $token = $inversion->next) {
            $token->set_text(normalize('D', $token->get_text));
        }
        $inversion->reset;
        return $inversion;
    }

    1;
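
Wired in after case folding, as Jens suggests below, usage might look
something like this (an untested sketch; it assumes the inherited Analyzer
constructor still works despite the redaction):

    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Analysis::PolyAnalyzer;
    use Lucy::Analysis::RegexTokenizer;
    use Lucy::Analysis::CaseFolder;

    my $analyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [
            Lucy::Analysis::RegexTokenizer->new,
            Lucy::Analysis::CaseFolder->new,
            NormDAnalyzer->new,
        ],
    );

    my $schema = Lucy::Plan::Schema->new;
    $schema->spec_field(
        name => 'content',
        type => Lucy::Plan::FullTextType->new( analyzer => $analyzer ),
    );

Note that NFD by itself only decomposes 'é' into 'e' plus a combining acute;
to make 'cafe' actually match, transform() would also need to strip the
marks, as in the strip_accents sketch above.
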
> Imho treating 'é' like 'e' should be no harder than treating 'E' like 'e',
> so I'd say going where your tokens are being downcased and hooking in there
> to additionally perform more normalizations should be the way to go. But as
> I said, I have no idea if and how this is possible in Lucy...
I agree. I'd just been putting off the Big Discussion about the Analysis
chain ;) ... and I'd somehow missed that highlighting was part of Grant's
requirements.
Good catch, Jens.
Marvin Humphrey