Hi!

On 12.07.2011, at 09:39, arjan wrote:
>
> What you could do to match words with and without accents is adding an extra
> field for the content without accents. There are Perl modules available to
> replace accented characters. This is called "normalization form D".
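[Editor's sketch: the unaccenting arjan describes is Unicode NFD decomposition followed by stripping combining marks. Shown here in Python for illustration; in Perl one would reach for a module such as Unicode::Normalize.]

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD ("normalization form D") decomposes each accented character
    # into its base letter plus combining marks: 'é' -> 'e' + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn), keeping the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("café"))    # -> cafe
print(strip_accents("Krämer"))  # -> Kramer
```

Note that this folds German umlauts to the bare vowel ('ä' -> 'a'); the conventional German transliteration 'ä' -> 'ae' would need a separate mapping applied before decomposition.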
Wouldn't doing so break the highlighting of matching terms? The hit for
'cafe' would then occur in the normalized field, but not in the 'main' field
that would most probably be used for showing the excerpt.

I don't know Lucy (yet ;-), but I've done lots of work with Lucene and
Ferret, and there I usually normalize accented characters (and German
umlauts) with a special token filter that's part of a custom analyzer. IMHO
treating 'é' like 'e' should be no harder than treating 'E' like 'e', so I'd
say finding where your tokens are being downcased and hooking in there to
perform additional normalizations should be the way to go. But as I said, I
have no idea if and how this is possible in Lucy...

Cheers,
Jens

> On 12-07-11 07:28, Grant McLean wrote:
>> On Sun, 2011-07-10 at 22:47 -0700, Marvin Humphrey wrote:
>>> On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
[..]
>>
>> The final issue I'd like to tackle is the handling of accents. Ideally
>> I'd like to be able to treat 'cafe' and 'café' as equivalent. The user
>> should be able to type a query with or without the accent, match
>> documents with or without the accent, and have the excerpt highlighting
>> pick up words with or without the accent. I would prefer not to have
>> the search results and excerpts lacking accents if they are present in
>> the source document. Is this dream scenario possible? Perhaps with
>> synonyms? Can anyone suggest an approach?
>>
>> Thanks
>> Grant

-- 
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/
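[Editor's sketch: Jens's suggestion, hooking accent folding into the analyzer chain right next to downcasing, can be illustrated generically. This is not real Lucy, Lucene, or Ferret API, just a stand-in filter chain showing why applying the same normalization at index and query time keeps highlighting consistent over a single field.]

```python
import unicodedata

def fold_token(token: str) -> str:
    # Sits where the lowercasing filter would: downcase, then strip
    # accents via NFD decomposition, exactly like treating 'E' as 'e'.
    token = token.lower()
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def analyze(text: str) -> list[str]:
    # Stand-in for a tokenizer + filter chain; real analyzers expose
    # hooks for each stage, and the same chain runs on queries too.
    return [fold_token(t) for t in text.split()]

print(analyze("Café CAFÉ cafe"))  # -> ['cafe', 'cafe', 'cafe']
```

Because 'café' and 'cafe' index to the same term, a query for either matches both, and since only one field is involved, excerpt highlighting can map hits back to the original accented text.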
