Re: [Ferret-talk] Indexing and searching across multiple locales

Andreas Korth Fri, 03 Nov 2006 13:51:23 -0800

These are very good questions indeed. I'm afraid I don't have the  
answers but I'd like to add some questions and remarks of my own and  
hope someone will eventually provide some insight.

On 02.11.2006, at 23:57, Chris Gansen wrote:

> I'm currently investigating support for Ferret and content that  
> spans multiple locales. I am particularly interested in using  
> stemming and fuzzy searches (e.g. with slop factor) across multiple  
> locales.
>
> So far I've followed the online docs for implementing a Stemming  
> Analyzer, and it is working for English terms just fine. I've also  
> written a method to import data from the legacy XML files and save  
> as ActiveRecord objects (using AAF). However, I'm not certain that  
> the locale-switching is working properly:
>
>     doc = Document.import_from_xml(filename)
>     # locale_id is "en.UTF-8" or "fr.UTF-8", for example
>     Ferret::locale = doc.locale_id
>     doc.save

I don't think setting the locale has any effect on already created  
StemFilters and StopFilters, so the above code doesn't change anything.

According to the docs the locale setting doesn't even affect the  
default stop words or stemming algorithms used when creating a new  
StopFilter or StemFilter, respectively. The default language is  
English in both cases, no matter what the current locale is.

This leads me to the ultimate question: What is the locale setting  
good for anyway? Could it be that only the character encoding portion  
of the locale string is actually relevant?

> What's the best way to handle the import of data, where locale is  
> changing from document to document? What other considerations  
> should I keep in mind when using Ferret across multiple locales?

From what I have observed, you'll need to create different Analyzers  
with a StemFilter and StopFilter explicitly created for the  
respective locale.
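
A minimal sketch of such per-locale analyzers follows. It assumes the 
Ferret API as documented: the FULL_ENGLISH_STOP_WORDS, 
FULL_FRENCH_STOP_WORDS and FULL_GERMAN_STOP_WORDS constants in 
Ferret::Analysis, and StemFilter.new(stream, algorithm, encoding). The 
names LOCALE_LANGUAGES, analysis_settings_for and LocaleAnalyzer are 
hypothetical and not part of Ferret; the fallback to English for unknown 
languages is likewise an assumption.

```ruby
begin
  require 'ferret'
rescue LoadError
  # Ferret is not installed; only the pure-Ruby lookup below is usable.
end

# Hypothetical mapping from the language part of a locale id to a
# stemmer algorithm name and the name of a Ferret stop-word constant.
LOCALE_LANGUAGES = {
  'en' => { stemmer: 'english', stop_words: :FULL_ENGLISH_STOP_WORDS },
  'fr' => { stemmer: 'french',  stop_words: :FULL_FRENCH_STOP_WORDS },
  'de' => { stemmer: 'german',  stop_words: :FULL_GERMAN_STOP_WORDS }
}.freeze

# Splits e.g. "fr.UTF-8" into language and encoding and looks up the
# matching settings. Unknown languages fall back to English (assumption).
def analysis_settings_for(locale_id)
  language, encoding = locale_id.to_s.split('.', 2)
  settings = LOCALE_LANGUAGES.fetch(language, LOCALE_LANGUAGES['en'])
  settings.merge(encoding: encoding || 'UTF-8')
end

if defined?(Ferret)
  # One analyzer instance per locale, each with explicitly created filters.
  class LocaleAnalyzer < Ferret::Analysis::Analyzer
    include Ferret::Analysis

    def initialize(locale_id)
      @settings = analysis_settings_for(locale_id)
    end

    def token_stream(field, text)
      stop_words = Ferret::Analysis.const_get(@settings[:stop_words])
      stream = LowerCaseFilter.new(StandardTokenizer.new(text))
      stream = StopFilter.new(stream, stop_words)
      StemFilter.new(stream, @settings[:stemmer], @settings[:encoding])
    end
  end
end
```

With something like this, the import code would pick the analyzer from 
doc.locale_id (for example LocaleAnalyzer.new(doc.locale_id)) rather than 
relying on Ferret::locale. Queries would presumably need to be analyzed 
with the analyzer of the same locale, otherwise stemmed index terms and 
query terms will not match.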

I don't know about French but the German stemming algorithm is very  
inaccurate. Stemming algorithms for the English language are probably  
easier to implement, since German and French have more complex rules  
and lots of exceptions. But even the English stemming algorithm seems  
to be entirely rule-based and thus fails on irregular verbs. I think  
it might be a good idea to provide a facility to extend the stemmer,  
very much like the inflection rules can be extended in Rails.
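
To illustrate the idea, here is a rough sketch of an exception table 
that is consulted before a rule-based stemmer runs. It is plain Ruby and 
does not use any Ferret API: ExtensibleStemmer and its irregular method 
are hypothetical names, loosely modelled on the irregular rule of the 
Rails inflections, and the suffix-stripping lambda is only a naive 
stand-in for a real stemming algorithm.

```ruby
# Hypothetical wrapper: consults a table of irregular forms before
# falling back to a rule-based stemmer supplied by the caller.
class ExtensibleStemmer
  def initialize(&rule_based_stemmer)
    @rule_based_stemmer = rule_based_stemmer
    @irregulars = {}
  end

  # Registers an irregular form and the stem it should map to.
  # Returns self so that calls can be chained.
  def irregular(form, stem)
    @irregulars[form.downcase] = stem
    self
  end

  # Looks the word up in the irregular table first; only if it is not
  # listed there is the rule-based stemmer applied.
  def stem(word)
    key = word.downcase
    @irregulars.fetch(key) { @rule_based_stemmer.call(key) }
  end
end

# Deliberately naive stand-in for a real rule-based stemmer: it only
# strips a few common English suffixes.
naive = ->(word) { word.sub(/(ing|ed|s)\z/, '') }

stemmer = ExtensibleStemmer.new(&naive)
stemmer.irregular('went', 'go').irregular('mice', 'mouse')

stemmer.stem('walked')  # handled by the rule-based fallback
stemmer.stem('went')    # handled by the irregular table
```

Hooking a table like this into Ferret itself would need a custom 
TokenStream or support in the StemFilter, which is not shown here.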

Cheers,
Andy
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
