These are very good questions indeed. I'm afraid I don't have the answers but I'd like to add some questions and remarks of my own and hope someone will eventually provide some insight.
On 02.11.2006, at 23:57, Chris Gansen wrote:

> I'm currently investigating support for Ferret and content that
> spans multiple locales. I am particularly interested in using
> stemming and fuzzy searches (e.g. with slop factor) across multiple
> locales.
>
> So far I've followed the online docs for implementing a Stemming
> Analyzer, and it is working for English terms just fine. I've also
> written a method to import data from the legacy XML files and save
> as ActiveRecord objects (using AAF). However, I'm not certain the
> locale-switching is working properly:
>
>   doc = Document.import_from_xml(filename)
>   Ferret::locale = doc.locale_id # locale_id is "en.UTF-8" or "fr.UTF-8" for example
>   doc.save

I don't think setting the locale has any effect on already created StemFilters and StopFilters, so the above code doesn't change anything. According to the docs, the locale setting doesn't even affect the default stop words or stemming algorithm used when creating a new StopFilter or StemFilter, respectively. The default language is English in both cases, no matter what the current locale is.

This leads me to the ultimate question: What is the locale setting good for anyway? Could it be that only the character encoding portion of the locale string is actually relevant?

> What's the best way to handle the import of data, where locale is
> changing from document to document? What other considerations
> should I keep in mind when using Ferret across multiple locales?

From what I have observed, you'll need to create a separate Analyzer for each locale, each with a StemFilter and StopFilter explicitly created for that locale.

I don't know about French, but the German stemming algorithm is very inaccurate. Stemming algorithms for the English language are probably easier to implement, since German and French have more complex rules and lots of exceptions. But even the English stemming algorithm seems to be entirely rule-based and thus fails on irregular verbs.
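To make the "one Analyzer per locale" idea a bit more concrete, here is a minimal sketch in plain Ruby. It only shows the selection and caching logic. The names STEMMER_LANGUAGES, LocaleSettings, locale_settings and AnalyzerRegistry are my own inventions, not part of Ferret, and the fallback to English and UTF-8 is an assumption. The analyzer itself is built by a block you supply, so the sketch does not depend on any particular Ferret call.

```ruby
# Sketch: choose an analyzer per document locale instead of relying on a
# global locale switch. Pure Ruby; the analyzer is built by a
# caller-supplied block, so nothing here requires Ferret to be loaded.

# Hypothetical mapping from the language part of a locale id to a
# stemmer language name. Extend as needed.
STEMMER_LANGUAGES = {
  "en" => "english",
  "fr" => "french",
  "de" => "german"
}.freeze

LocaleSettings = Struct.new(:language, :encoding)

# Splits "fr.UTF-8" into language "french" and encoding "UTF-8".
# Unknown languages fall back to English and a missing encoding falls
# back to UTF-8 (both fallbacks are assumptions of this sketch).
def locale_settings(locale_id)
  code, encoding = locale_id.to_s.split(".", 2)
  code = code.to_s.split("_").first.to_s.downcase # tolerate "fr_FR.UTF-8"
  LocaleSettings.new(STEMMER_LANGUAGES.fetch(code, "english"), encoding || "UTF-8")
end

# Builds each analyzer once and reuses it for later documents.
class AnalyzerRegistry
  def initialize(&builder)
    raise ArgumentError, "builder block required" unless builder
    @builder = builder
    @cache = {}
  end

  def for_locale(locale_id)
    settings = locale_settings(locale_id)
    @cache[[settings.language, settings.encoding]] ||= @builder.call(settings)
  end
end

registry = AnalyzerRegistry.new do |settings|
  # Stand-in for building a real analyzer whose StopFilter and StemFilter
  # are created explicitly for settings.language.
  "analyzer(#{settings.language}, #{settings.encoding})"
end

puts registry.for_locale("fr.UTF-8")
puts registry.for_locale("en.UTF-8")
```

In a real import, the block would return an Analyzer whose StopFilter and StemFilter are constructed for settings.language, assuming those filters accept a language-specific stop word list and algorithm name as the docs suggest, and each document would be indexed with registry.for_locale(doc.locale_id).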
I think it might be a good idea to provide a facility to extend the stemmer, very much like the inflection rules can be extended in Rails.

Cheers,
Andy
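P.S. As a rough illustration of what such an extension facility might look like, here is a toy stemmer in plain Ruby with an exception table that is consulted before the rules. Everything in it is hypothetical: the class name ExtensibleStemmer, the irregular method and the suffix rules are made up for this example and are deliberately naive. It is not the algorithm Ferret uses.

```ruby
# Toy rule-based stemmer with a user-extensible exception table.
# The suffix rules are intentionally simplistic; they only show where
# irregular forms slip through and how an override could catch them.
class ExtensibleStemmer
  SUFFIX_RULES = [
    [/ies\z/, "y"],
    [/ing\z/, ""],
    [/ed\z/,  ""],
    [/s\z/,   ""]
  ].freeze

  def initialize
    @irregular = {}
  end

  # Registers irregular forms, e.g. irregular("go", "went", "gone").
  def irregular(stem, *forms)
    forms.each { |form| @irregular[form.downcase] = stem.downcase }
    self
  end

  def stem(word)
    word = word.downcase
    # Exceptions win over the generic rules.
    return @irregular[word] if @irregular.key?(word)
    SUFFIX_RULES.each do |pattern, replacement|
      # Leave very short words alone so "bus" is not cut down to "bu".
      return word.sub(pattern, replacement) if word.length > 3 && word =~ pattern
    end
    word
  end
end

stemmer = ExtensibleStemmer.new
puts stemmer.stem("walked") # regular form, handled by the rules
puts stemmer.stem("went")   # no rule applies, so it is returned unchanged
stemmer.irregular("go", "went", "gone")
puts stemmer.stem("went")   # now resolved through the exception table
```

The point of the design is that the exception table is data, so users could add entries per language without touching the stemming rules themselves.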

