On Sep 21, 2006, at 10:20 PM, David Balmain wrote:

> On 9/22/06, Francis Hwang <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> We're using Ferret in a slightly unorthodox way: We're indexing a
>> large (>100,000) list of names of places all around the world. Mostly
>> we're quite happy with it, and have been able to graft on our own
>> particular required functionality with just a little tweaking.
>>
>> There's one strange problem, though: We've got a place in Cyprus
>> called "Gazima\304\237usa" (that \304\237 is a multibyte character in
>> UTF-8), and it matches a search for "usa". We'd rather it not match.
>> I don't know that much about Ferret or about this sort of indexing in
>> general, but is this because Ferret views \304\237 as a word break,
>> and splits the name into two words? If so, is there a way you'd
>> recommend to get around this -- keeping in mind that we've got names
>> in romanized forms of many different languages?
>>
>> Thanks in advance,
>>
>> Francis
>
> Hi Francis,
>
> It is because Ferret sees that as a word break. This must be either
> because you are using an ASCII Analyzer (which I doubt) or your locale
> isn't set to handle UTF-8. You can set your locale like this:
>
>     ENV['LANG'] = 'en_US.utf8'
>
> Or use whatever locale your data is stored as. Let me know if that
> helps.
>
> Cheers,
> Dave
>
> PS: if not all your data is UTF-8 you may need to convert it. In that
> case you should check out Ruby's iconv standard library.

I tried that and it made no difference. The data is in UTF-8 already.
As for the analyzer, we're just using the StandardAnalyzer. (I actually
don't know much about what the different analyzers do, at any rate.)

Any other ideas?

Francis

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
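
For anyone following the thread, the symptom can be modeled in plain Ruby without Ferret. The sketch below is an illustration only and makes no claim about how Ferret's analyzers are implemented: `ascii_tokens` and `unicode_tokens` are hypothetical helpers, and it assumes a current Ruby with Unicode-aware regular expressions. A tokenizer that only recognizes ASCII letters treats the two bytes \304\237 (the UTF-8 encoding of U+011F) as separators, so "usa" becomes a term of its own; a tokenizer that recognizes Unicode letters keeps the name whole.

```ruby
# Illustration only: these helpers are hypothetical and are NOT Ferret's tokenizer.
# "\304\237" in the message above is the UTF-8 byte pair for U+011F.
name = "Gazima\u011Fusa"

# Byte-oriented model: anything that is not an ASCII letter ends the current token,
# so the two bytes of the multibyte character act as a word break.
def ascii_tokens(text)
  text.b.scan(/[A-Za-z]+/).map(&:downcase)
end

# Unicode-aware model: \p{L} matches any letter, including U+011F,
# so the whole name is a single token.
def unicode_tokens(text)
  text.scan(/\p{L}+/).map(&:downcase)
end

puts ascii_tokens(name).include?("usa")    # true:  "usa" is indexed as its own term
puts unicode_tokens(name).include?("usa")  # false: the name stays in one piece
```

If the index behaves like the first helper even with a UTF-8 locale set, that suggests the analyzer in use is not seeing the multibyte character as a letter, which is the behavior the question describes.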

