[Ferret-talk] strange matching: maybe a multilanguage collation problem?

Francis Hwang Thu, 21 Sep 2006 16:47:17 -0700

Hi,

We're using Ferret in a slightly unorthodox way: We're indexing a  
large (>100,000) list of names of places all around the world. Mostly  
we're quite happy with it, and have been able to graft on our own  
particular required functionality with just a little tweaking.


There's one strange problem, though: We've got a place in Cyprus  
called "Gazima\304\237usa" (that \304\237 is a multibyte character in  
UTF-8), and it matches a search for "usa". We'd rather it not match.  
I don't know much about Ferret or about this sort of indexing in
general, but is this because Ferret treats \304\237 as a word break
and splits the name into two words? If so, is there a way you'd
recommend to get around this -- keeping in mind that we've got names  
in romanized forms of many different languages?
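
To make the suspicion concrete, here is a minimal sketch in plain Python (not Ferret) of how a tokenizer that only accepts ASCII letters would behave on this name. Both functions are hypothetical stand-ins written for illustration; I don't know that Ferret's analyzers actually work this way, and the only assumption carried over from above is that \304\237 is the two-byte UTF-8 encoding of U+011F.

```python
import re

# The place name from above; \u011f is the character whose UTF-8
# encoding is the byte pair written as \304\237 (octal).
name = "Gazima\u011fusa"
assert name.encode("utf-8") == b"Gazima\304\237usa"


def ascii_byte_tokens(text):
    """Hypothetical byte-oriented tokenizer: only ASCII letters are word
    characters, so any other byte acts as a word break."""
    raw = text.encode("utf-8")
    return [t.decode("ascii").lower() for t in re.findall(rb"[A-Za-z]+", raw)]


def unicode_tokens(text):
    """Hypothetical Unicode-aware tokenizer: any letter, ASCII or not,
    is a word character."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


print(ascii_byte_tokens(name))  # the name is split at the multibyte character
print(unicode_tokens(name))     # the name stays a single token
```

If something like the first function is what happens at index time, "usa" ends up as a term of its own, which would explain the match; the second function shows the behavior we would like instead.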

Thanks in advance,

Francis
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
