On 9/22/06, Francis Hwang <[EMAIL PROTECTED]> wrote:
> Hi,
>
> We're using Ferret in a slightly unorthodox way: We're indexing a
> large (>100,000) list of names of places all around the world. Mostly
> we're quite happy with it, and have been able to graft on our own
> particular required functionality with just a little tweaking.
>
> There's one strange problem, though: We've got a place in Cyprus
> called "Gazima\304\237usa" (that \304\237 is a multibyte character in
> UTF-8), and it matches a search for "usa". We'd rather it not match.
> I don't know that much about Ferret or about this sort of indexing in
> general, but is this because Ferret views \304\237 as a word break,
> and splits the name into two words? If so, is there a way you'd
> recommend to get around this -- keeping in mind that we've got names
> in romanized forms of many different languages?
>
> Thanks in advance,
>
> Francis
Hi Francis,
It is because Ferret sees that as a word break. This must be either
because you are using an ASCII Analyzer (which I doubt) or because your
locale isn't set to handle UTF-8. You can set your locale like this:
ENV['LANG'] = 'en_US.utf8'
Or use whichever locale matches the encoding your data is stored in.
Let me know if that helps.
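
To illustrate the word-break effect without involving Ferret itself, here is a minimal sketch in plain Ruby. The two helpers, ascii_tokens and utf8_tokens, are hypothetical stand-ins written for this example; they are not Ferret's analyzers and only approximate the difference between byte-oriented and UTF-8 aware tokenizing.

```ruby
# The place name from the question, written with the same octal escapes.
# \304\237 are the two UTF-8 bytes of a single letter (U+011F).
name = "Gazima\304\237usa".dup.force_encoding(Encoding::UTF_8)

# Hypothetical stand-in for an ASCII-only tokenizer: look at the raw
# bytes and keep only runs of ASCII letters, so any non-ASCII byte
# behaves like a word break.
def ascii_tokens(text)
  text.b.scan(/[A-Za-z]+/).map(&:downcase)
end

# Hypothetical stand-in for a UTF-8 aware tokenizer: a token is a run
# of Unicode letters, so the multibyte letter stays inside the word.
def utf8_tokens(text)
  text.scan(/\p{L}+/).map(&:downcase)
end

p ascii_tokens(name)  # two tokens, the second one is "usa"
p utf8_tokens(name)   # one token, the whole place name
```

With the byte-oriented version, a search for "usa" has a token to match; with the UTF-8 aware version it does not.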
Cheers,
Dave
PS: if not all your data is UTF-8 you may need to convert it. In that
case you should check out Ruby's iconv standard library.
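
As a rough sketch of that conversion step: the example below uses String#encode rather than iconv, because iconv is no longer bundled with Ruby from 2.0 onwards. The source encoding (ISO-8859-9) and the helper name to_utf8 are assumptions made for illustration; substitute whatever encoding your data actually uses.

```ruby
# Hypothetical helper: reinterpret raw bytes as the given source
# encoding, then transcode them to UTF-8.
def to_utf8(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# Assumed input: the same place name stored as ISO-8859-9 bytes, where
# the single byte 0xF0 is the letter that UTF-8 writes as \304\237.
legacy = "Gazima\xF0usa".b

converted = to_utf8(legacy, Encoding::ISO_8859_9)

p converted.encoding         # the Encoding object for UTF-8
p converted.valid_encoding?  # true when the bytes are well-formed UTF-8
p converted.bytes.length     # one byte longer than the legacy string
```

If the source encoding is guessed wrongly, encode may still succeed but produce the wrong letters, so it is worth spot-checking a few converted names before re-indexing.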