On Sep 21, 2006, at 10:20 PM, David Balmain wrote:

> On 9/22/06, Francis Hwang <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> We're using Ferret in a slightly unorthodox way: We're indexing a
>> large (>100,000) list of names of places all around the world. Mostly
>> we're quite happy with it, and have been able to graft on our own
>> particular required functionality with just a little tweaking.
>>
>> There's one strange problem, though: We've got a place in Cyprus
>> called "Gazima\304\237usa" (that \304\237 is a multibyte character in
>> UTF-8), and it matches a search for "usa". We'd rather it not match.
>> I don't know that much about Ferret or about this sort of indexing in
>> general, but is this because Ferret views \304\237 as a word break,
>> and splits the name into two words? If so, is there a way you'd
>> recommend to get around this -- keeping in mind that we've got names
>> in romanized forms of many different languages?
>>
>> Thanks in advance,
>>
>> Francis
>
> Hi Francis,
>
> It is because Ferret sees that as a word break. This must be either
> because you are using an ASCII Analyzer (which I doubt) or your locale
> isn't set to handle UTF-8. You can set your locale like this:
>
>     ENV['LANG'] = 'en_US.utf8'
>
> Or use whatever locale matches the encoding your data is stored in.
> Let me know if that helps.
>
> Cheers,
> Dave
>
> PS: if not all your data is UTF-8 you may need to convert it. In that
> case you should check out Ruby's iconv standard library.

I tried that and it made no difference. The data is in UTF-8 already.
As for the analyzer, we're just using the StandardAnalyzer. (I don't
actually know much about what the different analyzers do, at any
rate.) Any other ideas?

Francis
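
For reference, here is a minimal sketch of the symptom described above. It does not use Ferret at all: two plain Ruby regexes stand in for an analyzer's tokenizer, which is an assumption made purely for illustration, and it assumes a modern Ruby (1.9 or later) with UTF-8 strings. It shows how a tokenizer that recognizes only ASCII letters splits the name at the multibyte character, while a Unicode-aware one keeps it whole.

```ruby
# Illustration only: plain regexes, not Ferret's actual tokenizer.
# U+011F is the character whose UTF-8 bytes are \304\237.
name = "Gazima\u011Fusa"

# ASCII-only letter class: the multibyte character is not treated as a
# letter, so it acts as a word break and "usa" becomes its own token.
ascii_tokens = name.downcase.scan(/[a-z]+/)

# Unicode-aware letter class: the whole name stays a single token.
unicode_tokens = name.downcase.scan(/\p{L}+/)

p ascii_tokens
p unicode_tokens
```

If the index behaves like the first case, a search for "usa" will match this place name; if it behaves like the second, it will not.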

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
