Re: [Ferret-talk] Ferret DRB, UTF-8, Mongrel

Jens Kraemer Thu, 20 Sep 2007 13:32:42 -0700

Hi!

This is really strange. Are you sure the DRb server runs in a proper
UTF-8 environment, just as your test cases do?
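
One quick way to compare the two environments is to print the locale
variables from inside each Ruby process. The sketch below uses only the
standard library. The helper names (locale_report, effective_ctype,
utf8_locale?) are made up for illustration and are not part of Ferret or
acts_as_ferret, and it assumes the usual POSIX precedence of LC_ALL over
LC_CTYPE over LANG.

```ruby
# Report the locale-related settings a Ruby process actually sees.
# Run it once from the shell used for the tests and once from the
# script that launches the DRb server, then compare the output.
LOCALE_VARS = %w[LC_ALL LC_CTYPE LANG].freeze

# Returns [name, value] pairs for each locale variable (value may be nil).
def locale_report(env = ENV)
  LOCALE_VARS.map { |name| [name, env[name]] }
end

# Assumed POSIX precedence: the first non-empty variable wins.
def effective_ctype(env = ENV)
  LOCALE_VARS.map { |name| env[name] }.find { |v| v && !v.empty? }
end

# True when the effective setting names a UTF-8 locale.
def utf8_locale?(env = ENV)
  value = effective_ctype(env)
  !!(value && value =~ /utf-?8/i)
end

locale_report.each { |name, value| puts "#{name}=#{value.inspect}" }
puts "UTF-8 locale: #{utf8_locale?}"
```

If the test run reports a UTF-8 locale and the server's start script does
not (for example because it is launched with a minimal environment), that
difference would be the first thing to rule out.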


Jens

On Thu, Sep 20, 2007 at 08:01:48PM +0200, Brandon Kelly wrote:
> I have spent days trying to figure out how to get UTF-8 working with my
> site.
> 
> Here's my environment:
> 
> Linux version 2.6.16.29-xen_3.0.3.0
> Ruby 1.8.4 (2005-12-24) [i386-linux]
> Rails 1.2.3
> mongrel (1.0.1)
> mongrel_cluster (1.0.2, 0.2.1)
> ferret (0.11.4)
> acts_as_ferret stable plugin
> Ferret DRB server
> 
> 
> When I don't use an analyzer with my acts_as_ferret declaration,
> everything works fine.  However, I can't expect users to enter "Álex
> Rodríguez" when searching; they're going to type "alex rodriguez" (or
> some variation of his name, which I handle using a fuzzy search).
> 
> So I pass an analyzer in my acts_as_ferret declaration:
> 
> 
> acts_as_ferret({ :fields => {:first_name => {:store => :no},
>                               :last_name => {:store => :no},
>                               :db_state => {:index => :untokenized_omit_norms,
>                                             :term_vector => :no}},
>                   :remote => true}, {:analyzer => UtfAnalyzer.new})
> 
> 
> Here's the analyzer I'm using... pretty much taken from here:
> http://ferret.davebalmain.com/api/classes/Ferret/Analysis/MappingFilter.html
> 
> -----
> class UtfAnalyzer < Ferret::Analysis::Analyzer
>   include Ferret::Analysis
>   CHARACTER_MAPPINGS = {
>     ['à','á','â','ã','ä','å','ā','ă','ą']     => 'a',
>     'æ'                                       => 'ae',
>     ['ď','đ']                                 => 'd',
>     ['ç','ć','č','ĉ','ċ']                     => 'c',
>     ['è','é','ê','ë','ē','ę','ě','ĕ','ė']     => 'e',
>     ['ƒ']                                     => 'f',
>     ['ĝ','ğ','ġ','ģ']                         => 'g',
>     ['ĥ','ħ']                                 => 'h',
>     ['ì','í','î','ï','ī','ĩ','ĭ','į','ı']     => 'i',
>     'ĳ'                                       => 'ij',
>     ['ĵ']                                     => 'j',
>     ['ķ','ĸ']                                 => 'k',
>     ['ł','ľ','ĺ','ļ','ŀ']                     => 'l',
>     ['ñ','ń','ň','ņ','ŉ','ŋ']                 => 'n',
>     ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ']     => 'o',
>     ['œ']                                     => 'oe',
>     ['ŕ','ř','ŗ']                             => 'r',
>     ['ś','š','ş','ŝ','ș']                     => 's',
>     ['ť','ţ','ŧ','ț']                         => 't',
>     ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų'] => 'u',
>     ['ŵ']                                     => 'w',
>     ['ý','ÿ','ŷ']                             => 'y',
>     ['ž','ż','ź']                             => 'z'
>   }
> 
>   def token_stream(field, str)
>     MappingFilter.new(StandardTokenizer.new(str), CHARACTER_MAPPINGS)
>   end
> 
> end
> 
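To check the mapping table independently of Ferret and the DRb server,
the same idea can be sketched in plain Ruby. This is not how
MappingFilter is implemented; it is only a stand-in that assumes Ruby 1.9
or later with UTF-8 source encoding. SAMPLE_MAPPINGS is a shortened,
hypothetical subset of the table above, and flatten_mappings and
fold_accents are names invented for this example.

```ruby
# Plain-Ruby sketch of accent folding; no Ferret required.
# Hypothetical, shortened subset of the analyzer's mapping table.
SAMPLE_MAPPINGS = {
  ['á', 'à', 'â', 'ä'] => 'a',
  ['é', 'è', 'ê', 'ë'] => 'e',
  ['í', 'ì', 'î', 'ï'] => 'i',
  ['ó', 'ò', 'ô', 'ö'] => 'o',
  ['ú', 'ù', 'û', 'ü'] => 'u',
  'ñ'                  => 'n'
}

# Expand grouped keys into a flat per-character lookup table.
def flatten_mappings(mappings)
  flat = {}
  mappings.each do |keys, replacement|
    Array(keys).each { |key| flat[key] = replacement }
  end
  flat
end

FLAT = flatten_mappings(SAMPLE_MAPPINGS)
PATTERN = Regexp.union(FLAT.keys)

# Replace every mapped character; everything else passes through.
def fold_accents(text)
  text.gsub(PATTERN) { |match| FLAT[match] }
end

puts fold_accents("Rodríguez")  # prints "Rodriguez"
puts fold_accents("Álex")       # prints "Álex": uppercase Á is not in the table
```

Note that this table, like the one in the analyzer above, lists only
lowercase characters, so an uppercase "Á" passes through unchanged unless
the text is lowercased before the mapping is applied.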
> I think Ferret is working fine... because when I run some tests, the
> mapping filter correctly pulls out the accented characters... exactly as
> it should.
> 
> However, when something is persisted via the model (acts_as_ferret and
> DRB server), I get unexpected behavior...
> 
>  - using a model with ONE field declared in acts_as_ferret, and a string
> with accented characters -- I can search for it as expected, with either
> accented or non-accented characters, and I get the results returned;
> however, I don't get any other results for the non-accented records.
> ONLY the accented records get returned when searching.
> 
>  - using a model with multiple fields defined (as in the Player model
> above) -- nothing gets returned: neither accented nor non-accented
> records, nor any combination.
> 
> My ferret_server.log file shows characters that are very different from
> the accented characters I'm trying to search on...
> 
> Search entered in form: Álex Rodríguez
> ferret_server.log: Ãlex rodrÃguez
> 
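The log line above looks like what you get when UTF-8 bytes are decoded
as ISO-8859-1 somewhere along the way. Here is a minimal sketch that
reproduces the effect, assuming Ruby 1.9 or later (String#force_encoding
and String#encode do not exist on 1.8); misread_as_latin1 is a
hypothetical helper name.

```ruby
# Reinterpret the UTF-8 bytes of a string as ISO-8859-1, then convert
# the result back to UTF-8 so it can be printed or logged.
def misread_as_latin1(utf8_text)
  utf8_text.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

garbled = misread_as_latin1("Álex Rodríguez")
puts garbled.inspect

# Each two-byte accented letter turns into "Ã" followed by a second
# character that most terminals do not display.
puts garbled.length  # prints 16 (the original has 14 characters)
```

If that matches what the server logs, the string is probably intact on
the Rails side and is being misinterpreted at or after the point where
it reaches the server process.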
> Not sure why this is occurring, but I've also redisplayed the submitted
> text on a web page and it displays correctly.  This leads me to believe
> that Ruby/Rails is successfully getting the information, and that the HTML
> page encoding is correct, along with environment variables, etc.  As I
> stated earlier, my Ferret test takes the string "Rodríguez" and returns
> token["Rodriguez":0:10:1], demonstrating that the UtfAnalyzer works fine
> outside of acts_as_ferret.
> 
> So any help here would be much appreciated.
> 
> Thanks,
> 
> Brandon
> -- 
> Posted via http://www.ruby-forum.com/.
> _______________________________________________
> Ferret-talk mailing list
> [email protected]
> http://rubyforge.org/mailman/listinfo/ferret-talk
-- 
Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database
