Hi,

I'm working on a Ferret-based application that indexes content in all
European languages, so I have to deal with accented European
characters.

After googling a bit, I decided to go with a custom European
analyzer based on MappingFilter, as suggested in the Ferret rdoc.
Everything works fine with Ferret 0.11.3 on Mac OS X.

But this application needs to run on both Windows and Mac OS X. Since
there's no mswin32 gem for 0.11.3, I decided to downgrade to 0.10.9 and
replace MappingFilter with a custom-made filter as suggested by David in
the following post.

http://www.ruby-forum.com/topic/85299#156036

See the code I wrote at the bottom of this post. The token streams
produced by this analyzer work fine in unit tests, but the indexer fails
to use them when a document is added. Here's the stack trace I get (on
Mac OS X):

 wrong argument type Ferret::Analysis::ToASCIIFilter (expected Data)
 /usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:277:in `text='
 /usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:277:in `add_document'
 /usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:277:in `<<'
 /usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
 /usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:252:in `<<'

I tried several variants of the code (such as avoiding super and
inheritance), but none of them worked.

Therefore, I'm wondering whether 0.11.3 will be available on Windows
soon, whether I can build the gem myself (I guess I'll need a Microsoft
C compiler), or whether I can do things differently to get a European
analyzer with 0.10.9.

Thanks for your help.
Laurent

---------------------------------------------------------------------------
require 'ferret'
require 'jcode'

module Ferret::Analysis

    ACCENTUATED_CHARS =
'àáâãäåāăçćčĉċďđèéêëēęěĕėĝğġģĥħììíîïīĩĭįıĳĵķĸłľĺļŀñńňņŉŋòóôõöøōőŏŏąŕřŗśšşŝșťţŧțùúûüūůűŭũųŵýÿŷžżź'
    REPLACEMENT_CHARS =
'aaaaaaaacccccddeeeeeeeeegggghhiiiiiiiijjjjkklllllnnnnnnooooooooooqrrrsssssttttuuuuuuuuuuwyyyzzz'


    MAPPING = {
      ['à','á','â','ã','ä','å','ā','ă']         => 'a',
      'æ'                                       => 'ae',
      ['ď','đ']                                 => 'd',
      ['ç','ć','č','ĉ','ċ']                     => 'c',
      ['è','é','ê','ë','ē','ę','ě','ĕ','ė']     => 'e',
      ['ƒ']                                     => 'f',
      ['ĝ','ğ','ġ','ģ']                         => 'g',
      ['ĥ','ħ']                                 => 'h',
      ['ì','ì','í','î','ï','ī','ĩ','ĭ']         => 'i',
      ['į','ı','ĳ','ĵ']                         => 'j',
      ['ķ','ĸ']                                 => 'k',
      ['ł','ľ','ĺ','ļ','ŀ']                     => 'l',
      ['ñ','ń','ň','ņ','ŉ','ŋ']                 => 'n',
      ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ','ŏ'] => 'o',
      ['œ']                                     => 'oe',
      ['ą']                                     => 'q',
      ['ŕ','ř','ŗ']                             => 'r',
      ['ś','š','ş','ŝ','ș']                     => 's',
      ['ť','ţ','ŧ','ț']                         => 't',
      ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų'] => 'u',
      ['ŵ']                                     => 'w',
      ['ý','ÿ','ŷ']                             => 'y',
      ['ž','ż','ź']                             => 'z'
    }

  class TokenFilter < TokenStream
    # Construct a token stream filtering the given input.
    def initialize(input)
      @input = input
    end
  end

  # Replace accented characters with their ASCII equivalents.
  class ToASCIIFilter < TokenFilter

    def next()
      token = @input.next()
      unless token.nil?
        token.text = token.text.tr(ACCENTUATED_CHARS, REPLACEMENT_CHARS)
      end
      token
    end
  end

  class EuropeanAnalyzer < StandardAnalyzer

    def token_stream(field, string)
      if defined?(MappingFilter)
        return MappingFilter.new(super, MAPPING) # 0.11.x
      else
        return ToASCIIFilter.new(super) # 0.10.x
      end
    end
  end
end
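
For comparison, the character folding itself can be checked without Ferret. The following is only a minimal sketch, assuming Ruby 1.9 or later (where strings are encoding-aware and jcode is not needed). SAMPLE_MAPPING is a shortened excerpt of the table above, and flatten_mapping and fold_to_ascii are hypothetical helper names, not part of Ferret.

```ruby
# -*- coding: utf-8 -*-
# Hypothetical helpers, not part of Ferret: fold accented characters to ASCII
# in plain Ruby. Assumes Ruby 1.9+ and a UTF-8 source file.

# Shortened excerpt of a MAPPING-style table; keys may be arrays or strings.
SAMPLE_MAPPING = {
  ['à', 'á', 'â', 'ã', 'ä', 'å'] => 'a',
  'æ'                            => 'ae',
  ['ç', 'ć', 'č']                => 'c',
  ['è', 'é', 'ê', 'ë']           => 'e',
  ['ñ', 'ń']                     => 'n',
  ['ò', 'ó', 'ô', 'õ', 'ö', 'ø'] => 'o',
  ['ù', 'ú', 'û', 'ü']           => 'u'
}

# Expand array keys so that every source character gets its own entry.
def flatten_mapping(mapping)
  flat = {}
  mapping.each do |sources, replacement|
    Array(sources).each { |source| flat[source] = replacement }
  end
  flat
end

FLAT_MAPPING = flatten_mapping(SAMPLE_MAPPING)
FOLD_PATTERN = Regexp.union(FLAT_MAPPING.keys)

# Replace every mapped character; unmapped characters pass through unchanged.
def fold_to_ascii(text)
  text.gsub(FOLD_PATTERN) { |match| FLAT_MAPPING[match] }
end
```

Something like fold_to_ascii could be applied to field values before indexing, and to query strings before searching, as a stopgap on 0.10.9. Whether that is acceptable depends on whether the stored text has to keep its accents.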
