Hi, I'm working on a Ferret-based application which indexes content in all European languages. Thus, I have to deal with those funny European characters.
After googling a bit, I decided to go with a custom European analyzer based on MappingFilter, as suggested in the Ferret rdoc. Everything works fine with Ferret 0.11.3 on Mac OS X. However, this application needs to run on both Windows and Mac OS X. Since there is no mswin32 gem for 0.11.3, I downgraded to 0.10.9 and replaced MappingFilter with a custom-made filter, as suggested by David in the following post:

http://www.ruby-forum.com/topic/85299#156036

The code I wrote is at the bottom of this post. The token streams produced by this analyzer work fine in unit tests, but the indexer fails to use them when a document is added. Here is the stack trace I get (on Mac OS X):

    wrong argument type Ferret::Analysis::ToASCIIFilter (expected Data)
    /usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:277:in `text='
    /usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:277:in `add_document'
    /usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:277:in `<<'
    /usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
    /usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:252:in `<<'

I tried several variants of the code (such as avoiding super and inheritance), but never with success. Therefore, I'm wondering whether 0.11.3 will be available for Windows soon, whether I can build the gem myself (I guess I'll need a Microsoft C compiler), or whether I can do things differently to get a European analyzer with 0.10.9.

Thanks for your help.
Laurent

---------------------------------------------------------------------------

require 'ferret'
require 'jcode'

module Ferret::Analysis
  # Character-to-character table used by ToASCIIFilter (0.10.x).
  # The two strings must stay aligned position by position.
  ACCENTUATED_CHARS = 'àáâãäåāăçćčĉċďđèéêëēęěĕėĝğġģĥħììíîïīĩĭįıĳĵķĸłľĺļŀñńňņŉŋòóôõöøōőŏŏąŕřŗśšşŝșťţŧțùúûüūůűŭũųŵýÿŷžżź'
  REPLACEMENT_CHARS = 'aaaaaaaacccccddeeeeeeeeegggghhiiiiiiiijjjjkklllllnnnnnnooooooooooarrrsssssttttuuuuuuuuuuwyyyzzz'

  # Mapping handed to MappingFilter (0.11.x).
  MAPPING = {
    ['à','á','â','ã','ä','å','ā','ă'] => 'a',
    'æ' => 'ae',
    ['ď','đ'] => 'd',
    ['ç','ć','č','ĉ','ċ'] => 'c',
    ['è','é','ê','ë','ē','ę','ě','ĕ','ė'] => 'e',
    ['ƒ'] => 'f',
    ['ĝ','ğ','ġ','ģ'] => 'g',
    ['ĥ','ħ'] => 'h',
    ['ì','ì','í','î','ï','ī','ĩ','ĭ'] => 'i',
    ['į','ı','ĳ','ĵ'] => 'j',
    ['ķ','ĸ'] => 'k',
    ['ł','ľ','ĺ','ļ','ŀ'] => 'l',
    ['ñ','ń','ň','ņ','ŉ','ŋ'] => 'n',
    ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ','ŏ'] => 'o',
    ['œ'] => 'oe',
    ['ą'] => 'a',
    ['ŕ','ř','ŗ'] => 'r',
    ['ś','š','ş','ŝ','ș'] => 's',
    ['ť','ţ','ŧ','ț'] => 't',
    ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų'] => 'u',
    ['ŵ'] => 'w',
    ['ý','ÿ','ŷ'] => 'y',
    ['ž','ż','ź'] => 'z'
  }

  class TokenFilter < TokenStream
    # Construct a token stream filtering the given input.
    def initialize(input)
      @input = input
    end
  end

  # Replace accented chars with their ASCII equivalents.
  class ToASCIIFilter < TokenFilter
    def next()
      token = @input.next()
      unless token.nil?
        token.text = token.text.tr(ACCENTUATED_CHARS, REPLACEMENT_CHARS)
      end
      token
    end
  end

  class EuropeanAnalyzer < StandardAnalyzer
    def token_stream(field, string)
      if defined?(MappingFilter)
        return MappingFilter.new(super, MAPPING) # 0.11.x
      else
        return ToASCIIFilter.new(super) # 0.10.x
      end
    end
  end
end
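
The character folding in the listing above can also be exercised on its own, outside Ferret, which helps separate mapping mistakes from the indexer problem. The following is only a minimal sketch, and it assumes Ruby 1.9 or later (encoding-aware strings, no jcode), not the 1.8 setup from the stack trace. FOLD_MAP, FLAT_MAP, FOLD_PATTERN and fold_accents are hypothetical names used for illustration; none of them are part of Ferret, and the table is a trimmed-down subset of MAPPING.

```ruby
# encoding: utf-8

# Hypothetical subset of the mapping, in the same shape as MAPPING:
# a key is either a single character or an array of characters.
FOLD_MAP = {
  ['à', 'á', 'â', 'ã', 'ä', 'å'] => 'a',
  'æ'                            => 'ae',
  ['ç', 'ć', 'č']                => 'c',
  ['è', 'é', 'ê', 'ë']           => 'e',
  ['ñ', 'ń']                     => 'n',
  ['ö', 'ø']                     => 'o',
  'œ'                            => 'oe',
  ['ù', 'ú', 'û', 'ü']           => 'u'
}

# Expand array keys so that every character gets its own entry.
FLAT_MAP = {}
FOLD_MAP.each do |chars, replacement|
  Array(chars).each { |char| FLAT_MAP[char] = replacement }
end

# Matches any single character that has a replacement.
FOLD_PATTERN = Regexp.union(FLAT_MAP.keys)

# Hypothetical helper: returns a copy of text with mapped characters replaced.
# Characters without an entry (including uppercase ones) are left untouched.
def fold_accents(text)
  text.gsub(FOLD_PATTERN) { |char| FLAT_MAP[char] }
end

puts fold_accents('Señor Müller à cœur')  # prints "Senor Muller a coeur"
```

Unlike tr, which maps one character to exactly one character, a hash lookup through gsub can produce multi-character replacements such as 'æ' => 'ae', so a single table could serve both code paths.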

