On 4/5/07, David Balmain <[EMAIL PROTECTED]> wrote:
> On 4/4/07, Jens Kraemer <[EMAIL PROTECTED]> wrote:
> > On Tue, Apr 03, 2007 at 10:29:49AM -0700, Ryan King wrote:
> > > On 4/3/07, Jens Kraemer <[EMAIL PROTECTED]> wrote:
> > [..]
> > > >
> > > > The funny thing is that this does not necessarily mean that it doesn't
> > > > work as intended. Just for fun I wrote an analyzer that completely
> > > > ignores the input it should analyze, and always uses a fixed text
> > > > instead:
> > > >
> > > > class TestAnalyzer
> > > >   def token_stream field, input
> > > >     ts = LetterTokenizer.new("senseless standard text")
> > > >     puts "token_stream for :#{field} and input <#{input}>: #{ts.inspect}\n #{ts.text}"
> > > >     ts
> > > >   end
> > > > end
> > > >
> > > > a = TestAnalyzer.new
> > > > ts = a.token_stream :test, 'foo bar'
> > > > puts ts.text                           # 'senseless standard text' as expected
> > > >
> > > > pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
> > > > pfa[:test] = TestAnalyzer.new
> > > > ts = pfa.token_stream :test, 'foo bar'
> > > > puts ts.text                           # surprise: 'foo bar'
> > > >
> > > > I guess the pfa does not give the text to analyze via the token_stream
> > > > method, but sets it later by using the Tokenizer's text=() method.
> > >
> > > I don't think so. I've tried overriding #text=, but it never gets called.
> >
> > ok, then it's happening somewhere else - in ferret's analysis.c there's
> > a method a_standard_get_ts that clones an existing token stream instance
> > and calls a method named reset on it, with the text to be tokenized.
> >
> > I guess we'll need Dave's help to sort this out...
>
> Ok, I can see why this is confusing. To show you how it works, try
> this code:
>
>   require 'rubygems'
>   require 'ferret'
>   require 'pp'
>   require 'strscan'
>
>   include Ferret::Analysis
>   include Ferret::Index
>
>   class TestAnalyzer
>     class TestTokenizer
>       def initialize(input)
>         puts "initialize => (#{input})"
>         @input = input
>       end
>       def next()
>         term, @input = @input, nil
>         return term ? Token.new(term, 0, term.size) : nil
>       end
>       def text=(text)
>         puts "reset => (#{text})"
>         @input = text
>       end
>     end
>
>     def token_stream field, input
>       pp field
>       pp input
>       TestTokenizer.new(input)
>     end
>   end
>
>   pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
>   pfa[:test] = TestAnalyzer.new
>   index = Index.new(:analyzer => pfa)
>   index << {:test => 'foo'}
>   index.search_each('bar')
>
> The output is:
>
>   :test
>   ""
>   initialize => ()
>   r_analysis.c, 563: cwrts_reset #<= debugging bug :-0
>   reset => (foo)
>   :test
>   "bar"
>   initialize => (bar)
>
> There is a stray debugging comment in there which I'm embarrassed I
> didn't pick up earlier. But otherwise it should show you what is
> happening. The tokenizer gets created with an empty string and then
> TestTokenizer#text= gets called. This was actually an optimization for
> multi-string fields. For example:
>
>   index << {:test => ['one', 'two', 'three']}
>   # =>
>     initialize => ()
>     reset => (one)
>     reset => (two)
>     reset => (three)
>
> So the tokenizer only needs to be instantiated once and then it gets
> reset for each string. This is a good example of premature optimization,
> particularly since most people will never even have multi-string
> fields like this. Getting rid of this optimization makes things a lot
> clearer. The next version of Ferret will give this output;
>
>   index << {:test => ['one', 'two', 'three']}
>   # =>
>     initialize => (one)
>     initialize => (two)
>     initialize => (three)
>
> So Ryan, you will now get the output you expect. It will require
> updating to Ferret 0.11.4 though. Is there any reason this is a
> problem?

I'm at the point where I need to upgrade for other reasons anyway, so
it shouldn't be a problem.
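
For anyone stuck on an older version in the meantime, the two behaviours
described above (one tokenizer created with "" and reset per string, versus a
fresh tokenizer per string) can be made to produce the same tokens as long as
#initialize and #text= share one code path. Below is a minimal sketch in plain
Ruby. It does not call Ferret at all: SketchToken, ResettableTokenizer,
tokenize_with_reset and tokenize_with_new are hypothetical names made up for
illustration, and the two driver methods only imitate the calling patterns
shown in the output above, not Ferret's actual internals.

```ruby
# Hypothetical stand-in for a token: term text plus start/end offsets.
# This is NOT Ferret's Token class; it only lets the sketch run on its own.
SketchToken = Struct.new(:text, :start_offset, :end_offset)

# Whitespace tokenizer that behaves the same whether it is built fresh
# for each string or built once and reset through #text=.
class ResettableTokenizer
  def initialize(input)
    self.text = input # share the reset path so both entry points agree
  end

  # Reset path: replace the input and rewind the scan position.
  def text=(text)
    @input = text.to_s
    @pos = 0
  end

  # Return the next token, or nil when the input is exhausted.
  def next
    match = @input.match(/\S+/, @pos)
    return nil unless match
    @pos = match.end(0)
    SketchToken.new(match[0], match.begin(0), match.end(0))
  end
end

# Drain a tokenizer into an array of term strings.
def drain(tokenizer)
  terms = []
  while (token = tokenizer.next)
    terms << token.text
  end
  terms
end

# Imitates the older calling pattern: one instance, created with "",
# then reset for each string of a multi-string field.
def tokenize_with_reset(strings)
  tokenizer = ResettableTokenizer.new("")
  strings.flat_map do |s|
    tokenizer.text = s
    drain(tokenizer)
  end
end

# Imitates the newer calling pattern: a fresh instance per string.
def tokenize_with_new(strings)
  strings.flat_map { |s| drain(ResettableTokenizer.new(s)) }
end
```

Calling tokenize_with_reset(['one', 'two', 'three']) and
tokenize_with_new(['one', 'two', 'three']) should give the same list of terms,
which is the property a custom tokenizer needs if it has to survive the
upgrade unchanged.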

Thanks for your help.

-ryan
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
