On 4/5/07, David Balmain <[EMAIL PROTECTED]> wrote:
> On 4/4/07, Jens Kraemer <[EMAIL PROTECTED]> wrote:
> > On Tue, Apr 03, 2007 at 10:29:49AM -0700, Ryan King wrote:
> > > On 4/3/07, Jens Kraemer <[EMAIL PROTECTED]> wrote:
> > [..]
> > > >
> > > > The funny thing is that this does not necessarily mean that it doesn't
> > > > work as intended. Just for fun I wrote an analyzer that completely
> > > > ignores the input it should analyze, and always uses a fixed text
> > > > instead:
> > > >
> > > > class TestAnalyzer
> > > >   def token_stream field, input
> > > >     ts = LetterTokenizer.new("senseless standard text")
> > > >     puts "token_stream for :#{field} and input <#{input}>: #{ts.inspect}\n #{ts.text}"
> > > >     ts
> > > >   end
> > > > end
> > > >
> > > > a = TestAnalyzer.new
> > > > ts = a.token_stream :test, 'foo bar'
> > > > puts ts.text # 'senseless standard text' as expected
> > > >
> > > > pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
> > > > pfa[:test] = TestAnalyzer.new
> > > > ts = pfa.token_stream :test, 'foo bar'
> > > > puts ts.text # surprise: 'foo bar'
> > > >
> > > > I guess the pfa does not give the text to analyze via the token_stream
> > > > method, but sets it later by using the Tokenizer's text=() method.
> > >
> > > I don't think so. I've tried overriding #text=, but it never gets called.
> >
> > ok, then it's happening somewhere else - in ferret's analysis.c there's
> > a method a_standard_get_ts that clones an existing token stream instance
> > and calls a method named reset on it, with the text to be tokenized.
> >
> > I guess we'll need Dave's help to sort this out...
>
> Ok, I can see why this is confusing. To show you how it works, try
> this code:
>
> require 'rubygems'
> require 'ferret'
> require 'pp'
> require 'strscan'
>
> include Ferret::Analysis
> include Ferret::Index
>
> class TestAnalyzer
>   class TestTokenizer
>     def initialize(input)
>       puts "initialize => (#{input})"
>       @input = input
>     end
>     def next()
>       term, @input = @input, nil
>       return term ? Token.new(term, 0, term.size) : nil
>     end
>     def text=(text)
>       puts "reset => (#{text})"
>       @input = text
>     end
>   end
>
>   def token_stream field, input
>     pp field
>     pp input
>     TestTokenizer.new(input)
>   end
> end
>
> pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
> pfa[:test] = TestAnalyzer.new
> index = Index.new(:analyzer => pfa)
> index << {:test => 'foo'}
> index.search_each('bar')
>
> The output is:
>
> :test
> ""
> initialize => ()
> r_analysis.c, 563: cwrts_reset #<= debugging bug :-0
> reset => (foo)
> :test
> "bar"
> initialize => (bar)
>
> There is a stray debugging message in there which I'm embarrassed I
> didn't pick up earlier. But otherwise it should show you what is
> happening. The tokenizer gets created with an empty string and then
> TestTokenizer#text= gets called with the real text. This was actually
> an optimization for multi-string fields. For example:
>
> index << {:test => ['one', 'two', 'three']}
> # =>
> initialize => ()
> reset => (one)
> reset => (two)
> reset => (three)
>
> So the tokenizer only needs to be instantiated once and then it gets
> reset for each string. This is a good example of premature
> optimization, particularly since most people will never even have
> multi-string fields like this. Getting rid of this optimization makes
> things a lot clearer. The next version of Ferret will give this output:
>
> index << {:test => ['one', 'two', 'three']}
> # =>
> initialize => (one)
> initialize => (two)
> initialize => (three)
>
> So Ryan, you will now get the output you expect. It will require
> updating to Ferret 0.11.4 though. Is there any reason this is a
> problem?
I'm at the point where I need to upgrade for other reasons anyway, so
it shouldn't be a problem.
Thanks for your help.
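
For anyone finding this in the archives: the difference between the two
behaviours Dave describes can be sketched in plain Ruby, without Ferret.
This is only an illustration under assumptions. LoggingTokenizer,
tokenize_with_reuse and tokenize_fresh are made-up names, not part of
Ferret's API, and in Ferret itself the reset happens in the C code.

```ruby
# Hypothetical stand-in for a custom tokenizer; it only records which
# of its methods were called, in the style of the output quoted above.
class LoggingTokenizer
  LOG = []

  def initialize(input)
    LOG << "initialize => (#{input})"
    @input = input
  end

  # Swaps in new text without building a new object.
  def text=(text)
    LOG << "reset => (#{text})"
    @input = text
  end

  # Returns the whole input as one term, then nil.
  def next
    term, @input = @input, nil
    term
  end
end

# Behaviour described for the older release (assumed shape): build one
# tokenizer with an empty string, then reset it for each field value.
def tokenize_with_reuse(values)
  ts = LoggingTokenizer.new("")
  values.map do |value|
    ts.text = value
    ts.next
  end
end

# Behaviour promised for the next release (assumed shape): build a new
# tokenizer for each field value, so initialize sees the real text.
def tokenize_fresh(values)
  values.map { |value| LoggingTokenizer.new(value).next }
end

tokenize_with_reuse(%w[one two three])
puts LoggingTokenizer::LOG
# initialize => ()
# reset => (one)
# reset => (two)
# reset => (three)

LoggingTokenizer::LOG.clear
tokenize_fresh(%w[one two three])
puts LoggingTokenizer::LOG
# initialize => (one)
# initialize => (two)
# initialize => (three)
```

The practical upshot for a custom analyzer on the older behaviour seems
to be that token_stream cannot rely on its input argument alone; the
tokenizer it returns also has to honour text=.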
-ryan