On 4/4/07, Jens Kraemer <[EMAIL PROTECTED]> wrote:
> On Tue, Apr 03, 2007 at 10:29:49AM -0700, Ryan King wrote:
> > On 4/3/07, Jens Kraemer <[EMAIL PROTECTED]> wrote:
> [..]
> > >
> > > The funny thing is that this does not necessarily mean that it doesn't
> > > work as intended. Just for fun I wrote an analyzer that completely
> > > ignores the input it should analyze, and always uses a fixed text
> > > instead:
> > >
> > > class TestAnalyzer
> > >   def token_stream field, input
> > >     ts = LetterTokenizer.new("senseless standard text")
> > >     puts "token_stream for :#{field} and input <#{input}>: #{ts.inspect}\n #{ts.text}"
> > >     ts
> > >   end
> > > end
> > >
> > > a = TestAnalyzer.new
> > > ts = a.token_stream :test, 'foo bar'
> > > puts ts.text                           # 'senseless standard text' as expected
> > >
> > > pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
> > > pfa[:test] = TestAnalyzer.new
> > > ts = pfa.token_stream :test, 'foo bar'
> > > puts ts.text                           # surprise: 'foo bar'
> > >
> > > I guess the pfa does not give the text to analyze via the token_stream
> > > method, but sets it later by using the Tokenizer's text=() method.
> >
> > I don't think so. I've tried overriding #text=, but it never gets called.
>
> ok, then it's happening somewhere else - in ferret's analysis.c there's
> a method a_standard_get_ts that clones an existing token stream instance
> and calls a method named reset on it, with the text to be tokenized.
>
> I guess we'll need Dave's help to sort this out...

OK, I can see why this is confusing. To show you how it works,
try this code:

  require 'rubygems'
  require 'ferret'
  require 'pp'
  require 'strscan'

  include Ferret::Analysis
  include Ferret::Index

  class TestAnalyzer
    class TestTokenizer
      def initialize(input)
        puts "initialize => (#{input})"
        @input = input
      end
      def next()
        term, @input = @input, nil
        return term ? Token.new(term, 0, term.size) : nil
      end
      def text=(text)
        puts "reset => (#{text})"
        @input = text
      end
    end

    def token_stream field, input
      pp field
      pp input
      TestTokenizer.new(input)
    end
  end

  pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
  pfa[:test] = TestAnalyzer.new
  index = Index.new(:analyzer => pfa)
  index << {:test => 'foo'}
  index.search_each('bar')

The output is:

  :test
  ""
  initialize => ()
  r_analysis.c, 563: cwrts_reset #<= debugging bug :-0
  reset => (foo)
  :test
  "bar"
  initialize => (bar)

There is a stray debugging message in there which I'm embarrassed I
didn't pick up earlier. Otherwise, it should show you what is
happening. The tokenizer gets created with an empty string and then
TestTokenizer#text= gets called with the actual field text. This was
an optimization for multi-string fields. For example:

  index << {:test => ['one', 'two', 'three']}
  # =>
    initialize => ()
    reset => (one)
    reset => (two)
    reset => (three)

So the tokenizer only needs to be instantiated once and then it gets
reset for each string. This is a good example of premature optimization,
particularly since most people will never even have multi-string
fields like this. Getting rid of this optimization makes things a lot
clearer. The next version of Ferret will give this output:

  index << {:test => ['one', 'two', 'three']}
  # =>
    initialize => (one)
    initialize => (two)
    initialize => (three)

So Ryan, you will now get the output you expect. It will require
updating to Ferret 0.11.4 though. Is there any reason that would be a
problem?
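
If it helps to see the two strategies side by side without Ferret
installed, here is a minimal plain-Ruby sketch. It is only an
illustration of the behaviour described above, not Ferret's actual
implementation: LoggingTokenizer, tokenize_with_reuse and
tokenize_with_fresh_instances are hypothetical names, and the tokenizer
simply returns its whole input as a single term.

```ruby
# Plain-Ruby sketch; no Ferret required. All names are hypothetical.
class LoggingTokenizer
  def initialize(input, log)
    @log = log
    @log << "initialize => (#{input})"
    @input = input
  end

  # Return the whole input as one term, then nil on later calls.
  def next
    term, @input = @input, nil
    term
  end

  # Point an existing tokenizer at a new string.
  def text=(text)
    @log << "reset => (#{text})"
    @input = text
  end
end

# Old strategy: create one tokenizer with an empty string,
# then reset it for every string in the field.
def tokenize_with_reuse(values, log)
  tokenizer = LoggingTokenizer.new("", log)
  values.map do |value|
    tokenizer.text = value
    tokenizer.next
  end
end

# New strategy: create a fresh tokenizer for every string.
def tokenize_with_fresh_instances(values, log)
  values.map { |value| LoggingTokenizer.new(value, log).next }
end

reuse_log = []
tokenize_with_reuse(%w[one two three], reuse_log)
puts reuse_log

fresh_log = []
tokenize_with_fresh_instances(%w[one two three], fresh_log)
puts fresh_log
```

Both strategies produce the same terms; only the sequence of
initialize/reset calls differs, which is why an analyzer that ignores
text= behaves differently under the old strategy.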

Hope that helps,
Dave

-- 
Dave Balmain
http://www.davebalmain.com/