First, thanks to Jens K. for pointing out an error on my part regarding
the use of test_token_stream().
My current problem: a custom tokenizer I've written in Ruby does not
properly create an index (or at least searches against the index don't
work). Using test_token_stream() I have verified that my tokenizer produces
the token stream properly; certainly each Token's attributes are set
correctly. Nevertheless, simple searches return zero results.
The essence of my tokenizer is to skip past the XML tags in a file and to
break up and return the remaining text as tokens. I use this approach
rather than an Hpricot-based one because I need to keep track of the
location of the text relative to the XML tags: after a search for a phrase
I'll want to extract the nearby tags, as they contain important context. My
tokenizer (XMLTokenizer) contains the obligatory initialize, next, and
text= methods (shown below) as well as a number of parsing methods that are
called at the top level by XMLTokenizer.get_next_token, which does the
primary work within next. I didn't include the details of get_next_token
since I'm assuming that if each token it produces has the proper attributes
then it shouldn't be the cause of the problem. What more should I be
looking for? I've been looking for a custom tokenizer written in Ruby to
model after; any suggestions?
def initialize(xmlText)
  # replace punctuation that should not appear inside tokens
  @xmlText = xmlText.gsub(/[;,!]/, ' ')
  @currPtr = 0
  @currWordStart = nil
  # position the scan at the first text region past the opening tag
  @currTextStart = XMLTokenizer.skip_beyond_current_tag(@currPtr, @xmlText)
  @nextTagStart  = XMLTokenizer.skip_beyond_current_text(@currTextStart,
                                                         @xmlText)
  @currPtr = @currTextStart
  @startOfTextRegion = 1
end
def next
  tkn = get_next_token
  unless tkn.nil?
    # debug output: start offset, end offset, position increment, text
    puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
  end
  tkn
end
# Ferret calls text= to reset the stream with a new input string.
def text=(text)
  initialize(text)
  @xmlText
end
Below is text from a previous, related message showing that StopFilter is
not working:
> I've written a tokenizer/analyzer that parses a file extracting tokens,
> and I run this analyzer/tokenizer on ASCII data consisting of XML files
> (the tokenizer skips over XML elements but maintains relative
> positioning). I've written many unit tests to check the produced token
> stream and was confident that the tokenizer was working properly. Then I
> noticed two problems:
>
> 1. StopFilter (using English stop words) does not properly filter the
> token stream output from my tokenizer. If I explicitly pass an array of
> stop words to the stop filter it still doesn't work. If I simply switch
> my tokenizer to a StandardTokenizer the stop words are appropriately
> filtered (of course the XML tags are treated differently).
>
> 2. When I try a simple search no results come up. I can see that my
> tokenizer is adding files to the index but a simple search (using
> Ferret::Index::Index.search_each) produces no results.
Any suggestions are appreciated.
John