First, thanks to Jens K. for pointing out an error on my part regarding
the use of test_token_stream().
My current problem: a custom tokenizer I've written in Ruby does not
properly create an index (or at least searches against the index don't
work). Using test_token_stream() I have verified that my tokenizer produces
the token stream properly; certainly each Token's attributes are set
correctly. Nevertheless, simple searches return zero results.
The essence of my tokenizer is to skip past the XML tags in a file and to
break up and return the remaining text as tokens. I use this approach
rather than an Hpricot-based one because I need to keep track of the
location of the text relative to the XML tags: after a search for a phrase
I'll want to extract the nearby tags, as they contain important context. My
tokenizer (XMLTokenizer) contains the obligatory initialize, next, and
text= methods (shown below) as well as a number of parsing methods that are
called at the top level by XMLTokenizer.get_next_token, which does the
primary work within next. I didn't include the details of get_next_token
since I'm assuming that if each token it produces has the proper attributes
then it shouldn't be the cause of the problem. What more should I be
looking for? I've been looking for a custom tokenizer written in Ruby to
model after; any suggestions?
def initialize(xmlText)
  # replace punctuation that should not appear inside tokens
  @xmlText = xmlText.gsub(/[;,!]/, ' ')
  @currPtr = 0
  @currWordStart = nil
  # position the scan at the first text region past the opening tag
  @currTextStart = XMLTokenizer.skip_beyond_current_tag(@currPtr, @xmlText)
  @nextTagStart  = XMLTokenizer.skip_beyond_current_text(@currTextStart,
                                                         @xmlText)
  @currPtr = @currTextStart
  @startOfTextRegion = 1
end
def next
  tkn = get_next_token
  unless tkn.nil?
    # debug output: start offset, end offset, position increment, text
    puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
  end
  tkn
end
# Ferret calls text= to reset the stream with a new input string.
def text=(text)
  initialize(text)
  @xmlText
end
Below is text from a previous, related message showing that StopFilter is
not working:
> I've written a tokenizer/analyzer that parses a file extracting tokens,
> and I run this analyzer/tokenizer on ASCII data consisting of XML files
> (the tokenizer skips over XML elements but maintains relative
> positioning). I've written many unit tests to check the produced token
> stream and was confident that the tokenizer was working properly. Then I
> noticed two problems:
>
> 1. StopFilter (using English stop words) does not properly filter the
> token stream output from my tokenizer. If I explicitly pass an array of
> stop words to the stop filter it still doesn't work. If I simply switch
> my tokenizer to a StandardTokenizer the stop words are appropriately
> filtered (of course the XML tags are treated differently).
>
> 2. When I try a simple search no results come up. I can see that my
> tokenizer is adding files to the index but a simple search (using
> Ferret::Index::Index.search_each) produces no results.
Any suggestions are appreciated.
John