Hi,
In nutch we have a use case in which we need to store tokens with their
original text plus their stemmed form plus their canonical form(through
some asciifization). From my understanding of lucene, it makes sense to
write a tokenstream which generates several tokens for each "word", but
place all the tokens for the "word" at the same position with
Token#setPositionIncrement(0).
This way we could be able to search over this field using any
form(stemmed, canonical, original) of the "word". Actually i have two
questions here. First is that is there any way to avoid matching stemmed
or canonical forms to a phrase query. Moreover it seems that adding
multiple forms of the "word"s alters statistical calculations for
scoring, especially for tf and idf, because the frequency of the root
form of the word is incremented at each word with that root form. Is
there any way that we could avoid it?
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]