Hi,

In Nutch we have a use case in which we need to store each token in its original form, its stemmed form, and its canonical form (obtained through some ASCII folding). From my understanding of Lucene, it makes sense to write a TokenStream that generates several tokens for each "word" but places all the tokens for that "word" at the same position, by calling Token#setPositionIncrement(0) on every form after the first. This way we would be able to search this field using any form (stemmed, canonical, original) of the "word".

I have two questions about this approach.

First, is there any way to keep the stemmed or canonical forms from matching a phrase query?

Second, adding multiple forms of each "word" seems to alter the statistics used for scoring, especially tf and idf, because the frequency of the root form is incremented by every word that shares that root. Is there any way to avoid this?
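To make the token-stacking idea concrete, here is a rough sketch in plain Java. It does not use the real Lucene classes: `Tok` is a made-up stand-in for a token (term text plus position increment), and `stem` and `canonical` are toy placeholders I invented for illustration, not the analyzers we would actually use. Dropping duplicate forms of the same word is also just an assumption of this sketch.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StackedTokens {

    /** Hypothetical stand-in for a Lucene token: term text plus position increment. */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Toy canonical form: strip combining marks (accents) and lowercase. */
    static String canonical(String word) {
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    /** Toy stemmer (NOT a real algorithm): drops a trailing "ing" or "s". */
    static String stem(String word) {
        String w = canonical(word);
        if (w.length() > 5 && w.endsWith("ing")) {
            return w.substring(0, w.length() - 3);
        }
        if (w.length() > 3 && w.endsWith("s")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    /** Emits the forms of each word; only the first form advances the position. */
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // LinkedHashSet keeps insertion order and drops identical forms.
            Set<String> forms = new LinkedHashSet<>();
            forms.add(word);
            forms.add(stem(word));
            forms.add(canonical(word));

            int increment = 1; // first form moves to the next position
            for (String form : forms) {
                out.add(new Tok(form, increment));
                increment = 0; // remaining forms share that position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int position = -1;
        for (Tok t : analyze("Caf\u00e9s running")) {
            position += t.positionIncrement;
            System.out.println(position + "\t" + t.term);
        }
    }
}
```

In this sketch every form at a given position counts as an ordinary term occurrence, which is exactly what raises the two questions above: a phrase query only looks at positions, so it can match on the stemmed or canonical form, and the root form's frequency grows with every word that shares the root.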


