Yes, we could, but that brings other problems, for example a larger index and queries that have to be extended to search multiple fields.
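To make the query-side cost concrete, here is a rough sketch of how a single user term fans out when each form lives in its own field. The field names `original`, `stemmed`, and `canonical` and the helper `expand` are made up for illustration. The output uses the classic Lucene query syntax (`field:term` joined with `OR`), and the sketch assumes the term needs no escaping.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: builds a query string, does not call Lucene. */
final class MultiFieldQuery {

    // Hypothetical field names, one per stored form of the word.
    static final String[] FIELDS = {"original", "stemmed", "canonical"};

    /** Expands one term into one clause per field, joined with OR. */
    static String expand(String term) {
        List<String> clauses = new ArrayList<>();
        for (String field : FIELDS) {
            clauses.add(field + ":" + term);
        }
        return "(" + String.join(" OR ", clauses) + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("running"));
    }
}
```

Every term in the user's query turns into three clauses, and in a real setup each clause would also need to go through that field's own analysis.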
On 5/25/07, Steven Rowe <[EMAIL PROTECTED]> wrote:
Hi Enis,

Enis Soztutar wrote:
> In Nutch we have a use case in which we need to store tokens with their
> original text plus their stemmed form plus their canonical form (through
> some asciifization). From my understanding of Lucene, it makes sense to
> write a TokenStream which generates several tokens for each "word", but
> place all the tokens for the "word" at the same position with
> Token#setPositionIncrement(0).
> This way we would be able to search over this field using any
> form (stemmed, canonical, original) of the "word". Actually I have two
> questions here. First, is there any way to avoid matching stemmed
> or canonical forms in a phrase query? Moreover, it seems that adding
> multiple forms of the "word"s alters the statistical calculations for
> scoring, especially for tf and idf, because the frequency of the root
> form of the word is incremented at each word with that root form. Is
> there any way that we could avoid it?

Answering both questions: Couldn't you just use a different field for each form?

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
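
For reference, the same-position idea in the quoted message can be modeled without Lucene. The sketch below is plain Java, not the Lucene API: `Tok`, `stem`, `canonical`, `analyze`, and `termFrequencies` are hypothetical names, the stemmer is a toy that only strips a trailing "s", and the "asciifization" simply drops combining marks. It emits each original word with a position increment of 1 and any differing variant with an increment of 0, then counts how often each term was emitted.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain-Java model of "stacked" tokens; this is not Lucene code. */
final class StackedTokens {

    /** A term plus its distance from the previous token (0 = same position). */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Toy stand-in for a real stemmer: strips one trailing "s" from longer words.
    static String stem(String w) {
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    // Toy "asciifization": decompose accented letters, then drop the combining marks.
    static String canonical(String w) {
        return Normalizer.normalize(w, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    /** Emits the original word (increment 1) and any differing variants (increment 0). */
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String word = raw.toLowerCase(Locale.ROOT);
            out.add(new Tok(word, 1));
            String s = stem(word);
            if (!s.equals(word)) {
                out.add(new Tok(s, 0));
            }
            String c = canonical(word);
            if (!c.equals(word) && !c.equals(s)) {
                out.add(new Tok(c, 0));
            }
        }
        return out;
    }

    /** Counts how many times each term was emitted into the single field. */
    static Map<String, Integer> termFrequencies(List<Tok> toks) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (Tok t : toks) {
            tf.merge(t.term, 1, Integer::sum);
        }
        return tf;
    }
}
```

For the input "cats cat", `analyze` emits "cats" (+1), "cat" (+0), and "cat" (+1), so the term "cat" is counted twice although it was written once. That is the tf inflation the second question describes.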