Yes, we could, but that brings other problems: for example, it increases
the index size and requires extending the query to search multiple fields.

On 5/25/07, Steven Rowe <[EMAIL PROTECTED]> wrote:

Hi Enis,

Enis Soztutar wrote:
> In Nutch we have a use case in which we need to store tokens with their
> original text plus their stemmed form plus their canonical form (through
> some asciification). From my understanding of Lucene, it makes sense to
> write a TokenStream which generates several tokens for each "word", but
> places all the tokens for the "word" at the same position with
> Token#setPositionIncrement(0).
> This way we would be able to search over this field using any
> form (stemmed, canonical, original) of the "word". Actually I have two
> questions here. First, is there any way to avoid matching stemmed
> or canonical forms in a phrase query? Second, it seems that adding
> multiple forms of the "word"s alters the statistical calculations for
> scoring, especially for tf and idf, because the frequency of the root
> form of the word is incremented at each word with that root form. Is
> there any way that we could avoid that?
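
For readers following the thread, the same-position scheme described in
the quoted message can be sketched roughly as below. This is a
stand-alone model, not Lucene code: the class MultiFormTokens, the Tok
type, and the stem() and asciify() helpers are hypothetical stand-ins
made up for illustration. In Lucene itself the zero increment would be
set with Token#setPositionIncrement(0), as mentioned in the quoted text.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Stand-alone model of stacking several forms of a word at one position.
// Nothing here is Lucene API; all names are assumptions for illustration.
class MultiFormTokens {

    // A term plus its position increment relative to the previous token.
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Hypothetical toy stemmer: lowercases and strips one trailing "s".
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (lower.length() > 3 && lower.endsWith("s")) {
            return lower.substring(0, lower.length() - 1);
        }
        return lower;
    }

    // Hypothetical "asciification": decompose accented characters,
    // drop the combining marks, and lowercase the result.
    static String asciify(String word) {
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    // Emits the original form with increment 1, then every other distinct
    // form with increment 0 so that it shares the original's position.
    static List<Tok> expand(List<String> words) {
        List<Tok> out = new ArrayList<>();
        for (String word : words) {
            Set<String> forms = new LinkedHashSet<>();
            forms.add(word);
            forms.add(stem(word));
            forms.add(asciify(word));

            int increment = 1;
            for (String form : forms) {
                out.add(new Tok(form, increment));
                increment = 0; // later forms stack on the same position
            }
        }
        return out;
    }
}
```

Note that every stacked form adds a posting for its term, which is the
source of the tf/idf concern raised in the quoted message.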

Answering both questions: Couldn't you just use a different field for
each form?

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
