I can only speak to the "avoid matching stemmed
or canonical forms" part...

Yes, but you've got to do some fancy dancing when you index,
something like adding a special signifier to, say, the original token.
I'll ignore the canonical part of your question for the sake of
brevity.


Consider indexing "running". You'd index both "run" (the stem)
and "running$" (the original token plus the marker).

Now, whenever you care about the original token, you append
the '$' to the term and search for that.
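
A minimal sketch of that scheme, in plain Python rather than Lucene.
Everything named here is an assumption made for illustration: toy_stem
is a hypothetical stand-in for a real stemmer, and the
(term, position_increment) pairs stand in for Lucene tokens, where an
increment of 0 plays the role of Token#setPositionIncrement(0) from the
original question.

```python
MARKER = "$"  # assumed signifier; pick a character the analyzer never emits


def toy_stem(word):
    """Hypothetical stemmer: strips 'ing' and undoubles a final consonant."""
    if word.endswith("ing") and len(word) > 5:
        base = word[:-3]
        # "running" -> "runn" -> "run"
        if len(base) >= 2 and base[-1] == base[-2]:
            base = base[:-1]
        return base
    return word


def analyze(text):
    """Yield (term, position_increment) pairs for each word.

    The stem advances the position by 1; the marked original is stacked
    on the same position with an increment of 0.
    """
    for word in text.lower().split():
        yield toy_stem(word), 1
        yield word + MARKER, 0


def exact_term(word):
    """Build the query term to use when only the original form should match."""
    return word.lower() + MARKER


print(list(analyze("running")))  # [('run', 1), ('running$', 0)]
print(exact_term("running"))     # running$
```

In a real analyzer the same idea would live in a token filter; the
sketch only shows which terms end up in the index and at which
positions.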

This has one other advantage. Say you index the term "run" with
the scheme above, so the index holds "run" and "run$". If you
don't add something like the $ to the original, you can't tell
whether a hit on "run" came from a document whose original word
was "run" or from one whose original word was "running", since
both produce the term "run". This may be important for "exact
match".
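
To make the distinction concrete, here is a small sketch using a toy
in-memory inverted index in Python. It is not Lucene code: toy_stem,
build_index and search are hypothetical helpers invented for this
example, and the two documents are made up.

```python
from collections import defaultdict

MARKER = "$"  # assumed signifier appended to the original token


def toy_stem(word):
    """Hypothetical stemmer that only knows the forms used in this example."""
    return "run" if word in ("running", "runs") else word


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[toy_stem(word)].add(doc_id)   # stemmed form
            index[word + MARKER].add(doc_id)    # marked original form
    return index


def search(index, term, exact=False):
    """Look up the marked original when exact=True, otherwise the stem."""
    term = term.lower()
    key = term + MARKER if exact else toy_stem(term)
    return sorted(index.get(key, set()))


docs = {1: "I run daily", 2: "I was running late"}
index = build_index(docs)

print(search(index, "run"))                   # [1, 2]
print(search(index, "run", exact=True))       # [1]
print(search(index, "running", exact=True))   # [2]
```

Without the marked terms, the first query is the only one available,
and documents 1 and 2 are indistinguishable.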

Best
Erick

On 5/25/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:

Hi,

In Nutch we have a use case in which we need to store tokens with their
original text plus their stemmed form plus their canonical form (through
some asciifization). From my understanding of Lucene, it makes sense to
write a TokenStream which generates several tokens for each "word", but
places all the tokens for the "word" at the same position with
Token#setPositionIncrement(0).
This way we would be able to search over this field using any
form (stemmed, canonical, original) of the "word". I have two
questions here. First, is there any way to avoid matching stemmed
or canonical forms in a phrase query? Second, it seems that adding
multiple forms of each "word" alters the statistical calculations for
scoring, especially for tf and idf, because the frequency of the root
form is incremented at every word that shares that root form. Is
there any way that we could avoid this?




