Another (obvious) option is to use two indexes and direct the query to the
appropriate index depending on the search specification. Of course you
double your space requirements, but you're basically going to do that anyway
if you use two fields. I chose this for the slight benefit of fewer fields on
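
A minimal sketch of the two-index routing idea, in plain Python rather than against Lucene itself. The `TinyIndex` class, the toy `stem` function and the `exact` flag are all hypothetical names made up for illustration; a real setup would use two Lucene indexes with different analyzers.

```python
from collections import defaultdict


def stem(word):
    """Toy stemmer (assumption): strips a trailing 's'. Stand-in for a real stemmer."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


class TinyIndex:
    """Hypothetical in-memory inverted index: term -> set of document ids."""

    def __init__(self, normalize):
        self.normalize = normalize
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for word in text.lower().split():
            self.postings[self.normalize(word)].add(doc_id)

    def search(self, word):
        return self.postings.get(self.normalize(word.lower()), set())


raw_index = TinyIndex(normalize=lambda w: w)  # original forms only
stemmed_index = TinyIndex(normalize=stem)     # stemmed forms only


def index_document(doc_id, text):
    # Every document goes into both indexes; this is where the doubled space comes from.
    raw_index.add(doc_id, text)
    stemmed_index.add(doc_id, text)


def search(word, exact):
    # Route the query to one index depending on the search specification.
    return (raw_index if exact else stemmed_index).search(word)


index_document(1, "black dogs")
index_document(2, "one dog")
```

With this layout the query itself never has to name more than one field; the only decision is which index receives it.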
On 5/25/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: Yes, indeed we could but it brings other problems, for example increasing
: the index size, and extending the query to search for multiple fields, etc.
1) if you index both the raw and stemmed forms your index is going to grow
to roughly
: Yes, indeed we could but it brings other problems, for example increasing
: the index size, and extending the query to search for multiple fields, etc.
1) if you index both the raw and stemmed forms your index is going to grow
to roughly the same size regardless of whether the stem and the raw a
I can only speak to the "avoid matching stemmed
or canonical forms" part...
Yes, but you've got to do some fancy dancing when you index,
something like adding a special signifier to, say, the original token.
I'll ignore the canonical part of your question for the sake of
brevity.
Consider inde
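
A minimal sketch of the signifier idea described above, again in plain Python rather than Lucene's API. The `$` marker, the toy `stem` function and the `(term, position_increment)` tuples are assumptions for illustration only. The one Lucene concept borrowed here is the position increment: a token emitted with increment 0 sits at the same position as the previous token.

```python
ORIGINAL_MARKER = "$"  # hypothetical signifier; any character the tokenizer never emits would do


def stem(word):
    """Toy stemmer (assumption): strips a trailing 's'. Stand-in for a real stemmer."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def analyze(text):
    """Yield (term, position_increment) pairs for one field.

    Each word produces its stem, followed by the marked original form stacked
    at the same position (increment 0), so phrase positions are unaffected.
    """
    for word in text.lower().split():
        yield (stem(word), 1)
        yield (ORIGINAL_MARKER + word, 0)


def query_term(word, exact):
    """Pick the term to search for: marked original for exact matches, stem otherwise."""
    word = word.lower()
    return ORIGINAL_MARKER + word if exact else stem(word)


tokens = list(analyze("Running dogs"))
terms = {term for term, _ in tokens}
```

Because the marked and stemmed terms live in the same field, an exact search only changes the term it looks up, not the field it searches.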
Yes, indeed we could but it brings other problems, for example increasing
the index size, and extending the query to search for multiple fields, etc.
On 5/25/07, Steven Rowe <[EMAIL PROTECTED]> wrote:
Hi Enis,
Enis Soztutar wrote:
> In nutch we have a use case in which we need to store tokens
Hi Enis,
Enis Soztutar wrote:
> In nutch we have a use case in which we need to store tokens with their
> original text plus their stemmed form plus their canonical form (through
> some asciifization). From my understanding of lucene, it makes sense to
> write a tokenstream which generates several
Hi,
In nutch we have a use case in which we need to store tokens with their
original text plus their stemmed form plus their canonical form (through
some asciifization). From my understanding of lucene, it makes sense to
write a tokenstream which generates several tokens for each "word", but
p