Re: multiple tokens at the same position

2007-05-25 Thread Mark Miller
Another (obvious) option is to use two indexes and direct the query to the appropriate index depending on the search specification. Of course you double your space requirements, but you're basically going to do that anyway if you use two fields. I chose this for the slight benefit of fewer fields on
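
A minimal sketch of the two-index idea described above, in plain Python rather than Lucene. Everything here is an assumption made for illustration: the in-memory dictionaries stand in for real indexes, and `naive_stem`, `build` and `search` are hypothetical names, not part of any library.

```python
from collections import defaultdict


def naive_stem(word):
    """Hypothetical placeholder stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build(docs, analyze):
    """Build one inverted index (term -> set of doc ids) using one analysis function."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[analyze(word)].add(doc_id)
    return index


def search(indexes, term, exact):
    """Route the query to the index that matches the search specification."""
    name = "exact" if exact else "stemmed"
    analyze = (lambda w: w) if exact else naive_stem
    return set(indexes[name].get(analyze(term.lower()), set()))
```

Both indexes hold every document, which is where the doubled space requirement comes from; the query side only has to pick one of them.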

Re: multiple tokens at the same position

2007-05-25 Thread Enis Soztutar
On 5/25/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : Yes, indeed we could but it brings other problems, for example increasing : the index size, and extending the query to search for multiple fields, etc. 1) if you index both the raw and stemmed forms your index is going to grow to roughly

Re: multiple tokens at the same position

2007-05-25 Thread Chris Hostetter
: Yes, indeed we could but it brings other problems, for example increasing : the index size, and extending the query to search for multiple fields, etc. 1) if you index both the raw and stemmed forms your index is going to grow to roughly the same size regardless of whether the stem and the raw a

Re: multiple tokens at the same position

2007-05-25 Thread Erick Erickson
I can only speak to the "avoid matching stemmed or canonical forms" part... Yes, but you've got to do some fancy dancing when you index, something like adding a special signifier to, say, the original token. I'll ignore the canonical part of your question for the sake of brevity. Consider inde
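
One way the "special signifier" idea might look, sketched in plain Python rather than Lucene. The preview above is cut off before the details, so the marker character, the naive stemmer and the function names below are all assumptions for illustration and not necessarily what the message went on to propose.

```python
from collections import defaultdict

EXACT_MARKER = "$"  # hypothetical signifier; the thread preview does not name one


def naive_stem(word):
    """Hypothetical placeholder stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Index each word twice in a single index: stemmed, and original plus marker."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[naive_stem(word)].add(doc_id)
            index[word + EXACT_MARKER].add(doc_id)
    return index


def search(index, term, exact=False):
    """Exact searches look up the marked term; all others look up the stem."""
    term = term.lower()
    key = term + EXACT_MARKER if exact else naive_stem(term)
    return set(index.get(key, set()))
```

Because the marked original can never collide with a stem, a single field serves both kinds of query, at the cost of storing two terms per word.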

Re: multiple tokens at the same position

2007-05-25 Thread Enis Soztutar
Yes, indeed we could but it brings other problems, for example increasing the index size, and extending the query to search for multiple fields, etc. On 5/25/07, Steven Rowe <[EMAIL PROTECTED]> wrote: Hi Enis, Enis Soztutar wrote: > In nutch we have a use case in which we need to store tokens

Re: multiple tokens at the same position

2007-05-25 Thread Steven Rowe
Hi Enis, Enis Soztutar wrote: > In nutch we have a use case in which we need to store tokens with their > original text plus their stemmed form plus their canonical form (through > some asciifization). From my understanding of lucene, it makes sense to > write a tokenstream which generates several

multiple tokens at the same position

2007-05-25 Thread Enis Soztutar
Hi, In nutch we have a use case in which we need to store tokens with their original text plus their stemmed form plus their canonical form (through some asciifization). From my understanding of lucene, it makes sense to write a tokenstream which generates several tokens for each "word", but p
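
To make the idea in the question concrete, here is a rough sketch in plain Python, not the Lucene API, of a token stream that emits several tokens per word at the same position. It assumes the usual position-increment convention (1 advances to a new position, 0 stays on the previous one); `Token`, `naive_stem`, `asciify`, `expand` and `absolute_positions` are hypothetical names, and the stemmer is only a placeholder.

```python
import unicodedata
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    text: str
    position_increment: int  # 1 = new position, 0 = same position as previous token


def naive_stem(word: str) -> str:
    """Hypothetical placeholder stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def asciify(word: str) -> str:
    """Canonical form: decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand(words: Iterable[str]) -> Iterator[Token]:
    """Emit the original word, then any distinct variants stacked on the same position."""
    for word in words:
        yield Token(word, 1)
        seen = {word}
        for variant in (naive_stem(word), asciify(word)):
            if variant not in seen:
                seen.add(variant)
                yield Token(variant, 0)


def absolute_positions(tokens: Iterable[Token]) -> List[Tuple[str, int]]:
    """Turn position increments into absolute positions, starting at 0."""
    position = -1
    result = []
    for token in tokens:
        position += token.position_increment
        result.append((token.text, position))
    return result
```

Variants identical to the original are skipped, so a word that is already in stemmed ASCII form contributes only one token.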