Re: Does Index have a Tokenizer Built into it

John Paul Sondag Fri, 13 Jul 2007 16:19:18 -0700

Ard,

I do have access to the URLs of the documents, but I will be making short
snippets for many pages (suppose each page has about 20 hits and I need to
make a snippet for each of them), so I was worried it would be inefficient
to open each hit, tokenize it, and then make the snippet. Of course, that
cost may be worth paying compared with the price of an increased index
size. I have been looking into storing "Field Vectors with positions" in
the index. It seems that by doing this I will have access to everything
that the Tokenizer is giving me, correct? Will I need to store "term text"
in order to be able to access the actual terms instead of stemmed words?
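
To make the question concrete, here is a minimal sketch of what I expect
position data to give me. It does not use the Lucene API: the class
TermPositionsSketch and the method inPositionOrder are hypothetical names,
and the term -> positions map only stands in for whatever a vector with
positions exposes. It shows that terms can be put back in document order,
but only in their indexed (analyzed) form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: turns a term -> positions map
// back into a list of terms ordered by position.
class TermPositionsSketch {

    static List<String> inPositionOrder(Map<String, int[]> termPositions) {
        // TreeMap iterates its entries in ascending position order.
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : termPositions.entrySet()) {
            for (int position : entry.getValue()) {
                byPosition.put(position, entry.getKey());
            }
        }
        return new ArrayList<>(byPosition.values());
    }

    public static void main(String[] args) {
        // Assumed analysis of "The dogs were running": the stop words at
        // positions 0 and 2 were dropped and the other words were stemmed.
        Map<String, int[]> vector = Map.of(
                "dog", new int[] {1},
                "run", new int[] {3});
        System.out.println(inPositionOrder(vector));
    }
}
```

The gaps at positions 0 and 2 and the stemmed forms are the reason I am
asking whether the original term text has to be stored separately.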


Thanks for all your help,

--JP

On 7/13/07, Ard Schrijvers <[EMAIL PROTECTED]> wrote:


Hello,

> I'm wondering if after
> opening the
> index I can retrieve the Tokens (not the terms) of a
> document, something
> akin to IndexReader.Document(n).getTokenizer().

It is obviously not possible to get the original tokens of the document
back when you haven't stored the document, because:

1) the analyzer might have removed stop words in the first place
2) the terms in the Lucene index may be stemmed words, synonyms, etc.
3) how would you expect things like spaces, commas, dots, etc. to be
restored?

Also, I think what you want does not fit an inverted index. When you do
not store the document, you always lose information about the document
during indexing/analyzing.
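
As a rough illustration of that loss, here is a toy analysis chain in plain
Java. It is only a sketch under assumptions: LossyAnalysisSketch, its stop
word list, and the strip-a-trailing-"s" rule are made up for this example
and are not Lucene's StandardAnalyzer or any real stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer: lowercases, drops stop words, and applies
// a crude "stemmer". None of this is Lucene code.
class LossyAnalysisSketch {

    // Assumed stop word list, chosen only for the example.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of");

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Splitting on non-alphanumerics discards spaces, commas, dots, etc.
        for (String raw : text.split("[^A-Za-z0-9]+")) {
            if (raw.isEmpty()) {
                continue; // leading delimiter produces an empty first piece
            }
            String term = raw.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(term)) {
                continue; // stop words never reach the index
            }
            // Crude stemming stand-in: strip one trailing "s" from longer words.
            if (term.length() > 3 && term.endsWith("s")) {
                term = term.substring(0, term.length() - 1);
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The dogs, and the cat."));
        System.out.println(analyze("Dog cats"));
    }
}
```

Both inputs come out as the same two terms, so the terms alone cannot tell
you which text was indexed.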

How many documents are you talking about? They must be either somewhere on
the file system or accessible over HTTP. When you need the document, why
not just provide a link to the original location?

Regards Ard

>
> In summary:
>
> My current (too wasteful) implementation is this:
>
> StandardTokenizer(BufferedReader(
> IndexReader.Document(n).getField("text")
> ))
>
> I'm wondering if Lucene has a more efficient way to
> retrieve the tokens
> of a document from an index, because it seems like it has
> information about
> every "term" already, since you can retrieve a
> TermPositions object.
>
> Thanks,
>
>
> --JP
>

