Hello,

> Ard,
> 
> I do have access to the URLs of the documents, but because I will be
> making short snippets for many pages (suppose a page had about 20 hits
> and I need to make snippets for each of them), I was worried it would
> be inefficient to open each "hit", tokenize it, and then make the
> snippet, of

Yes, fetching all the documents over HTTP just to get the snippet, for
example the first two lines, is really bad for performance in search
result overviews.

Logically, whatever you want to show, you need to store in your index.
For example, if for search hits you need to show the title and
subtitle, just store those two fields in the index. If you want a
Google-like highlighter of text snippets where the term occurred, you
need to store the entire text IIRC (see HighlighterTest in Lucene).
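
Roughly like this (an untested sketch against the Lucene 2.x contrib
highlighter; the "content" field name and the searcher/query/docId
variables are just assumptions for illustration):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    // "content" must be a stored field, otherwise there is no text
    // to build fragments from
    String text = searcher.doc(docId).get("content");
    TokenStream tokens = new StandardAnalyzer()
        .tokenStream("content", new StringReader(text));
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    // best fragment around the query terms, wrapped in <B> by default
    String snippet = highlighter.getBestFragment(tokens, text);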

How many docs are you talking about, that you cannot store the entire content?

You could also index the content but not store it, and in another
Lucene field store the first 2 or 3 lines of the document to serve as
the text snippet. Making correct extracts of text snippets is very
hard (see LingPipe for example).
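
At indexing time that could look like this (a sketch with the Lucene
2.x field API; the field names and the writer/fullText/firstLines
variables are just examples):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // full text: searchable, but not stored (keeps the index smaller)
    doc.add(new Field("content", fullText,
        Field.Store.NO, Field.Index.TOKENIZED));
    // first few lines: stored verbatim for display, never searched
    doc.add(new Field("snippet", firstLines,
        Field.Store.YES, Field.Index.NO));
    writer.addDocument(doc);

At search time you display the stored "snippet" field and never have
to touch the original document.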

Regards Ard

> course this may be worth the price of the increased index size.  I
> have been looking into storing "Field Vectors with positions" in the
> index.  It seems that by doing this I will have access to everything
> that the Tokenizer is giving me, correct?  Will I need to store "term
> text" in order to be able to access the actual term instead of
> stemmed words?
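
For what it's worth: term vectors give you back the terms exactly as
they were indexed, i.e. after stemming and stop word removal, so the
stemmed form rather than the original word. Reading them back looks
roughly like this (an untested Lucene 2.x sketch; it assumes the field
was indexed with Field.TermVector.WITH_POSITIONS_OFFSETS and that
reader/docId exist):

    import org.apache.lucene.index.TermPositionVector;
    import org.apache.lucene.index.TermVectorOffsetInfo;

    TermPositionVector tpv = (TermPositionVector)
        reader.getTermFreqVector(docId, "content");
    String[] terms = tpv.getTerms();  // post-analysis term text
    for (int i = 0; i < terms.length; i++) {
        int[] positions = tpv.getTermPositions(i);
        // offsets map each term back into the original text,
        // if you stored it
        TermVectorOffsetInfo[] offsets = tpv.getOffsets(i);
    }
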
> 
> Thanks for all your help,
> 
> --JP
> 
> On 7/13/07, Ard Schrijvers <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> >
> > > I'm wondering if after opening the index I can retrieve the
> > > Tokens (not the terms) of a document, something akin to
> > > IndexReader.Document(n).getTokenizer().
> >
> > It is obviously not possible to get the original tokens of the
> > document back when you haven't stored the document, because:
> >
> > 1) the analyzer might have removed stop words in the first place
> > 2) the terms in the Lucene index are perhaps stemmed words /
> > synonyms / etc.
> > 3) how would you expect things like spaces, commas, dots etc. to be
> > restored?
> >
> > And I think what you want does not fit an inverted index: when you
> > do not store the document, you always lose information about the
> > document during indexing/analyzing.
> >
> > How many documents are you talking about? They must be either
> > somewhere on the filesystem or accessible over HTTP... when you
> > need the document, why not just provide a link to the original
> > location?
> >
> > Regards Ard
> >
> > >
> > > In summary:
> > >
> > > My current (too wasteful) implementation is this:
> > >
> > >     new StandardTokenizer(new StringReader(
> > >         indexReader.document(n).get("text")))
> > >
> > > I'm wondering if Lucene has a more efficient manner to retrieve
> > > the tokens of a document from an index, because it seems like it
> > > already has information about every "term", since you can
> > > retrieve a TermPositions object.
> > >
> > > Thanks,
> > >
> > >
> > > --JP
> > >
> >
