Re: Does Index have a Tokenizer Built into it

John Paul Sondag Tue, 17 Jul 2007 10:21:30 -0700

Hi,

I've been looking into indexing documents with term vectors that include
positions to solve my problem.  However, I've run into a bit of a snag.
After indexing I have been able to retrieve the TermPositionVector from the
index and it has all of the data, but I cannot find a way to retrieve the
term at a given position, which is how I was hoping to create my contextual
snippets.


There are functions that, given a term, return its positions, but I see
no method to achieve the reverse effect.  Is there another class I need to
use for this?
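
One way around this, as far as I can tell, is to invert the vector yourself:
walk every term, read its positions, and fill an array indexed by position.
The sketch below does that on plain arrays so it runs without Lucene; with a
real TermPositionVector the `terms` array would come from getTerms() and each
`positions[i]` from getTermPositions(i). The names PositionIndex, invert and
window are made up for illustration. Positions the analyzer dropped (stop
words, for instance) simply stay null.

```java
class PositionIndex {
    /**
     * Inverts a term -> positions mapping into a position -> term array.
     * terms[i] occurs at every position listed in positions[i].
     * Slots left null are positions with no indexed term (e.g. stop words).
     */
    static String[] invert(String[] terms, int[][] positions) {
        int max = -1;
        for (int[] ps : positions) {
            for (int p : ps) {
                max = Math.max(max, p);
            }
        }
        String[] byPosition = new String[max + 1];
        for (int i = 0; i < terms.length; i++) {
            for (int p : positions[i]) {
                byPosition[p] = terms[i];
            }
        }
        return byPosition;
    }

    /** Joins the terms within `radius` positions of `center`; "_" marks a gap. */
    static String window(String[] byPosition, int center, int radius) {
        int from = Math.max(0, center - radius);
        int to = Math.min(byPosition.length - 1, center + radius);
        StringBuilder sb = new StringBuilder();
        for (int p = from; p <= to; p++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(byPosition[p] == null ? "_" : byPosition[p]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "the quick brown fox jumps over the lazy dog" with "the" removed
        // as a stop word; terms are sorted, as they are in a term vector.
        String[] terms = {"brown", "dog", "fox", "jumps", "lazy", "over", "quick"};
        int[][] positions = {{2}, {8}, {3}, {4}, {7}, {5}, {1}};
        String[] byPosition = invert(terms, positions);
        System.out.println(window(byPosition, 3, 2));
    }
}
```

For the sentence in main, window(byPosition, 3, 2) gives "quick brown fox
jumps over". The terms are whatever the analyzer emitted, so stemmed or
lowercased forms are what would come back, not the original text.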

--JP

On 7/16/07, John Paul Sondag <[EMAIL PROTECTED]> wrote:


Some of the data sets that I will be using have about 2 TB of data (90
million web pages).  I would like the snippet I generate to include the
words that are being queried, so I don't want to simply store the first
2 or 3 lines.  I have looked at the HighlighterTest and I do believe that
it requires the entire text of the document.  However, unlike the
highlighter, I know the termOffset in the document.

The input to my Snippet will be a vector of query words and their offsets
in the document (not their positions in the document).  I'm reading about
the "term vectors" option that I can store while indexing my data.  It
seems to be much more efficient than storing the entire document; I'm just
not sure if the "term offset" is the same as a "token offset".  Here's what
I'm reading, in case I'm totally off the ball here and this is useless to me:

http://lucene.apache.org/java/docs/fileformats.html#Term%20Vectors

It seems like this has all the information that I would have if I
tokenized the document anyways, or am I missing something?
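
To make the position/offset distinction concrete, here is a small sketch
with a hypothetical whitespace tokenizer (not Lucene's StandardTokenizer;
OffsetDemo, Tok and tokenize are illustrative names). Position is the
ordinal of the token in the stream, while the start and end offsets are
character indexes into the original text. That is what Lucene's
Token.startOffset() and endOffset() report and, as far as I understand it,
what a term vector stored with offsets keeps for each occurrence.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetDemo {
    /** One token: its text, ordinal position, and character offsets [start, end). */
    static final class Tok {
        final String text;
        final int position;
        final int start;
        final int end;

        Tok(String text, int position, int start, int end) {
            this.text = text;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace, recording both position and character offsets. */
    static List<Tok> tokenize(String s) {
        List<Tok> out = new ArrayList<Tok>();
        int i = 0;
        int pos = 0;
        while (i < s.length()) {
            // Skip any run of whitespace before the next token.
            while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < s.length() && !Character.isWhitespace(s.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new Tok(s.substring(start, i), pos++, start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("index  the web")) {
            System.out.println(t.text + " position=" + t.position
                    + " offsets=" + t.start + "-" + t.end);
        }
    }
}
```

For "index  the web" (two spaces after "index"), the token "the" has
position 1 but offsets 7 to 10, so the two numbers are not interchangeable.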

Thanks again for all the help!

--JP




On 7/16/07, Ard Schrijvers < [EMAIL PROTECTED]> wrote:
>
> Hello,
>
> > Ard,
> >
> > I do have access to the URLs of the documents, but because I will be
> > making short snippets for many pages (suppose it had about 20 hits per
> > page and I need to make Snippets for each of them) I was worried it
> > would be inefficient to open each "hit", tokenize it and then make the
> > Snippet, of
>
> Yes, getting all the documents over http just to get the snippet, for
> example the first 2 lines, is really bad for your performance in search
> overviews.
>
> Logically, what you want to show, you need to store in your index. For
> example, if for search hits you need to show the title and subtitle, just
> store these two in the index. If you want to have a Google-like highlighter
> of text snippets where the term occurred, you need to store the entire text
> IIRC (see HighlighterTest in Lucene).
>
> How many docs are you talking about that you cannot store the entire
> content?
>
> You could also just index the content and not store it, and in another
> Lucene field, store the first 2 or 3 lines of the document, which serve as
> a text snippet. Making correct extracts of text snippets is very hard (see
> LingPipe for example).
>
> Regards Ard
>
> > course the price of this may be worth the price of the increased Index
> > size.  I have been looking into storing "Field Vectors with positions" in
> > the index.  It seems that by doing this I will have access to everything
> > that the Tokenizer is giving me, correct?  Will I need to store "term
> > text" in order to be able to access the actual term instead of stemmed
> > words?
> >
> > Thanks for all your help,
> >
> > --JP
> >
> > On 7/13/07, Ard Schrijvers <[EMAIL PROTECTED]> wrote:
> > >
> > > Hello,
> > >
> > > > I'm wondering if after opening the index I can retrieve the Tokens
> > > > (not the terms) of a document, something akin to
> > > > IndexReader.Document (n).getTokenizer().
> > >
> > > It is obviously not possible to get the original tokens of the
> > > document back when you haven't stored the document, because:
> > >
> > > 1) the analyzer might have removed stop words in the first place
> > > 2) the terms in the Lucene index are perhaps stemmed words /
> > > synonyms / etc.
> > > 3) how would you expect things like spaces, commas, dots etc. to be
> > > restored?
> > >
> > > And, I think what you want does not comply with an inverted index.
> > > When you do not store the document, you always lose information about
> > > the document during indexing/analyzing.
> > >
> > > How many documents are you talking about? They must be either
> > > somewhere on FS or accessible over http... when you need the document,
> > > why not just provide a link to the original location?
> > >
> > > Regards Ard
> > >
> > > >
> > > > In summary:
> > > >
> > > > My current (too wasteful) implementation is this:
> > > >
> > > > // indexReader is an open IndexReader; "text" is a stored field
> > > > new StandardTokenizer(new StringReader(
> > > >     indexReader.document(n).get("text")))
> > > >
> > > > I'm wondering if Lucene has a more efficient way to retrieve the
> > > > tokens of a document from an index, because it seems like it already
> > > > has information about every "term", since you can retrieve a
> > > > TermPositions object.
> > > >
> > > > Thanks,
> > > >
> > > >
> > > > --JP
> > > >
> > >
> > >
> > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
>
