Hi,

I've been looking into indexing documents with term vectors (with positions turned on) to solve my problem. However, I've run into a bit of a snag. After indexing, I can retrieve the TermPositionVector from the index, and it has all of the data, but I cannot find a way to retrieve the term at a given position, which is how I was hoping to create my contextual snippets.
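One possible way to get the position-to-term lookup described above is to invert the term-to-positions data once per document. This is only a sketch under assumptions: it takes plain arrays shaped like what TermPositionVector is understood to return in Lucene 2.x (getTerms() for the terms, getTermPositions(int index) for each term's positions), and PositionIndex, invert and window are hypothetical names, not Lucene classes. The recovered terms are whatever the analyzer indexed (possibly stemmed or lowercased), and positions of removed stop words stay empty.

```java
import java.util.ArrayList;
import java.util.List;

/** Inverts term -> positions data into a position -> term lookup. */
final class PositionIndex {

    /**
     * terms[i] is an indexed term; positions[i] holds the token positions
     * where that term occurs. Returns an array indexed by position. Slots
     * with no indexed term (for example, removed stop words) stay null.
     * If several terms share a position, the last one visited wins.
     */
    static String[] invert(String[] terms, int[][] positions) {
        // Find the highest position so the lookup array can be sized.
        int max = -1;
        for (int[] ps : positions) {
            for (int p : ps) {
                max = Math.max(max, p);
            }
        }
        String[] byPosition = new String[max + 1];
        for (int i = 0; i < terms.length; i++) {
            for (int p : positions[i]) {
                byPosition[p] = terms[i];
            }
        }
        return byPosition;
    }

    /** Joins the terms within radius positions of center, skipping gaps. */
    static String window(String[] byPosition, int center, int radius) {
        int from = Math.max(0, center - radius);
        int to = Math.min(byPosition.length - 1, center + radius);
        List<String> parts = new ArrayList<>();
        for (int p = from; p <= to; p++) {
            if (byPosition[p] != null) {
                parts.add(byPosition[p]);
            }
        }
        return String.join(" ", parts);
    }
}
```

With a real index, the two arrays would presumably be filled by looping over getTerms() and calling getTermPositions(i) for each index i. A snippet around a query hit at position p would then be PositionIndex.window(byPosition, p, radius).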
They have methods where, given a term, you can get its positions, but I see no method for the reverse lookup. Is there another class I need to use for this?

--JP

On 7/16/07, John Paul Sondag <[EMAIL PROTECTED]> wrote:
Some of the data sets that I will be using have about 2 TB of data (90 million web pages). I would like the snippet I generate to include the words that are being queried, so I don't want to simply store the first 2 or 3 lines. I have looked at HighlighterTest and I do believe that it requires the entire text of the document. However, unlike the highlighter, I know where the term offsets are in the document. The input to my snippet generator will be a vector of query words and their offsets in the document (not their positions in the document).

I'm reading about the "term vectors" option that I can store while indexing my data. It seems to be much more efficient than storing the entire document; I'm just not sure whether a "term offset" is the same as a "token offset". Here's what I'm reading, in case I'm totally off base here and this is useless to me:

http://lucene.apache.org/java/docs/fileformats.html#Term%20Vectors

It seems like this has all the information that I would have if I tokenized the document anyway, or am I missing something?

Thanks again for all the help!

--JP

On 7/16/07, Ard Schrijvers <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> > Ard,
> >
> > I do have access to the URLs of the documents, but because I will be
> > making short snippets for many pages (suppose it had about 20 hits
> > per page and I need to make snippets for each of them) I was worried
> > it would be inefficient to open each "hit", tokenize it and then make
> > the snippet, of
>
> Yes, getting all the documents over http just to get the snippet, for
> example the first 2 lines, is really bad for your performance in search
> overviews.
>
> Logically, what you want to show, you need to store in your index. For
> example, if for search hits you need to show the title and subtitle, just
> store these two in the index. If you want to have a Google-like highlighter
> of text snippets where the term occurred, you need to store the entire text
> IIRC (see HighlighterTest in lucene).
>
> How many docs are you talking about that you cannot store the entire
> content?
>
> You could also just index the content and not store it, and in another
> lucene field, store the first 2 or 3 lines of the document, which serve as
> text snippet. Making correct extracts of text snippets is very hard (see
> lingpipe for example)
>
> Regards Ard
>
> > course the price of this may be worth the price of the increased index
> > size. I have been looking into storing "Field Vectors with positions"
> > in the index. It seems that by doing this I will have access to
> > everything that the Tokenizer is giving me, correct? Will I need to
> > store "term text" in order to be able to access the actual term instead
> > of stemmed words?
> >
> > Thanks for all your help,
> >
> > --JP
> >
> > On 7/13/07, Ard Schrijvers <[EMAIL PROTECTED]> wrote:
> > >
> > > Hello,
> > >
> > > > I'm wondering if after opening the index I can retrieve the Tokens
> > > > (not the terms) of a document, something akin to
> > > > IndexReader.Document(n).getTokenizer().
> > >
> > > It is obviously not possible to get the original tokens of the
> > > document back when you haven't stored the document, because:
> > >
> > > 1) the analyzer might have removed stop words in the first place
> > > 2) the terms in lucene index are perhaps stemmed words / synonyms /
> > >    etc etc
> > > 3) how would you expect things like spaces, commas, dots etc to be
> > >    restored?
> > >
> > > And, I think what you want does not comply with an inverted index.
> > > When you do not store the document, you always lose information
> > > about the document during indexing/analyzing
> > >
> > > How many documents are you talking about? They must be either
> > > somewhere on FS or accessible over http...when you need the document,
> > > why not just provide a link to the original location?
> > >
> > > Regards Ard
> > >
> > > >
> > > > In summary:
> > > >
> > > > My current (too wasteful) implementation is this:
> > > >
> > > > StandardTokenizer(BufferedReader(
> > > >     IndexReader.Document(n).getField("text")))
> > > >
> > > > I'm wondering if Lucene has a more efficient way to retrieve the
> > > > tokens of a document from an index, because it seems like it has
> > > > information about every "term" already, since you can retrieve a
> > > > TermPositions object.
> > > >
> > > > Thanks,
> > > >
> > > > --JP
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]