Re: Dmitry's Term Vector stuff, plus some

Grant Ingersoll Tue, 24 Feb 2004 14:25:04 -0800

It is the position of the token in the document (see IndexReader.termPositions()).  
This information is already stored in other parts of the index; it just isn't 
very efficient to get at.

I think it would be useful to add to IndexReader a way to get the list of positions 
for a given term and document, so that we wouldn't have to store this info twice.  
Something like: 

TermPositions termPositions(Term term, Document doc);

which would return a subset of IndexReader.termPositions(Term term) containing only 
those Positions that are in the Document.  This would need to be implemented in an 
efficient manner, not just the brute force method of looping over termPositions(Term 
term).  I don't know how easy this would be to do, as I am not familiar with the file 
structure of the Position information.
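
To make the difference between the two approaches concrete, here is a rough sketch 
against a toy in-memory model, not the real index files. Everything in it is 
hypothetical: the class TermPositionsSketch, its two methods, and the use of an int 
document number where the signature above takes a Document. It also assumes the 
postings for a term are kept sorted by document number.

```java
import java.util.Arrays;

// Hypothetical in-memory model of one term's positional postings.
// This is NOT Lucene's file format or API, only an illustration.
class TermPositionsSketch {
    private final int[] docIds;      // sorted ascending, one entry per matching document
    private final int[][] positions; // positions[i] = token positions of the term in docIds[i]

    TermPositionsSketch(int[] docIds, int[][] positions) {
        this.docIds = docIds;
        this.positions = positions;
    }

    // Brute force: loop over every posting for the term until the document turns up.
    int[] positionsByScan(int docId) {
        for (int i = 0; i < docIds.length; i++) {
            if (docIds[i] == docId) {
                return positions[i];
            }
        }
        return new int[0];
    }

    // Targeted lookup: binary search over the sorted document numbers.
    int[] positionsBySeek(int docId) {
        int i = Arrays.binarySearch(docIds, docId);
        return i >= 0 ? positions[i] : new int[0];
    }

    public static void main(String[] args) {
        TermPositionsSketch postings = new TermPositionsSketch(
                new int[] {2, 5, 9},
                new int[][] {{0, 7}, {3}, {1, 4, 8}});
        System.out.println(Arrays.toString(postings.positionsBySeek(9))); // prints [1, 4, 8]
    }
}
```

Whether the real position data can be searched like this depends on how it is laid 
out on disk, which is exactly the part I am unsure about.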

At least that is my understanding of it, perhaps others have more insight.

-Grant

>>> [EMAIL PROTECTED] 02/24/04 04:20PM >>>
Doug Cutting wrote:

> Grant Ingersoll wrote:
> 
>> Do you see any reason to write position information at all for the 
>> term vectors?
> 
> 
> It could be useful to some folks.  If, for example, you only want to 
> expand a query with terms that occur near query terms, like automatic 
> phrase identification.  In general, the vector stuff is just a constant 
> factor improvement over re-tokenizing the text of the document, but 
> hopefully a substantial one.  If folks are doing computations which 
> require positional information, but don't require the actual text (e.g., 
> they don't need user-readable fragments) then positions could be handy.
> 
> But, certainly, most applications for term vectors do not need 
> positions, and I would not be upset if these were left out of the first 
> version.

Forgive me for being thick, but what position information are we talking about here? 
The start and end position of the token in the source text that the term came from? 
If so, I think it would be useful to have them in at some point, as I believe they 
could be used to optimize the query highlighting code that Mark Harwood contributed, 
so that it does not have to reanalyze the text every time one wants to generate a 
highlighted search summary.
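
As a rough sketch of what I mean, assuming the start and end character offsets of 
each matching token were available from the index: the class OffsetHighlightSketch 
and its highlight method below are hypothetical, not part of the contributed 
highlighter, and the spans are assumed to be sorted and non-overlapping.

```java
// Hypothetical illustration: highlight matches using stored character offsets,
// without running the text through an analyzer again.
class OffsetHighlightSketch {

    // spans: sorted, non-overlapping {start, end} pairs, end exclusive.
    static String highlight(String text, int[][] spans) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] span : spans) {
            out.append(text, cursor, span[0]);      // unhighlighted text before the match
            out.append("<B>");
            out.append(text, span[0], span[1]);     // the matched token itself
            out.append("</B>");
            cursor = span[1];
        }
        out.append(text, cursor, text.length());    // remaining tail of the text
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "term vectors store terms";
        int[][] spans = {{0, 4}, {19, 23}};
        System.out.println(highlight(text, spans));
        // prints <B>term</B> vectors store <B>term</B>s
    }
}
```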

Regards,

Bruce Ritchie
http://www.jivesoftware.com/
