I am sensing a common theme throughout a variety of threads here: Namely, a need for a pluggable set of Reader's and Writers (think Interface) that can write metadata about an Index/Document/Field/Term (which I see the TermVector stuff as being a instance of) and can be given to Lucene from the application level (or at least the application specifies which ones to use)
I proposed something like this a bit earlier, but didn't see any interest. I suppose I should implement it as this is how things get going, but would be nice to have some input on requirements and whether the people who know Lucene better than I think this is possible. Just my two cents on this one. Doesn't help you w/ an immediate solution, but I think it would help us all in the long run. If this existed, one could easily implement a Token position store and ask it for all of this information, I think. :-) -Grant >>> [EMAIL PROTECTED] 07/22/04 03:19PM >>> > I wonder if the information in termPositions or termVector can be used > to restore token position from indicies? TermFreqVector gives you term frequencies (not positions). This can be of use in computing document similarities. TermPositions gives you the sequence number . eg in the last sentence the word "sequence" was token number 5, (not character position 5). This is used for PhraseQueries to determine proximity. Character position is what is required to do highlighting and this isnt stored anywhere currently. The requirements for such a store would be indexed access by doc number, and a compact means of storing term/character position info. This could add considerable size to the index. Previously we concluded that highlighting is only typically done on the first 10 or so records in a result set anyway and that re-analyzing the text shouldnt add too much of an overhead. If you want to limit the size of an individual document's text to be tokenized use highlighter.setMaxDocBytesToAnalyze(). If you find tokenizing slow check you arent using StandardAnalyzer - I have found that to be slow (see http://marc.theaimsgroup.com/?l=lucene-dev&m=108080820315779&w=2 ) Cheers Mark --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]