Doug Cutting wrote:
I'm not sure what applications people have in mind for Term Vector support, but I would prefer to have the original text positions (not term sequence positions) stored so I can offer the following:
1) Significant terms/phrases identification
Like "Gigabits" on gigablast.com - used to offer choices of (unstemmed) "significant" terms and phrases for query expansion to the end user.


I would think that this could be done more easily with sequence positions than with character positions: if you're searching for phrases, what you're trying to find are terms which are adjacent. And most web search engines index unstemmed words. Even if you only indexed stemmed forms, you'd still need to lowercase and otherwise normalize the text before extracting words for comparison.
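
A minimal, self-contained sketch of that idea, assuming we already have each term with its sequence position and a set of "significant" terms to look for; the Occurrence record and candidatePhrases helper are hypothetical names for illustration only, not Lucene API:

import java.util.*;

class PhraseSketch {
    // A term occurrence: the (possibly stemmed) term and its sequence position.
    record Occurrence(String term, int position) {}

    // Collect runs of "significant" terms whose sequence positions are adjacent.
    static List<String> candidatePhrases(List<Occurrence> occurrences,
                                         Set<String> significantTerms) {
        List<Occurrence> sorted = new ArrayList<>(occurrences);
        sorted.sort(Comparator.comparingInt(Occurrence::position));

        List<String> phrases = new ArrayList<>();
        List<String> run = new ArrayList<>();
        int lastPos = Integer.MIN_VALUE;
        for (Occurrence o : sorted) {
            boolean adjacent = o.position() == lastPos + 1;
            if (significantTerms.contains(o.term()) && (run.isEmpty() || adjacent)) {
                run.add(o.term());                       // extend the current run
            } else {
                if (run.size() > 1) phrases.add(String.join(" ", run));
                run.clear();
                if (significantTerms.contains(o.term())) run.add(o.term());
            }
            lastPos = o.position();
        }
        if (run.size() > 1) phrases.add(String.join(" ", run));
        return phrases;
    }

    public static void main(String[] args) {
        List<Occurrence> occs = List.of(
            new Occurrence("open", 4), new Occurrence("source", 5),
            new Occurrence("search", 9), new Occurrence("engine", 10));
        System.out.println(candidatePhrases(occs,
            Set.of("open", "source", "search", "engine")));
        // prints: [open source, search engine]
    }
}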

2) Optimised Highlighting
No more re-tokenizing of text to find unstemmed forms.


Is this really a performance bottleneck? Have you benchmarked it?

I believe so. I have a customer who discovered that searching failed under heavy load whenever the 'smart' version of highlighting (Mark's code) was used, but worked fine once that feature was turned off. My own tests show that it sometimes took over 800ms to highlight certain large documents (~200k+), which I believe is mostly attributable to the time it takes to retokenize a document of that size. Having access to the original token offsets at runtime would allow me to skip tokenization entirely and vastly improve the performance of the highlighting code.
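
A rough sketch of what skipping re-tokenization could look like, assuming the index can hand back the stored character offsets for the matched terms in a document; the TermOffset record and highlight helper are hypothetical names for illustration only, not the actual highlighter code:

import java.util.*;

class OffsetHighlightSketch {
    // One stored character span for a matched term in the original text.
    record TermOffset(int start, int end) {}

    // Wrap each matched span in <b>...</b> without tokenizing the text again.
    static String highlight(String originalText, List<TermOffset> matches) {
        List<TermOffset> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingInt(TermOffset::start));

        StringBuilder out = new StringBuilder(originalText.length() + 16 * sorted.size());
        int cursor = 0;
        for (TermOffset m : sorted) {
            if (m.start() < cursor) continue;            // skip overlapping spans
            out.append(originalText, cursor, m.start())
               .append("<b>")
               .append(originalText, m.start(), m.end())
               .append("</b>");
            cursor = m.end();
        }
        out.append(originalText, cursor, originalText.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Searching large documents can be slow to highlight.";
        // Offsets as they would come back from the stored term vector.
        List<TermOffset> hits = List.of(new TermOffset(0, 9), new TermOffset(41, 50));
        System.out.println(highlight(text, hits));
        // prints: <b>Searching</b> large documents can be slow to <b>highlight</b>.
    }
}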


Regards,

Bruce Ritchie
http://www.jivesoftware.com/



