Re: Dmitry's Term Vector stuff, plus some

Doug Cutting Wed, 25 Feb 2004 10:04:38 -0800

[EMAIL PROTECTED] wrote:

I'm not sure what applications people have in mind for Term Vector support but I 
would prefer to have the original text positions (not term sequence positions) stored 
so I can offer this:
1) Significant terms/phrases identification
Like "Gigabits" on gigablast.com - used to offer choices of (unstemmed) "significant" 
terms and phrases for query expansion to the end user.

I would think that this could be done more easily with sequence positions than with character positions: if you're searching for phrases, what you're trying to find are terms which are adjacent. And most web search engines index unstemmed words. Even if you only indexed stemmed forms, you'd still need to lowercase and otherwise normalize the text before extracting words for comparison.
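
To illustrate the point about sequence positions, here is a minimal sketch in plain Python. It does not use Lucene's API; the names term_positions and adjacent_pairs are hypothetical, and the tokenizer (lowercase, split on non-word characters) is an assumption made only for the example. With sequence positions, two terms form a phrase exactly when their positions differ by one, which would not hold for character offsets.

```python
import re
from collections import defaultdict


def term_positions(text):
    """Map each lowercased term to its sequence positions (0, 1, 2, ...)."""
    positions = defaultdict(list)
    # Assumed tokenizer: lowercase, then take runs of word characters.
    for pos, term in enumerate(re.findall(r"\w+", text.lower())):
        positions[term].append(pos)
    return dict(positions)


def adjacent_pairs(positions):
    """Count two-term phrases: b follows a when pos(b) == pos(a) + 1."""
    # Invert the vector so each sequence position points at its term.
    term_at = {pos: term for term, plist in positions.items() for pos in plist}
    counts = defaultdict(int)
    for pos, term in term_at.items():
        following = term_at.get(pos + 1)
        if following is not None:
            counts[(term, following)] += 1
    return dict(counts)


vector = term_positions("Term vectors store term positions. Term vectors help.")
phrases = adjacent_pairs(vector)
```

Pairs that occur more than once, such as ("term", "vectors") here, are candidates for "significant" phrases. The sketch ignores sentence boundaries and stop words, which a real implementation would likely need to handle.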

2) Optimised Highlighting
No more re-tokenizing of text to find unstemmed forms.

Is this really a performance bottleneck? Have you benchmarked it?

Doug

