Re: Proposal: extracting term-level stats from query process

markharw00d Thu, 11 Mar 2004 13:23:09 -0800

Thanks for the response, Doug.

My working assumption was that whatever analysis was done in evaluating the
query would be costly to repeat, but from your breakdown of what is actually
required it looks like all of my requirements can be met with calls to
IndexReader#docFreq(term), which I would expect to be very quick.
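
To make that concrete, here is a minimal sketch of turning document
frequencies into per-term idf weights. It is an illustration only: the class
and method names are hypothetical, and a plain Map stands in for the index so
the example runs without Lucene. In real code the counts would come from
IndexReader#docFreq(term) and IndexReader#numDocs(). The formula is the one
used by Lucene's default similarity, log(numDocs / (docFreq + 1)) + 1.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene or the existing highlighter.
public class IdfSketch {

    // idf as computed by Lucene's default similarity:
    // log(numDocs / (docFreq + 1)) + 1
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // docFreqs stands in for the index (assumption): in Lucene each lookup
    // would be a call to IndexReader#docFreq(term).
    static Map<String, Double> idfWeights(Collection<String> queryTerms,
                                          Map<String, Integer> docFreqs,
                                          int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : queryTerms) {
            // A term absent from the index has a document frequency of 0.
            int df = docFreqs.getOrDefault(term, 0);
            weights.put(term, idf(df, numDocs));
        }
        return weights;
    }
}
```

Rare terms receive larger weights than common ones, which is the property
needed for weighting query terms during highlighting.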


As for your suggestion on selecting "best fragments" using RAMDirectory
instances: for the purposes of highlighting, the RAM indexing code and the
highlighting code (marking up the original text) would need to share the
results of a single tokenization pass if the approach is to perform well.
Before considering what is involved in coding this, I did some benchmarking
to compare processing times for different operations on the same set of
16 KB docs using the same (stemming) analyzer:
- Tokenization: 86 ms  (avg time taken to simply tokenize the doc)
- Highlighting: 90 ms  (avg time taken to parse query terms, tokenize,
  highlight query terms and select best fragments using the current impl)
- RAM indexing: 118 ms (avg time taken to tokenize and index docs only)

As you can see, the RAM indexing approach to highlighting incurs a noticeable
overhead in its first step alone (118 ms, against 90 ms for the whole of the
current highlighter), before I even add the steps to fragment docs, query and
highlight, so I'm not sure this approach is worth pursuing. I am tempted to
just add some idf weighting to the current highlighter's fragment selection
logic.
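
A rough sketch of what idf-weighted fragment selection could look like
follows. This is not the current highlighter's code; the class and method
names are hypothetical, fragments are assumed to be already tokenized, and
termWeights is assumed to hold one idf weight per query term. Each fragment
scores the sum of the weights of the distinct query terms it contains, so a
fragment with one rare term can beat a fragment that repeats a common one.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the existing highlighter implementation.
public class FragmentScorerSketch {

    // Sum the weights of the distinct query terms found in the fragment.
    // Repeated occurrences of the same term are counted once.
    static double score(List<String> fragmentTokens,
                        Map<String, Double> termWeights) {
        Set<String> seen = new HashSet<>();
        double total = 0.0;
        for (String token : fragmentTokens) {
            Double weight = termWeights.get(token);
            if (weight != null && seen.add(token)) {
                total += weight;
            }
        }
        return total;
    }

    // Return the index of the highest-scoring fragment, or -1 if the list
    // is empty. Ties go to the earliest fragment.
    static int bestFragment(List<List<String>> fragments,
                            Map<String, Double> termWeights) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < fragments.size(); i++) {
            double s = score(fragments.get(i), termWeights);
            if (s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        return best;
    }
}
```

Counting each term once per fragment is a design choice made for this
sketch; weighting by term frequency within the fragment would be an equally
plausible variant.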

Cheers
Mark
