Tips:

1) Don't send to 3 mailing lists when 1 will do; please
continue this conversation on java-user only.

2) Most "suggest" tools work off an index of previous
searches (not documents). Do you have a large set of
searches? If not, making sensible suggestions based on
document content can be much more compute-intensive.
My assumption here is that you are having to work with
document content.

3) You don't need to go to the expense of running a
query and ranking and scoring documents. Look at the
lower-level terms() and termDocs() APIs and use them to
find the matching terms.
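As a rough sketch of the idea (not actual Lucene code: a TreeSet stands in here for Lucene's sorted term dictionary, and the class/method names are made up for illustration), finding the terms that match a typed prefix is just a seek followed by a scan:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixMatch {
    // Lucene's TermEnum walks the term dictionary in sorted order, so
    // prefix completion amounts to "seek to the prefix, then scan while
    // terms still match". A TreeSet models that sorted dictionary.
    static List<String> matchingTerms(TreeSet<String> termDict, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : termDict.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can match the prefix
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> dict =
            new TreeSet<>(List.of("java", "lucene", "lucid", "lunar"));
        System.out.println(matchingTerms(dict, "lu"));
    }
}
```

No query execution, scoring, or ranking is involved; the cost is one dictionary seek plus a scan over the matching range.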

4) Word suggestions ideally shouldn't be independent
of each other: look at completed words in the query
string and use them to inform the selection of
suggestions for the incomplete term being typed. The
termDocs()/termPositions() APIs give you all the data
you need to establish which docs/positions exist for
completed terms, and these can be cross-referenced with
the list of docs/positions for the "alternative" terms
under consideration. High proximity between
completed-term occurrences and a suggested term's
occurrences makes a strong candidate. A fast way to do
proximity tests might be to compare sorted arrays of
numbers, where each number represents a term
occurrence, using a function like:
  termspaceNumber = (docNumber * maxNumTermsPerDoc) + termPositionInDoc

You could then compare long[] completedTermOccurrences
with long[] suggestedAlternativeTermOccurrences, looking
for matches where the numbers differ by 1 or 2.
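A minimal sketch of that comparison (class name and the maxNumTermsPerDoc bound are assumptions, not from any Lucene API): encode each occurrence with the formula above, sort both arrays, and count near-misses in a single merge pass.

```java
public class ProximityMatch {
    // Assumed upper bound on term positions per document; any value larger
    // than the longest document works for the encoding below.
    static final long MAX_TERMS_PER_DOC = 10_000;

    // Pack (doc, position) into one long so proximity in the same doc
    // shows up as a small numeric difference.
    static long encode(int doc, int position) {
        return doc * MAX_TERMS_PER_DOC + position;
    }

    // Count occurrences of the candidate term that fall 1 or 2 positions
    // after a completed-term occurrence. Both arrays must be sorted
    // ascending; one merge-style pass keeps this O(n + m).
    static int proximityHits(long[] completed, long[] candidate) {
        int hits = 0, j = 0;
        for (long c : completed) {
            while (j < candidate.length && candidate[j] <= c) {
                j++; // skip candidates at or before this occurrence
            }
            int k = j;
            while (k < candidate.length && candidate[k] - c <= 2) {
                hits++; // within 1-2 positions: strong co-occurrence signal
                k++;
            }
        }
        return hits;
    }
}
```

Candidates could then be ranked by their hit counts against the already-completed query terms.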

A faster (rougher) comparison, which ignores word
proximity, would be simply to compare bitsets of doc
ids, looking for a high level of overlap
(intersection/union).
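With java.util.BitSet that overlap score is a few lines (a sketch; the class name is made up):

```java
import java.util.BitSet;

public class DocOverlap {
    // Rough co-occurrence score: Jaccard overlap (intersection / union)
    // of the doc-id sets for a completed term and a candidate suggestion.
    static double overlap(BitSet completedDocs, BitSet candidateDocs) {
        BitSet inter = (BitSet) completedDocs.clone();
        inter.and(candidateDocs);   // docs containing both terms
        BitSet union = (BitSet) completedDocs.clone();
        union.or(candidateDocs);    // docs containing either term
        return union.isEmpty() ? 0.0
                : (double) inter.cardinality() / union.cardinality();
    }
}
```

BitSet.and()/or() work a word (64 doc ids) at a time, which is what makes this the cheap option.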

You can use TermEnum.docFreq() to quickly rule out
very rare words from your calculations.

Cheers,
Mark

