On Nov 26, 2007, at 6:34 PM, Eswar K wrote:

> Although the algorithm doesn't understand anything about what the
> words *mean*, the patterns it notices can make it seem astonishingly
> intelligent.
>
> When you search such an index, the search engine looks at similarity
> values it has calculated for every content word, and returns the
> documents that it thinks best fit the query. Because two documents may
> be semantically very close even if they do not share a particular
> keyword, this algorithm will often return relevant documents that don't
> contain the keyword at all, where a plain keyword search would fail
> for lack of an exact match.
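For reference, the retrieval step described above (rank documents by similarity to the query) can be sketched with plain term vectors and cosine similarity. This is a toy in pure Python, with made-up helper names, and it deliberately shows the limitation: a raw term-vector match still scores zero when no keyword overlaps, which is the gap LSA's reduction step is meant to close.

```python
import math
from collections import Counter

def term_vector(text):
    # Naive bag-of-words vector; a real engine would also stem,
    # stop-list, and weight terms (e.g. tf-idf).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "stock prices fell sharply",
]
query = term_vector("cat mat")
# Best keyword match ranks first; the unrelated doc scores 0.0 --
# and so would a relevant doc that used only synonyms.
ranked = sorted(docs, key=lambda d: cosine(term_vector(d), query),
                reverse=True)
```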

Perhaps I should have been less curt. I've read a few papers on LSA, so I'm familiar at least in passing with everything you describe above. It would be entertaining to write an implementation, and I've considered it... but it's a low priority while the patent's in force.

A full term-vector space calculation is... expensive :) ... so LSA performs a dimensionality reduction (truncated SVD). Tuning the algorithm for a threshold effect against not an exact "n words in common" but only a rough approximation of it is presumably non-trivial.
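To make the reduction step concrete: the heart of it is finding the strongest "concept" axes of the term-document matrix. A hedged, toy sketch in pure Python (not any real implementation, and only the single top axis via power iteration rather than a full SVD):

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in A]

def dominant_direction(A, iters=100):
    """Power iteration on A^T A: approximates the top right singular
    vector of A -- the single strongest concept axis, in LSA terms.
    (Deterministic start vector; assumes it has a nonzero component
    along the answer, which holds for this toy matrix.)"""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy term-document matrix: rows are documents, columns are the terms
# ("car", "automobile", "engine"). Docs 0 and 1 use different synonyms,
# but both co-occur with "engine", so they land at the same spot on the
# dominant concept axis.
A = [
    [1, 0, 1],   # "car engine"
    [0, 1, 1],   # "automobile engine"
    [1, 1, 2],   # mentions all three terms
]
v = dominant_direction(A)
projections = [sum(row[j] * v[j] for j in range(3)) for row in A]
```

In the reduced space, similarity is measured between these projections rather than raw term vectors, which is how co-occurrence patterns can match documents that share no literal keyword.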

If you can either find or write open source software that pulls off such "astonishingly intelligent" matches despite the many challenges, kudos. I'd love to see it.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
