Excerpts from Andreas Korth's message of Fri Jan 19 12:26:20 -0800 2007:
> Could you elaborate on that, please? What exactly has changed since
> the 70's which isn't relevant any more and what is the TREC ad-hoc
> query paradigm anyway?

TREC is a competition that arguably drove most information retrieval
research for the past several decades. The ad-hoc task is one of the
tasks in the competition, and is essentially what we think of as
"search": given a fixed set of documents, take an arbitrary query and
produce a subset of documents that are considered "relevant". (Other
TREC tasks involve things like document clustering, or question
answering, or responding to a fixed query on a changing set of
documents.)

Almost all the ideas behind Ferret, Lucene, etc., come from the IR
research community, and were evaluated and found to be favorable in the
context of TREC. The "inverted" index, stop words, boosting, the twiddle
operator, etc., are all many decades old.

The problem is that the ad-hoc task is pretty different from, say, web
search, or email search in Sup. An ad-hoc query is essentially a
mini-document, with a separate title and several complete, grammatical
sentences describing the "information need" in somewhat formal English.
By contrast, in our case, the user is typically entering just a few
words, and is typically making explicit use of the mechanics of the
search (glorified word matching), and thus isn't entering a grammatical
English description of what he'd like to find.

Stop words make a lot of sense for the ad-hoc task because they
eliminate "content-free" words. But I think they don't make nearly as
much sense for the uses that you and I have for Ferret.

The other big difference, of course, is that disk space is much cheaper
now than when this stuff was developed.

> My understanding is that stop words reduce the size of the index (and
> hence speed up queries) by filtering out words that occur frequently
> in almost any text of considerable length. Isn't it even worse if you
> store term vectors?

True, and yes. The question is: by how much?

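As a rough illustration of how one might start measuring this, here is a
minimal Python sketch of a toy in-memory inverted index, built once with
every word and once with a stop list applied. It is not Ferret and does
not use Ferret's API: the names STOP_WORDS, tokenize, build_index and
index_stats, the short stop list, and the three sample documents are all
made up for the example. The stored positions only loosely stand in for
the extra per-term data that term vectors would add, and the counts are
a proxy for size, not a measurement of an on-disk index.

```python
import re
from collections import defaultdict

# Hypothetical stop list for illustration; real analyzers ship longer ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs, stop_words=frozenset()):
    """Map each term to {doc_id: [positions]}, skipping stop words."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(tokenize(text)):
            if term in stop_words:
                continue
            index[term].setdefault(doc_id, []).append(pos)
    return index


def index_stats(index):
    """Return (terms, postings, positions) as a rough size proxy."""
    postings = sum(len(by_doc) for by_doc in index.values())
    positions = sum(len(p) for by_doc in index.values() for p in by_doc.values())
    return len(index), postings, positions


# Made-up sample documents; a real test would use a real corpus.
docs = [
    "The cat sat in the hat",
    "A search of the index is fast",
    "The size of an index and the speed of a query",
]

full = index_stats(build_index(docs))
filtered = index_stats(build_index(docs, STOP_WORDS))
print("all words:     terms=%d postings=%d positions=%d" % full)
print("no stop words: terms=%d postings=%d positions=%d" % filtered)
```

On a corpus this small the ratio says nothing about real mail or web
text; the same three counts taken over an actual document set, plus
timings for a handful of typical queries, would be the interesting
numbers.
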
> I'd turn off stop words right away if there wasn't any considerable
> impact on performance, but I'd like to have a little more information
> on that. I'd appreciate if you could give some pointers.

Unfortunately all I have are opinions. :)

I'd be very interested in an empirical analysis of just how much bigger
the index gets when stop words are indexed rather than filtered out
(with and without term vectors), and just how much slower queries get.
I'm guessing that neither will be serious, but I could be wrong.

-- 
William
_______________________________________________
Ferret-talk mailing list
http://rubyforge.org/mailman/listinfo/ferret-talk

