On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Robert,
>
> I've been thinking about this since you suggested it on another thread. One
> problem is that it would also remove real words. Apparently 40-60% of the
> words in large corpora occur only once
> (http://en.wikipedia.org/wiki/Hapax_legomenon).
>
You are correct. I really hate recommending that you 'remove data', but as
an intermediate step, this could be a brutally simple approach to move you
along.

> I guess the question I need to determine is whether the benefit of removing
> words that occur only once outweighs the costs in terms of the two use cases
> outlined above. When we get our new test server set up, sometime in the
> next month, I think I will go ahead and prune a test index of 500K docs and
> do some performance testing just to get an idea of the potential performance
> gains of pruning the index.

Well, one thing I did with Andrzej's patch was to immediately relevance-test
this approach against some corpora I had. The results are on the JIRA issue,
and the test collection itself is in openrelevance.

In my opinion the p...@n is probably overstated, and the MAP values are
probably understated (because it is a pooled relevance collection), but I
think it's fair to say that, for that specific large text collection, pruning
terms that appear only once in a document does not hurt relevance.

At the same time, I will not dispute that it could actually help p...@n; I am
just saying I'm not sold :)

Either way it's extremely interesting: cut your index size in half, and get
the same relevance!

>
> I have some other questions about index pruning, but I want to do a bit more
> reading and then I'll post a question to either the Solr or Lucene list. Can
> you suggest which list I should post an index pruning question to?
>

I would recommend posting it to the JIRA issue:
http://issues.apache.org/jira/browse/LUCENE-1812

This way someone who knows more (Andrzej) could see it, too.

--
Robert Muir
rcm...@gmail.com