On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Robert,
>
> I've been thinking about this since you suggested it on another thread.  One 
> problem is that it would also remove real words. Apparently 40-60% of the 
> words in large corpora occur only once 
> (http://en.wikipedia.org/wiki/Hapax_legomenon).
>

You are correct. I really hate recommending you 'remove data', but at
the same time, as perhaps an intermediate step, this could be a
brutally simple approach to move you along.


> I guess the question I need to determine is whether the benefit of removing 
> words that occur only once outweighs the costs in terms of the two use cases 
> outlined above.   When we get our new test server set up, sometime in the 
> next month, I think I will go ahead and prune a test index of 500K docs and 
> do some performance testing just to get an idea of the potential performance 
> gains of pruning the index.

Well, one thing I did with Andrzej's patch is immediately
relevance-test this approach against some corpora I had. The results
are on the JIRA issue, and the test collection itself is in
openrelevance.

In my opinion the p...@n is probably overstated, and the MAP values are
probably understated (due to it being a pooled relevance collection),
but I think it's fair to say for that specific large text collection,
pruning terms that only appear in the document a single time does not
hurt relevance.
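
To make "pruning terms that only appear in the document a single time"
concrete, here is a minimal sketch on a toy in-memory inverted index. It is
not the code from the LUCENE-1812 patch and does not use any Lucene API; the
names build_index, prune_single_occurrences and posting_count are
hypothetical helpers invented for this illustration.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Toy inverted index: term -> {doc_id: term frequency in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return dict(index)


def prune_single_occurrences(index):
    """Drop every posting whose within-document frequency is 1.

    A term disappears entirely only if none of its postings survive.
    """
    pruned = {}
    for term, postings in index.items():
        kept = {doc_id: tf for doc_id, tf in postings.items() if tf > 1}
        if kept:
            pruned[term] = kept
    return pruned


def posting_count(index):
    """Total number of (term, doc) postings in the index."""
    return sum(len(postings) for postings in index.values())


# Made-up documents, purely for illustration.
docs = {
    1: "index pruning shrinks the index",
    2: "pruning pruning hurts recall for rare words",
}
index = build_index(docs)
pruned = prune_single_occurrences(index)
print(posting_count(index), "postings before,", posting_count(pruned), "after")
```

Note that this rule is per document: "pruning" stays searchable for
document 2 but no longer matches document 1. That is different from
removing words that occur once in the whole corpus (hapax legomena).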

At the same time I will not dispute that it could actually help p...@n; I
am just saying I'm not sold :)
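
For anyone repeating this kind of relevance test, the sketch below shows how
the two kinds of measure are usually computed, assuming the cutoff measure
above is the standard precision at n. The function names and the sample
ranking are hypothetical, and the sketch treats any unjudged document as
non-relevant, which is itself an assumption that matters for a pooled
collection.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top n results that are judged relevant."""
    top = ranked[:n]
    return sum(1 for doc_id in top if doc_id in relevant) / n


def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant doc appears.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP: mean of average precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Made-up ranking and judgments for a single query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

p_at_2 = precision_at_n(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
print(p_at_2, ap)
```

Running the same queries against the full and the pruned index and comparing
these numbers side by side is the basic shape of the comparison.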

Either way it's extremely interesting: cut your index size in half and
get the same relevance!

>
> I have some other questions about index pruning, but I want to do a bit more 
> reading and then I'll post a question to either the Solr or Lucene list.  Can 
> you suggest which list I should post an index pruning question to?
>

I would recommend posting it to the JIRA issue:
http://issues.apache.org/jira/browse/LUCENE-1812

This way someone who knows more (Andrzej) could see it, too.


-- 
Robert Muir
rcm...@gmail.com
