On 7/22/2011 9:32 AM, Pierre GOSSE wrote:
Does merging not happen often enough to keep the number of deleted documents 
low enough?

Maybe there's a need to have "partial" optimization available in Solr, meaning 
that segments with too many deleted documents could be copied to a new file without 
the unnecessary data. That way, cleaning out deleted data could be compatible with 
lightweight replication.

I'm worried by this idea of deleted documents influencing relevance scores. Any 
pointers on how significant this influence may be?

I've got a pretty high mergeFactor, for fast indexing. Also, I want to know for sure and control when merges happen, so I am not leaving it up to Lucene/Solr.

Right now, the largest number of deleted documents on any shard is 45347. That shard (17.65GB) contains 9663271 documents in six segments: one HUGE segment (from the last optimize) and five very tiny segments, each with only a few thousand documents. Tonight when the document distribution process runs, that index will be optimized again. Tomorrow night a different shard will be optimized.

Deleted documents can (and do) happen anywhere in the index. Even if I had a lot of largish segments rather than one huge segment, it's very likely that just expunging deletes would still result in the entire index being merged. So I am not losing anything by doing a full optimize, and I am gaining a small bit of performance.
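
As a rough illustration of why scattered deletes tend to touch every segment, here is a back-of-the-envelope sketch in Python. It assumes (and this is only an assumption, not something measured on this index) that deletes land uniformly at random across equal-sized segments; the function name and the segment counts are hypothetical.

```python
def prob_segment_untouched(num_segments, num_deletes):
    """Chance that one particular segment receives none of the deletes.

    Assumption: segments are equal-sized and each delete independently
    lands in a uniformly random segment. Real indexes will differ.
    """
    return (1.0 - 1.0 / num_segments) ** num_deletes


# 45347 is the delete count quoted above; the segment counts are made up.
for segments in (6, 20, 100):
    p = prob_segment_untouched(segments, 45347)
    print(f"{segments:>3} equal segments: P(a given segment has no deletes) = {p:.3g}")
```

Under that assumption the chance of any segment escaping untouched is negligible, which matches the expectation that expunging deletes would rewrite essentially everything. If deletes cluster in particular segments, the picture could be quite different.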

The 45000 deletes mentioned above represent less than half a percent of the shard, so the influence on relevance is *probably* not large ... but that's not something I can say definitively. I think it all depends on what people are searching for and how common the terms in the deleted documents are.
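
To put rough numbers on that, the sketch below checks the deleted fraction and estimates how much an inverse-document-frequency value could move once deletes are purged. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)), assumes that both counts still include deleted documents until they are merged away, and treats the 9663271 figure as including the deleted documents. The term frequencies are invented purely for illustration.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def deleted_fraction(deleted_docs, total_docs):
    """Share of the shard's documents that are marked deleted."""
    return deleted_docs / total_docs


def idf_shift(live_df, deleted_df, total_docs, deleted_docs):
    """Change in IDF for one term once deleted documents are purged.

    Assumption: before the purge, deleted documents still count toward
    both the term's document frequency and the overall document count.
    """
    before = classic_idf(live_df + deleted_df, total_docs)
    after = classic_idf(live_df, total_docs - deleted_docs)
    return after - before


TOTAL_DOCS = 9_663_271   # shard size quoted above
DELETED_DOCS = 45_347    # deleted documents quoted above

print(f"deleted fraction: {deleted_fraction(DELETED_DOCS, TOTAL_DOCS):.3%}")

# Hypothetical terms: (docs containing the term that are live, that are deleted)
for live_df, deleted_df in [(100_000, 470), (50, 0), (50, 50)]:
    shift = idf_shift(live_df, deleted_df, TOTAL_DOCS, DELETED_DOCS)
    print(f"live_df={live_df:>7} deleted_df={deleted_df:>4} idf shift={shift:+.4f}")
```

Under these assumptions, a term spread evenly across live and deleted documents barely moves, while a rare term that happens to be concentrated in deleted documents can shift noticeably. That is consistent with the influence depending on what people search for.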

Thanks,
Shawn
