On 7/22/2011 9:32 AM, Pierre GOSSE wrote:
Does merging not happen often enough to keep deleted documents to a low enough
count?
Maybe there's a need to have "partial" optimization available in Solr, meaning
that segments with too many deleted documents could be copied to a new file without
the unnecessary data. That way, cleaning out deleted data could be compatible with
light replications.
I'm worried by this idea of deleted documents influencing relevance scores. Any
pointers to how important this influence may be?
I've got a pretty high mergeFactor, for fast indexing. Also, I want to
know for sure and control when merges happen, so I am not leaving it up
to Lucene/Solr.
Right now, the largest number of deleted documents on any shard is
45347. The shard (17.65GB) contains 9663271 documents, in six
segments. That will be one HUGE segment (from the last optimize) and
five very very tiny segments, each with only a few thousand documents in
them. Tonight when the document distribution process runs, that index
will be optimized again. Tomorrow night a different shard will be
optimized.
Deleted documents can (and do) happen anywhere in the index, so even if
I had a lot of largish segments rather than one huge segment, it's very
likely that just expunging deletes would still result in the entire
index being merged, so I am not losing anything by doing a full
optimize, and I am gaining a small bit of performance.
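To put a rough number on "very likely", here is a small sketch. It assumes
deletes are spread uniformly and independently across the index, which is a
simplification, and the segment sizes are hypothetical, not taken from my
shards. It estimates the chance that a segment of a given size contains no
deleted documents at all and so could be left alone by an expunge.

```python
import math

def prob_segment_untouched(segment_docs: int, delete_fraction: float) -> float:
    """Chance that a segment holds zero deleted docs.

    Assumption: every document is deleted independently with probability
    delete_fraction, so the result is (1 - delete_fraction) ** segment_docs.
    log1p keeps the computation stable for small fractions.
    """
    return math.exp(segment_docs * math.log1p(-delete_fraction))

# Delete fraction from the shard described above.
delete_fraction = 45347 / 9663271

# Hypothetical segment sizes, for illustration only.
for docs in (1_000, 100_000, 1_000_000):
    print(docs, prob_segment_untouched(docs, delete_fraction))
```

Under that assumption, any segment large enough to matter almost certainly
contains at least one delete, so it would be rewritten either way.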
The 45000 deletes mentioned above represent less than half a percent of
the shard (about 0.47%), so the influence on relevance is *probably* not large ... but
that's not something I can say definitively. I think it all depends on
what people are searching for and how common the terms in the deleted
documents are.
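Here is a rough sketch of why it depends on the terms. It assumes the classic
idf formula, 1 + ln(numDocs / (docFreq + 1)), and that both counts still
include deleted documents until they are merged away. It also assumes the
9663271 figure counts live documents only. The term frequencies are made up
for illustration.

```python
import math

LIVE_DOCS = 9663271      # assumed to be live documents only
DELETED_DOCS = 45347     # deleted but not yet merged away

def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic idf as assumed here: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def idf_before_and_after(live_df: int, deleted_df: int) -> tuple[float, float]:
    """Return (idf with deletes still counted, idf after an optimize)."""
    before = classic_idf(live_df + deleted_df, LIVE_DOCS + DELETED_DOCS)
    after = classic_idf(live_df, LIVE_DOCS)
    return before, after

# Hypothetical term spread evenly: deleted docs contain it at the same rate.
even = idf_before_and_after(live_df=10_000, deleted_df=47)

# Hypothetical worst case: rare among live docs, present in every deleted doc.
skewed = idf_before_and_after(live_df=100, deleted_df=DELETED_DOCS)

print("even:  ", even)
print("skewed:", skewed)
```

When the deleted documents look like the rest of the index, the two idf
values are nearly identical. When a term is concentrated in the deleted
documents, the idf can shift a lot until the deletes are merged out.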
Thanks,
Shawn