It Depends (tm).

No, limiting the background threads won’t help much. Here’s the issue:
At time T, the segments file contains the current “snapshot” of the index, i.e. 
the names of all the segments that have been committed.

At time T+N, another commit happens. Or, consider an optimize, which in 6.x 
defaults to merging into a single segment. During any merge, _all_ the new 
segments are written before _any_ old segment is deleted. The very last 
operation is to rewrite the segments file, but only after all the new segments 
are flushed.
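
To put rough numbers on that, here’s a back-of-the-envelope sketch in Java. 
The 70GB index size and ~15% deleted docs come from your message; treat it as 
illustrative, not a guarantee:

// Back-of-the-envelope sketch of peak transient disk usage during an
// optimize (forced merge down to one segment). Sizes come from the
// question below; they're examples, not measurements.
public class OptimizePeakDisk {
    public static void main(String[] args) {
        double indexGb = 70.0;          // current on-disk index size per shard
        double deletedFraction = 0.15;  // ~15% deleted docs

        // The merged copy holds roughly the live documents only, since
        // deleted docs are purged as segments are rewritten.
        double mergedCopyGb = indexGb * (1.0 - deletedFraction);

        // All new segments exist on disk *before* any old segment is
        // removed, so both copies briefly coexist.
        double peakGb = indexGb + mergedCopyGb;

        System.out.printf("peak transient usage: ~%.1f GB (vs. %.1f GB steady state)%n",
                peakGb, indexGb);
    }
}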

After this point, all the old, no-longer-used segments will be deleted the 
next time a searcher is opened; opening a new searcher is the trigger, so 
until then both the old and new copies occupy disk.
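
For example, with SolrJ a hard commit that waits for the new searcher looks 
something like the sketch below; the URL and collection name are placeholders:

import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Sketch: issue a hard commit and wait for the new searcher to register.
// Opening the new searcher is what lets the old segments be deleted.
public class OpenSearcherExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            // commit(collection, waitFlush, waitSearcher): waitSearcher=true
            // blocks until the new searcher is in place.
            client.commit("collection1", true, true);
        }
    }
}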

To make matters more interesting, say new documents are indexed during the 
merge. Those go into new segments that aren’t counted in the totals above. 
Plus you have transaction logs being written, which are usually pretty small 
but can grow between commits.
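
If you want a feel for the tlog growth, here’s a rough estimate using the 
rates from your message; the average document size is a pure assumption:

// Rough estimate of transaction-log growth between hard commits (the
// tlog rolls over on hard commit; older logs are eventually cleaned up).
public class TlogGrowthEstimate {
    public static void main(String[] args) {
        double docsPerHour = 200.0;       // indexing rate per shard (from the question)
        double commitIntervalMin = 5.0;   // hard commit interval (from the question)
        double avgDocKb = 5.0;            // ASSUMPTION: average doc size in KB

        double docsBetweenCommits = docsPerHour * (commitIntervalMin / 60.0);
        double tlogKb = docsBetweenCommits * avgDocKb;
        System.out.printf("~%.0f docs, ~%.0f KB of tlog between hard commits%n",
                docsBetweenCommits, tlogKb);
    }
}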

I’ve used optimize as the example, but it’s at least theoretically possible 
that all the current segments get rewritten into a larger segment as part of a 
normal merge. This is frankly not very likely with large indexes (say > 20GB), 
but still possible.

Now, all that said, on a disk that’s hosting multiple replicas from multiple 
shards and/or multiple collections, the likelihood of all this happening at 
once (barring someone issuing an optimize for all the collections hosted on 
the machine) is very low. But what you’re risking is an unknown. Lucene/Solr 
try very hard to prevent bad things happening in a “disk full” situation, but 
given the number of possible code paths that could be affected, benign 
outcomes can’t be guaranteed.

So perhaps you can run forever with, say, 25% of the aggregate index size 
free. Or perhaps you’ll blow up unexpectedly; there’s really no way to say 
ahead of time.
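
If it helps to reason about it, here’s an illustrative check contrasting the 
two scenarios above (one core rewriting itself at a time vs. all of them at 
once); the core sizes are made-up placeholders:

// Illustrative headroom check for a disk hosting several cores.
public class DiskHeadroomCheck {
    public static void main(String[] args) {
        double[] coreSizesGb = {70.0, 70.0, 70.0}; // ASSUMPTION: three 70GB replicas
        double freeGb = 55.0;                      // roughly 25% of the 210GB aggregate

        double aggregate = 0.0, largest = 0.0;
        for (double s : coreSizesGb) {
            aggregate += s;
            largest = Math.max(largest, s);
        }

        // Typical: one core fully rewritten at a time -> need ~largest core free.
        System.out.println("single-core rewrite ok: " + (freeGb >= largest));
        // Worst case: everything rewritten at once -> need ~aggregate free.
        System.out.println("all-cores rewrite ok:   " + (freeGb >= aggregate));
    }
}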

Best,
Erick

> On May 18, 2019, at 8:36 AM, tom_s <tom.sm...@gmail.com> wrote:
> 
> Hey, 
> I’m aware that the best practice is to have disk space on your Solr servers 
> be 2x the size of the index, but my goal is to minimize this overhead and 
> have my index occupy more than 50% of disk space. In our index, documents 
> have a TTL, so documents are deleted every day, which causes background 
> merging of segments. Can I change the merge policy to make the overhead of 
> background merging lower? 
> Will limiting the number of concurrent merges help (with the maxMergeCount 
> parameter)? Do you know of other methods that will help? 
> 
> Info about my server: 
> I use Solr 6.5.1. I index 200 docs per hour for each shard. I hard commit 
> every 5 minutes. The size of the index in each shard is around 70GB (with 
> around 15% deletions). 
> I use the following merge policy:
> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>  <int name="maxMergeAtOnce">2</int>
>  <int name="segmentsPerTier">4</int>
> </mergePolicyFactory>
> (The rest of the params are default.) 
> 
> Thanks
