On 9/2/2019 9:19 AM, Erick Erickson wrote:
Anyway, it occurred to me that once a max-sized segment is created, _if_ we 
write the segments_n file out with the current state of the index, we could 
freely delete the segments that were merged into the new one. With 300G indexes 
(which I see regularly in the field, even multiple ones per node that size), 
this could result in substantial disk savings.

<snip>

Off the top of my head, I can see some concerns:
1> we’d have to open new searchers every time we wrote the segments_n file to 
release file handles on the old segments

How would that interact with user applications that normally handle opening new searchers (such as Solr)? When users want there to be no new searchers until they issue an explicit commit, I think they're going to be a little irritated if Lucene decides to open a new searcher on its own. Maybe we'd need to advise people to turn off their indexing anytime they're doing a forceMerge/optimize. That's generally a good idea anyway, and pretty much required if deleteByQuery is being used.
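For context, lots of Solr setups configure hard commits specifically so that they do NOT open a searcher, and rely on an explicit commit (or soft commits) to make changes visible. A typical solrconfig.xml snippet (times are just illustrative):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- flush to disk regularly, but keep the current searcher -->
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```

Low-level Lucene code opening searchers on its own would effectively bypass that openSearcher=false contract.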

2> coordinating multiple merge threads

I would think the scheduler already handles that ... thinking about all this makes my brain hurt ... if I have to think about the scheduler too, there might be implosions. :)

3> maxMergeAtOnceExplicit could mean unnecessary thrashing/opening searchers 
(could this be deprecated?)

It has always bothered me that when I looked for info about changing the policy settings and raised the two "main" parts of the policy (maxMergeAtOnce and segmentsPerTier) to 35 instead of the default 10, the info I was finding never mentioned maxMergeAtOnceExplicit. I also needed to set that value (to 105) to have an optimize work like I expected. Without it, a lot more merging occurred than was necessary when I did an optimize. This was on a really old version of Solr, either 1.4.x or 3.2.x, back when the setting was relatively new.

The maxMergeAtOnceExplicit setting is not even mentioned in the Solr ref guide page about IndexConfig. I got the information for that setting from solr-user, when I asked why an optimize with values increased from 10 to 35 was doing more merge passes than I thought it needed. I think that parameter either needs to go away or its docs need improvement.
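For anyone who hits the same surprise: with current config syntax, the setting goes alongside the two "main" values in solrconfig.xml. A sketch along these lines, using the numbers from my setup above (on the old 1.4.x/3.x versions I mentioned, the element syntax was different):

```xml
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <!-- without this, an explicit optimize still merges in smaller
         batches (the default is much lower), causing extra passes -->
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicyFactory>
</indexConfig>
```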

4> Don’t quite know what to do if maxSegments is 1 (or other very low number).

I don't think anything can be done about disk usage for that. Just the nature of the beast.

Something like this would also pave the way for “background optimizing”. 
Instead of a monolithic forceMerge, I can envision a process whereby we created 
a low-level task that merged one max-sized segment at a time, came up for air 
and reopened searchers then went back in and merged the next one. With its own 
problems about coordinating ongoing updates, but that’s another discussion ;).

As mentioned above, I worry about low-level code opening new searchers, because lots of users want that to be completely under their control. Maybe TMP needs another setting to tell it whether it's allowed to open searchers, with documentation noting that a forceMerge may need less disk space when it is allowed.

It would be awesome to eliminate the huge forceMerge disk requirement for most users, so I think it's worth exploring. Can the stuff with readers that Mike mentioned happen without opening a new searcher at the app level? My knowledge of Lucene internals is unfortunately too vague to answer my own question.

Thanks,
Shawn

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
