On 9/2/2019 9:19 AM, Erick Erickson wrote:
Anyway, it occurred to me that once a max-sized segment is created, _if_ we 
write the segments_n file out with the current state of the index, we could 
freely delete the segments that were merged into the new one. With 300G indexes 
(which I see regularly in the field, even multiple ones per node that size), 
this could result in substantial disk savings.

<snip>

Off the top of my head, I can see some concerns:
1> we’d have to open new searchers every time we wrote the segments_n file to 
release file handles on the old segments

How would that interact with user applications that normally handle opening new searchers (such as Solr)? When users want there to be no new searchers until they issue an explicit commit, I think they're going to be a little irritated if Lucene decides to open a new searcher on its own. Maybe we'd need to advise people to turn off their indexing anytime they're doing a forceMerge/optimize. That's generally a good idea anyway, and pretty much required if deleteByQuery is being used.
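For context, lots of Solr setups configure hard commits specifically so that they do NOT open a searcher, and rely on an explicit commit (or soft commits) to make changes visible. A typical solrconfig.xml snippet (times are just illustrative):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- flush to disk regularly, but keep the current searcher -->
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```

Low-level Lucene code opening searchers on its own would effectively bypass that openSearcher=false contract.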

2> coordinating multiple merge threads

I would think the scheduler already handles that ... thinking about all this makes my brain hurt ... if I have to think about the scheduler too, there might be implosions. :)

3> maxMergeAtOnceExplicit could mean unnecessary thrashing/opening searchers 
(could this be deprecated?)

It has always bothered me that when I looked for info about changing the policy settings and raised the two "main" parts of the policy (maxMergeAtOnce and segmentsPerTier) to 35 instead of the default 10, the info I was finding never mentioned maxMergeAtOnceExplicit. I also needed to set that value (to 105) to have an optimize work like I expected. Without it, a lot more merging occurred than was necessary when I did an optimize. This was on a really old version of Solr, either 1.4.x or 3.2.x, back when the setting was relatively new.

The maxMergeAtOnceExplicit setting is not even mentioned in the Solr ref guide page about IndexConfig. I got the information for that setting from solr-user, when I asked why an optimize with values increased from 10 to 35 was doing more merge passes than I thought it needed. I think that parameter either needs to go away or its docs need improvement.
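For anyone who hits the same surprise: with current config syntax, the setting goes alongside the two "main" values in solrconfig.xml. A sketch along these lines, using the numbers from my setup above (on the old 1.4.x/3.x versions I mentioned, the element syntax was different):

```xml
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <!-- without this, an explicit optimize still merges in smaller
         batches (the default is much lower), causing extra passes -->
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicyFactory>
</indexConfig>
```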

4> Don’t quite know what to do if maxSegments is 1 (or other very low number).

I don't think anything can be done about disk usage for that. Just the nature of the beast.

Something like this would also pave the way for “background optimizing”. 
Instead of a monolithic forceMerge, I can envision a process whereby we created 
a low-level task that merged one max-sized segment at a time, came up for air 
and reopened searchers then went back in and merged the next one. With its own 
problems about coordinating ongoing updates, but that’s another discussion ;).

As mentioned above, I worry about low-level code opening new searchers, because lots of users want that to be completely under their control. Maybe TMP needs another setting to tell it whether it's allowed to open searchers, with documentation noting that a forceMerge may need less disk space when it is allowed.

It would be awesome to eliminate the huge forceMerge disk requirement for most users, so I think it's worth exploring. Can the stuff with readers that Mike mentioned happen without opening a new searcher at the app level? My knowledge of Lucene internals is unfortunately too vague to answer my own question.

Thanks,
Shawn

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
