[ https://issues.apache.org/jira/browse/LUCENE-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307177#comment-17307177 ]
Robert Muir commented on LUCENE-9827:
-------------------------------------

[~jpountz] As for the "1% dirty docs" criterion, it was equally wrong before: a chunk with only one small document would get counted as having a huge number of dirty docs. That was part of what causes the issue with small segments here. They were unfairly punished when, in reality, that one small document isn't causing much harm. It is the same (but opposite) as your argument for a mostly-full chunk: if we use the percentage of docs, a mostly-full block now gets unfairly punished when in fact it's not causing much waste.

So neither way is great, but I would rather have exact statistics in the index, not estimates. And I need to know for the "floor" check that we really can create a new clean block to prevent tons of waste :)

Based on exact statistics, we may reconsider the second part of the formula, at least in the future. I'd really rather do any estimating here, at "runtime", than estimate at index time and then sum up the errors. For example, we could even go as far as estimating the projected byte savings and comparing that against the total size of the file, or something like that.

I don't really have an opinion on doing the percentage solely based on the number of chunks. I'm a bit worried it might be too aggressive after strange merges (e.g. big + littles) and cause unnecessary compression for little gain. But I'm not against it.

If you want to push updates to my branch, feel free; I will be happy to re-run this benchmark against them. FYI: I think the stored fields/term vectors index already knows getNumChunks, in case you want to try this formula.
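
For readers following the thread, the criteria being compared can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Lucene's actual merge code: the class DirtinessSketch, its method names, and the value 512 used for maxDocsPerChunk are hypothetical, and "dirty docs" is taken to mean the exact number of documents stored in incomplete chunks. The strict "greater than 1%" comparison is chosen so that a segment with 100 blocks and one incomplete block is not flagged, matching the example in the issue description.

```java
/** Hypothetical sketch of the dirtiness criteria discussed above; not Lucene's code. */
final class DirtinessSketch {
    private DirtinessSketch() {}

    /** Chunk-based ratio: more than 1% of the chunks are incomplete. */
    static boolean tooDirtyByChunks(long numDirtyChunks, long numChunks) {
        return numDirtyChunks * 100 > numChunks;
    }

    /** Doc-based ratio: more than 1% of the docs sit in incomplete chunks. */
    static boolean tooDirtyByDocs(long numDirtyDocs, long numDocs) {
        return numDirtyDocs * 100 > numDocs;
    }

    /** "Floor" check: the dirty docs could fill at least one new clean chunk. */
    static boolean canFillCleanChunk(long numDirtyDocs, int maxDocsPerChunk) {
        return numDirtyDocs >= maxDocsPerChunk;
    }

    /** One possible combination: floor check first, then the chunk-based ratio. */
    static boolean shouldRecompress(long numDirtyDocs, long numDirtyChunks,
                                    long numChunks, int maxDocsPerChunk) {
        return canFillCleanChunk(numDirtyDocs, maxDocsPerChunk)
            && tooDirtyByChunks(numDirtyChunks, numChunks);
    }

    public static void main(String[] args) {
        int maxDocsPerChunk = 512; // assumed value, for illustration only

        // Small segment: 10 chunks, the last one holds a single small document.
        System.out.println(tooDirtyByChunks(1, 10));                      // true
        System.out.println(shouldRecompress(1, 1, 10, maxDocsPerChunk));  // false

        // Mostly-full last chunk: 500 of 5000 docs count as dirty by docs.
        System.out.println(tooDirtyByDocs(500, 5000));                    // true
    }
}
```

With the floor check in front, the small segment whose last chunk holds one document is no longer recompressed, while the doc-based ratio still flags the mostly-full chunk, which is the asymmetry described in the comment above.
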
> Small segments are slower to merge due to stored fields since 8.7
> -----------------------------------------------------------------
>
> Key: LUCENE-9827
> URL: https://issues.apache.org/jira/browse/LUCENE-9827
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: Indexer.java, log-and-lucene-9827.patch,
> merge-count-by-num-docs.png, merge-type-by-version.png,
> total-merge-time-by-num-docs-on-small-segments.png,
> total-merge-time-by-num-docs.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> [~dm] and [~dimitrisli] looked into an interesting case where indexing slowed
> down after upgrading to 8.7. After digging we identified that this was due to
> the merging of stored fields, which had become slower on average.
>
> This is due to changes to stored fields, which now have top-level blocks that
> are then split into sub-blocks and compressed using shared dictionaries (one
> dictionary per top-level block). As the top-level blocks are larger than they
> were before, segments are more likely to be considered "dirty" by the merging
> logic. Dirty segments are segments where 1% of the data or more consists of
> incomplete blocks. For large segments, the size of blocks doesn't really
> affect the dirtiness of segments: if you flush a segment that has 100 blocks
> or more, it will never be considered dirty, as only the last block may be
> incomplete. But for small segments it does: for instance, if your segment has
> only 10 blocks, it is very likely to be considered dirty, given that the last
> block is always incomplete. And the fact that we increased the top-level block
> size means that segments that used to be considered clean might now be
> considered dirty.
>
> And indeed benchmarks reported that while large stored fields merges became
> slightly faster after upgrading to 8.7, the smaller merges actually became
> slower. See the attached chart, which gives the total merge time as a function
> of the number of documents in the segment.
>
> I don't know how we can address this; it is a natural consequence of the
> larger block size, which is needed to achieve better compression ratios. But
> I wanted to open an issue about it in case someone has a bright idea for how
> we could make things better.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)