[ https://issues.apache.org/jira/browse/LUCENE-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197328#comment-17197328 ]
Robert Muir edited comment on LUCENE-9529 at 9/17/20, 2:23 AM:
---------------------------------------------------------------

The current code tracks the total number of chunks and the total number of "dirty" (incomplete) chunks. Then we compute "tooDirty" like this:

{code}
/**
 * Returns true if we should recompress this reader, even though we could bulk merge compressed data
 * <p>
 * The last chunk written for a segment is typically incomplete, so without recompressing,
 * in some worst-case situations (e.g. frequent reopen with tiny flushes), over time the
 * compression ratio can degrade. This is a safety switch.
 */
boolean tooDirty(CompressingStoredFieldsReader candidate) {
  // more than 1% dirty, or more than hard limit of 1024 dirty chunks
  return candidate.getNumDirtyChunks() > 1024 ||
      candidate.getNumDirtyChunks() * 100 > candidate.getNumChunks();
}
{code}

Maybe, to be fairer, we could use a similar formula but track numDirtyDocs and compare it with numDocs (a value we already know)? We could still keep a safety switch such as 1024 dirty chunks to avoid some worst-case scenario, but at least change the ratio (a rough sketch of this idea follows the quoted issue description below).

> Larger stored fields block sizes mean we're more likely to disable optimized bulk merging
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9529
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9529
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Whenever possible when merging stored fields, Lucene tries to copy the compressed data instead of decompressing from the source segment and re-compressing in the destination segment. A problem with this approach is that if some blocks are incomplete (typically the last block of a segment), they remain incomplete in the destination segment too, and if we do this for too long we end up with a bad compression ratio. So Lucene keeps track of these incomplete blocks and makes sure to keep the ratio of incomplete blocks below 1%.
> But as we increased the block size, it has become more likely to have a high ratio of incomplete blocks. E.g. if you have a segment with 1MB of stored fields, with 16kB blocks like before, you have 63 complete blocks and 1 incomplete block, i.e. 1.6%. But now with ~512kB blocks, you have one complete block and one incomplete block, i.e. 50%.
> I'm not sure how to fix it, or even whether it should be fixed, but wanted to open an issue to track this.
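Regarding the suggestion above to base the ratio on documents rather than chunks: the following is a minimal, hypothetical sketch, not the actual Lucene implementation. It assumes the reader exposes getNumDirtyDocs() and getNumDocs() accessors (the current CompressingStoredFieldsReader only tracks chunk counts for this purpose) and keeps the existing 1024-dirty-chunk safety switch unchanged.

{code}
// Hypothetical variant of tooDirty(): keep the hard safety switch on dirty chunks,
// but base the 1% ratio on documents instead of chunks, so that a single incomplete
// chunk in a segment with only a few (large) chunks no longer dominates the decision.
boolean tooDirty(CompressingStoredFieldsReader candidate) {
  // safety switch: never tolerate more than 1024 dirty chunks
  if (candidate.getNumDirtyChunks() > 1024) {
    return true;
  }
  // recompress if more than 1% of the documents live in dirty (incomplete) chunks
  // (getNumDirtyDocs()/getNumDocs() are assumed accessors, not existing API)
  return candidate.getNumDirtyDocs() * 100 > candidate.getNumDocs();
}
{code}

Whether 1% remains the right threshold once the ratio is measured in documents is a separate question; the sketch only changes what the ratio is measured against.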