Todd Lipcon has submitted this change and it was merged.

Change subject: log: reduce segment size from 64MB to 8MB
......................................................................
log: reduce segment size from 64MB to 8MB

Currently, we always retain a minimum of one segment of the WAL, even if
a tablet is cold and has not performed any writes in a long time. If a
server has hundreds or thousands of tablets, keeping a 64MB segment for
each tablet adds up to a lot of "wasted" disk space, especially if the
WAL is configured to an expensive disk such as an SSD.

In addition to wasting space, the current code always re-reads all live
WAL segments at startup. Solving this has been an open problem for quite
some time, but there are various subtleties (described in KUDU-38). So,
as a band-aid, reducing the size of the segment will linearly reduce the
amount of work during bootstrap of "cold" tablets.

In summary, the expectation is that, in a "cold" server, this will reduce
WAL disk space usage and startup time by approximately a factor of 8. I
verified this by making these same changes in the configuration of a
server with ~6000 cold tablets. The disk usage of the WAL disk went from
381GB to 48GB, with a similar factor reduction in startup time.

Because the segment size is reduced by a factor of 8, the patch also
increases the max retention segment count by the same factor (from 10 to
80). This has one risk, in that we currently keep an open file descriptor
for every retained segment. However, in typical workloads, the number of
tablets with a high rate of active writes is not in the hundreds or
thousands, and thus the total number of file descriptors is still likely
to be manageable. Nevertheless, this patch adds a TODO that we should
consider a FileCache for these descriptors if we start to see the usage
be problematic.

So, if reducing to 8MB is good, why not go further, like 4MB or 1MB? The
assumption is that 8MB is still large enough to get good sequential IO
throughput on both read and write, and large enough to limit the FD
usage as described above. If we add an FD cache at some point, we could
consider reducing it further, particularly if running on SSDs where the
sequential throughput is less affected by size.

Although it's possible that a user had configured max_segments_to_retain
based on their own disk space requirements, the flag is marked
experimental, so I don't think we have to worry about compatibility in
that respect. We should consider changing the units here to be MB rather
than segment-count, so that the value is more robust.

Change-Id: Iadcda6b085e69cae5a15d54bb4c945d7605d5f98
Reviewed-on: http://gerrit.cloudera.org:8080/6911
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mpe...@apache.org>
---
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/log.cc
M src/kudu/consensus/log_util.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
4 files changed, 48 insertions(+), 4 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/6911
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Iadcda6b085e69cae5a15d54bb4c945d7605d5f98
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
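
[Editor's note] For readers who want to see what the default changes described
above look like in code, here is a minimal sketch, assuming the defaults are
ordinary gflags DEFINE_int32 declarations as is usual in Kudu. The flag name
log_segment_size_mb matches Kudu's WAL segment-size flag; the retention flag
name log_max_segments_to_retain is inferred from the "max_segments_to_retain"
reference in the commit message, and the file locations and help strings are
approximations, not the literal patch hunks.

    #include <gflags/gflags.h>

    // Sketch of src/kudu/consensus/log_util.cc: default was 64.
    // Each live WAL segment is preallocated to roughly this size, so a cold
    // tablet that retains a single segment now pins ~8MB of disk instead of
    // ~64MB, and bootstrap re-reads proportionally less data.
    DEFINE_int32(log_segment_size_mb, 8,
                 "The default size for WAL segments, in MB");

    // Sketch of the retention-count default: was 10, scaled up by the same
    // factor of 8 so the maximum retained WAL per actively written tablet
    // stays roughly constant (~640MB).
    // TODO: consider a FileCache for the per-segment file descriptors if the
    // higher retention count makes FD usage problematic.
    DEFINE_int32(log_max_segments_to_retain, 80,
                 "Maximum number of past log segments to keep at all times "
                 "for the purposes of catching up other peers.");

Since these are gflags, an operator who prefers the old behavior should be able
to override them at server startup (for example, --log_segment_size_mb=64
--log_max_segments_to_retain=10), subject to the experimental-flag caveat noted
above.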