Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6911

to look at the new patch set (#2).

Change subject: log: reduce segment size from 64MB to 8MB
......................................................................

log: reduce segment size from 64MB to 8MB

Currently, we always retain a minimum of one segment of the WAL, even if
a tablet is cold and has not performed any writes in a long time. If a
server has hundreds or thousands of tablets, keeping a 64MB segment for
each tablet adds up to a lot of "wasted" disk space, especially if the
WAL is placed on an expensive disk such as an SSD.

In addition to wasting space, the current code always re-reads all live
WAL segments at startup. Solving this properly has been an open problem
for quite some time due to various subtleties (described in KUDU-38). So,
as a band-aid, reducing the segment size linearly reduces the amount of
work needed to bootstrap "cold" tablets.

In summary, the expectation is that, in a "cold" server, this will
reduce WAL disk space usage and startup time by approximately a factor
of 8.

I verified this by making these same changes in the configuration of a server
with ~6000 cold tablets. Disk usage on the WAL disk went from 381GB to 48GB
(roughly in line with the expected 6000 x 64MB ~= 384GB and 6000 x 8MB = 48GB),
with a reduction in startup time by a similar factor.
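
Concretely, the change boils down to lowering the default of the WAL segment
size flag. As a rough sketch (assuming the flag is the log_segment_size_mb
gflag in log_util.cc; illustrative, not the exact diff):

    // src/kudu/consensus/log_util.cc (sketch; the file already includes
    // <gflags/gflags.h>)
    // Previous default: DEFINE_int32(log_segment_size_mb, 64, ...)
    DEFINE_int32(log_segment_size_mb, 8,
                 "The default size for log segments, in MB");

On an already-deployed server, the same effect can be had by overriding this
flag (and the retention flag discussed below) in the tserver configuration,
which is effectively what the measurement above did.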

Because the segment size is reduced by a factor of 8, the patch also
increases the max retention segment count by the same factor (from 10 to
80). This carries one risk: we currently keep an open file
descriptor for every retained segment. However, in typical workloads, the
number of tablets with a high rate of active writes is not in the hundreds
or thousands, so the total number of file descriptors should still be
manageable. Nevertheless, this patch adds a TODO that we should consider a
FileCache for these descriptors if FD usage becomes problematic.
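
For concreteness, a sketch of the retention-side change and the TODO (again
assuming the gflag names: log_max_segments_to_retain in log.cc, with TAG_FLAG
coming from kudu/util/flag_tags.h; illustrative, not the exact diff):

    // src/kudu/consensus/log.cc (sketch)
    // Scale the retention cap by the same factor of 8, so the maximum retained
    // WAL size per tablet stays roughly constant: 10 x 64MB == 80 x 8MB.
    DEFINE_int32(log_max_segments_to_retain, 80,
                 "Maximum number of past log segments to keep for the purposes "
                 "of catching up slow peers.");
    TAG_FLAG(log_max_segments_to_retain, experimental);

    // TODO: each retained segment currently holds an open file descriptor.
    // If FD usage becomes problematic (many tablets with a high write rate),
    // consider routing these descriptors through a FileCache.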

So, if reducing to 8MB is good, why not go further, like 4MB or 1MB? The
assumption is that 8MB is still large enough to get good sequential IO
throughput on both read and write, and large enough to limit the FD
usage as described above. If we add an FD cache at some point, we could
consider reducing the segment size further, particularly on SSDs, where
sequential throughput is less sensitive to segment size.

Although it's possible that a user had configured max_segments_to_retain
based on their own disk space requirements, the flag is marked
experimental, so I don't think we have to worry about compatibility in
that respect. We should consider changing the units of this flag to MB
rather than a segment count, so that the retention policy does not silently
change if the default segment size changes again.
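
If we do go that route, a hypothetical shape for it might be the following
(the flag name log_max_retain_size_mb and its default are invented here
purely to illustrate the idea):

    // Hypothetical -- the flag name and default are invented for illustration.
    // Expressing retention in MB keeps the policy stable even if the default
    // segment size changes again (640MB == today's 80 x 8MB).
    DEFINE_int64(log_max_retain_size_mb, 640,
                 "Maximum total size of past WAL segments to retain per tablet, in MB.");
    TAG_FLAG(log_max_retain_size_mb, experimental);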

Change-Id: Iadcda6b085e69cae5a15d54bb4c945d7605d5f98
---
M src/kudu/consensus/log.cc
M src/kudu/consensus/log_util.cc
2 files changed, 5 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/6911/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6911
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iadcda6b085e69cae5a15d54bb4c945d7605d5f98
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
