Hi Folks,
The basic idea behind the WAL is that every DB write transaction is
first written both into an in-memory buffer and to a region on disk.
RocksDB is typically set up to have multiple WAL buffers, and when one
or more fills up, it will start flushing that data to L0 while new
writes are written to the next buffer. If RocksDB can't flush data fast
enough, it will throttle write throughput down so that hopefully you
don't fill all of the buffers up and stall before a flush completes.
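If it helps to see that flow in code, here's a minimal sketch against
the RocksDB C++ API (the path, key, and value are made up, and BlueStore
of course drives RocksDB internally rather than through ad-hoc Puts like
this):

    #include <cassert>
    #include <rocksdb/db.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal-demo", &db);
      assert(s.ok());

      // Each Put lands in the active in-memory write buffer (memtable)
      // and is also appended to the WAL on disk for crash safety.
      rocksdb::WriteOptions wopts;  // wopts.disableWAL = true would skip the WAL
      s = db->Put(wopts, "pglog_entry", "payload");
      assert(s.ok());

      // When a buffer fills up (or on request, as here) its contents are
      // flushed to an L0 SST file while new writes go to the next buffer.
      s = db->Flush(rocksdb::FlushOptions());
      assert(s.ok());

      delete db;
      return 0;
    }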
The combined total size/number of buffers governs both how much disk
space you need for the WAL and how much RAM is needed to store incoming
IO that hasn't finished flushing into the DB. There are various
tradeoffs when adjusting the size, number, and behavior of the WAL
buffers. On one hand, small buffers favor frequent, swift flush events
and hopefully keep both overall memory usage and the CPU overhead of
key comparisons low. On the other hand, large WAL buffers give you more
runway, both in terms of being able to absorb longer L0 compaction
events and potentially in terms of being able to avoid writing pglog
entries to L0 entirely if a tombstone lands in the same WAL buffer as
the initial write. We've seen evidence that write amplification is
(sometimes much) lower with bigger WAL buffers, and we think this is a
big part of the reason why.
Right now our default WAL settings for rocksdb are:
max_write_buffer_number=4
min_write_buffer_number_to_merge=1
write_buffer_size=268435456
which means we will store up to 4 256MB buffers and start flushing as
soon as 1 fills up. An alternate strategy could be to use something
like 16 64MB buffers and set min_write_buffer_number_to_merge to
something like 4. That might provide slightly more fine-grained control
and may also be advantageous with a larger number of column families,
but we haven't seen evidence yet that splitting the buffers into more,
smaller segments definitely improves things. Probably the bigger
take-away is that you can't simply make the WAL huge to give yourself
extra runway for writes unless you are also willing to eat the RAM cost
of storing all of that data in memory as well. That's one of the
reasons why we regularly tell people that 1-2GB is enough for the WAL.
With a target OSD memory of 4GB, (up to) 1GB for the WAL is already
pushing it. Luckily, in most cases it doesn't actually use the full
1GB. RocksDB will throttle before you get to that point, so in reality
the WAL is probably using more like 0-512MB of disk/RAM, with 2-3 extra
buffers of capacity in case things get hairy.
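For illustration, here's how the current defaults and the hypothetical
16x64MB alternative map onto the RocksDB option fields, with the
worst-case footprint arithmetic in the comments. (In Ceph these values
are actually passed through as an options string via
bluestore_rocksdb_options rather than set in code, so treat this purely
as a sketch.)

    #include <rocksdb/options.h>

    // Current defaults: 4 buffers x 256MB, flush as soon as 1 fills up.
    // Worst-case buffer footprint: 4 * 256MB = 1GB of disk/RAM, though
    // throttling usually keeps real usage closer to 0-512MB.
    rocksdb::Options current_defaults() {
      rocksdb::Options opts;
      opts.write_buffer_size = 256 * 1024 * 1024;  // 268435456
      opts.max_write_buffer_number = 4;
      opts.min_write_buffer_number_to_merge = 1;
      return opts;
    }

    // Hypothetical alternative: 16 buffers x 64MB, merge 4 before flushing.
    // Same 1GB worst case, just sliced into finer-grained segments.
    rocksdb::Options alternate_layout() {
      rocksdb::Options opts;
      opts.write_buffer_size = 64 * 1024 * 1024;
      opts.max_write_buffer_number = 16;
      opts.min_write_buffer_number_to_merge = 4;
      return opts;
    }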
Mark
On 8/15/19 1:59 AM, Janne Johansson wrote:
On Thu, 15 Aug 2019 at 00:16, Anthony D'Atri <a...@dreamsnake.net> wrote:
Good points in both posts, but I think there’s still some unclarity.
...
We’ve seen good explanations on the list of why only specific DB
sizes, say 30GB, are actually used _for the DB_.
If the WAL goes along with the DB, shouldn’t we also explicitly
determine an appropriate size N for the WAL, and make the
partition (30+N) GB?
If so, how do we derive N? Or is it a constant?
Filestore was so much simpler, 10GB set+forget for the journal.
Not that I miss XFS, mind you.
But we got a simple handwaving-best-effort-guesstimate that went "WAL
1GB is fine, yes." so there you have an N you can use for the
30+N or 60+N sizings.
Can't see how that N needs more science than the filestore N=10G you
showed. Not that I think journal=10G was wrong or anything.
--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com