Hi Folks,

The basic idea behind the WAL is that every DB write transaction is first written both to an in-memory buffer and to a region on disk.  RocksDB is typically set up with multiple WAL buffers, and when one or more fills up, it starts flushing that data to L0 while new writes land in the next buffer.  If RocksDB can't flush data fast enough, it throttles write throughput down so that hopefully you don't fill all of the buffers and stall before a flush completes.  The combined size and number of buffers governs both how much disk space you need for the WAL and how much RAM is needed to hold incoming IO that hasn't finished flushing into the DB.

There are various tradeoffs when adjusting the size, number, and behavior of the WAL buffers.  On one hand, small buffers favor frequent, swift flush events and hopefully keep both overall memory usage and the CPU overhead of key comparisons low.  On the other hand, large WAL buffers give you more runway: they can absorb longer L0 compaction events, and they can potentially avoid writing pglog entries to L0 entirely if a tombstone lands in the same WAL buffer as the initial write.  We've seen evidence that write amplification is (sometimes much) lower with bigger WAL buffers, and we think this is a big part of the reason why.
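As a rough rule of thumb for that sizing relationship (a back-of-the-envelope upper bound, not a measured figure):

    max WAL footprint ~= write_buffer_size * max_write_buffer_number

With the default settings below that works out to 4 * 256MB = 1GB of disk for the WAL and roughly the same amount of RAM for the buffers in the worst case; doubling either setting doubles both.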


Right now our default WAL settings for rocksdb are:


max_write_buffer_number=4

min_write_buffer_number_to_merge=1

write_buffer_size=268435456


which means we will store up to four 256MB buffers and start flushing as soon as one fills up.  An alternate strategy could be to use something like sixteen 64MB buffers and set min_write_buffer_number_to_merge to something like 4 (see the example settings below).  Potentially that might provide slightly finer-grained control and may also be advantageous with a larger number of column families, but we haven't seen evidence yet that splitting the buffers into more, smaller segments definitely improves things.

Probably the bigger take-away is that you can't simply make the WAL huge to give yourself extra runway for writes unless you are also willing to eat the RAM cost of storing all of that data in memory as well.  That's one of the reasons why we regularly tell people that 1-2GB is enough for the WAL.  With a target OSD memory of 4GB, (up to) 1GB for the WAL is already pushing it.  Luckily, in most cases it doesn't actually use the full 1GB.  RocksDB will throttle before you get to that point, so in reality the WAL is probably using more like 0-512MB of disk/RAM, with 2-3 extra buffers of capacity in case things get hairy.
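For the curious, the alternate strategy above would look something like this (illustrative values only, not a tested recommendation; 67108864 is just 64MB in bytes):

max_write_buffer_number=16

min_write_buffer_number_to_merge=4

write_buffer_size=67108864

That keeps the same 1GB worst-case footprint, but RocksDB would wait until four of the sixteen 64MB buffers fill and merge them into a single flush, rather than flushing each 256MB buffer on its own.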


Mark


On 8/15/19 1:59 AM, Janne Johansson wrote:
On Thu, 15 Aug 2019 at 00:16, Anthony D'Atri <a...@dreamsnake.net> wrote:

    Good points in both posts, but I think there’s still some unclarity.


...

    We’ve seen good explanations on the list of why only specific DB
    sizes, say 30GB, are actually used _for the DB_.
    If the WAL goes along with the DB, shouldn’t we also explicitly
    determine an appropriate size N for the WAL, and make the
    partition (30+N) GB?
    If so, how do we derive N?  Or is it a constant?

    Filestore was so much simpler, 10GB set+forget for the journal. 
    Not that I miss XFS, mind you.


But we got a simple handwaving best-effort guesstimate that went "WAL 1GB is fine, yes," so there you have an N you can use for the
30+N or 60+N sizings.
Can't see how that N needs more science than the filestore N=10G you showed.  Not that I think journal=10G was wrong or anything.

--
May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com