BRAF.sync() bug can cause massive commit log write magnification
----------------------------------------------------------------

                 Key: CASSANDRA-2660
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2660
             Project: Cassandra
          Issue Type: Bug
            Reporter: Peter Schuller
         Attachments: CASSANDRA-2660-075.txt

This was discovered, fixed and tested on 0.7.5. Cursory examination shows it 
should still be an issue on trunk/0.8. If people otherwise agree with the patch 
I can rebase if necessary.

Problem:

BRAF.flush() is actually broken in the sense that it cannot be called without 
close co-operation with the caller. rebuffer() does the co-op by adjusting 
bufferOffset and validateBufferBytes appropriately, by sync() doesn't. This 
means sync() is broken, and sync() is used by the commit log.

The attached patch moves the bufferOffset/validateBufferBytes handling out into 
resetBuffer() and has both flush() and rebuffer() call that. This makes sync() 
safe.

What happened was that for batched commit log mode, every time sync() was 
called the data buffered so far would get written to the OS and fsync():ed. But 
until rebuffer() is called for other reasons as part of the write path, all 
subsequent sync():s would result in the very same data (plus whatever was 
written since last time) being re-written and fsync():ed again. So first you 
write+fsync N bytes, then N+N1, then N+N1+N2... (each N being a batch), until 
at some point you trigger a rebuffer() and it starts all over again.

The result is that you see *a lot* more writes to the commit log than are in 
fact written to the BRAF. And these writes translate into actual real writes to 
the underlying storage device due to fsync(). We had crazy numbers where we saw 
spikes upwards of 80 mb/second where the actual throughput was more like ~ 1 mb 
second of data to the commit log.

(One can make a possibly weak argument that it is also functionally incorrect 
as I can imagine implementations where re-writing the same blocks does 
copy-on-write in such a way that you're not necessarily guaranteed to see 
before-or-after data, particularly in case of partial page writes. However 
that's probably not a practical issue.)

Worthy of noting is that this probably causes added difficulties in fsync() 
latencies since the average fsync() will contain a lot more data. Depending on 
I/O scheduler and underlying device characteristics, the extra writes *may* not 
have a detrimental effect, but I think it's pretty easy to point to cases where 
it will be detrimental - in particular if the commit log is on a non-battery 
backed drive. Even with a nice battery backed RAID with the commit log on, the 
size of the writes probably contributes to difficulty in making the write 
requests propagate down without being starved by reads (but this is 
speculation, not tested, other than that I've observed commit log writer 
starvation that seemed excessive).

This isn't the first subtle BRAF bug. What are people's thoughts on creating 
separate abstractions for streaming I/O that can perhaps be a lot more simple, 
and use BRAF only for random reads in response to live traffic? (Not as part of 
this JIRA, just asking in general.)


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to