[ https://issues.apache.org/jira/browse/CASSANDRA-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Schuller reassigned CASSANDRA-2660: ----------------------------------------- Assignee: Peter Schuller > BRAF.sync() bug can cause massive commit log write magnification > ---------------------------------------------------------------- > > Key: CASSANDRA-2660 > URL: https://issues.apache.org/jira/browse/CASSANDRA-2660 > Project: Cassandra > Issue Type: Bug > Reporter: Peter Schuller > Assignee: Peter Schuller > Attachments: CASSANDRA-2660-075.txt > > > This was discovered, fixed and tested on 0.7.5. Cursory examination shows it > should still be an issue on trunk/0.8. If people otherwise agree with the > patch I can rebase if necessary. > Problem: > BRAF.flush() is actually broken in the sense that it cannot be called without > close co-operation with the caller. rebuffer() does the co-op by adjusting > bufferOffset and validateBufferBytes appropriately, by sync() doesn't. This > means sync() is broken, and sync() is used by the commit log. > The attached patch moves the bufferOffset/validateBufferBytes handling out > into resetBuffer() and has both flush() and rebuffer() call that. This makes > sync() safe. > What happened was that for batched commit log mode, every time sync() was > called the data buffered so far would get written to the OS and fsync():ed. > But until rebuffer() is called for other reasons as part of the write path, > all subsequent sync():s would result in the very same data (plus whatever was > written since last time) being re-written and fsync():ed again. So first you > write+fsync N bytes, then N+N1, then N+N1+N2... (each N being a batch), until > at some point you trigger a rebuffer() and it starts all over again. > The result is that you see *a lot* more writes to the commit log than are in > fact written to the BRAF. And these writes translate into actual real writes > to the underlying storage device due to fsync(). We had crazy numbers where > we saw spikes upwards of 80 mb/second where the actual throughput was more > like ~ 1 mb second of data to the commit log. > (One can make a possibly weak argument that it is also functionally incorrect > as I can imagine implementations where re-writing the same blocks does > copy-on-write in such a way that you're not necessarily guaranteed to see > before-or-after data, particularly in case of partial page writes. However > that's probably not a practical issue.) > Worthy of noting is that this probably causes added difficulties in fsync() > latencies since the average fsync() will contain a lot more data. Depending > on I/O scheduler and underlying device characteristics, the extra writes > *may* not have a detrimental effect, but I think it's pretty easy to point to > cases where it will be detrimental - in particular if the commit log is on a > non-battery backed drive. Even with a nice battery backed RAID with the > commit log on, the size of the writes probably contributes to difficulty in > making the write requests propagate down without being starved by reads (but > this is speculation, not tested, other than that I've observed commit log > writer starvation that seemed excessive). > This isn't the first subtle BRAF bug. What are people's thoughts on creating > separate abstractions for streaming I/O that can perhaps be a lot more > simple, and use BRAF only for random reads in response to live traffic? (Not > as part of this JIRA, just asking in general.) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira