[ https://issues.apache.org/jira/browse/CASSANDRA-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922697#action_12922697 ]
Chris Goffinet commented on CASSANDRA-1576:
-------------------------------------------

One note in regards to libaio: the benefit of using libaio would show up in workloads such as running map/reduce jobs against Cassandra nodes, which require a large amount of I/O to the disk. If we can batch more requests into a single io_submit() call, the I/O scheduler can better coalesce events.

> Improve the I/O subsystem for ROW-READ stage
> --------------------------------------------
>
>                 Key: CASSANDRA-1576
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1576
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.5, 0.7 beta 2
>            Reporter: Chris Goffinet
>
> I did some profiling a while ago and noticed that there is quite a bit of
> overhead in the ROW-READ stage of Cassandra. My testing was on the 0.6
> branch. Jonathan mentioned there is endpoint snitch caching in 0.7.
>
> One of the pain points is that we do synchronous I/O in our threads. I have
> observed through profiling and other benchmarks that, even on a very
> powerful machine (16-core Nehalem, 32GB of RAM), the overhead of going
> through the page cache can still be between 2-3ms (with mmap). I observed
> at least 800 microseconds more overhead when not using mmap. There is
> definite overhead in this stage. I propose we seriously consider moving to
> asynchronous I/O in each of these threads instead.
>
> Imagine the following scenario:
>
> 3ms with mmap to read from the page cache + 1.1ms of function call overhead
> (observed with Google iterators in 0.6; could be much better in 0.7)
>
> That's 4.1ms per message. With 32 threads, at best the machine is only going
> to be able to serve 7,804 messages/s.
>
> This number also assumes that all your data is in the page cache. If you
> start to dip into any set of data that isn't in cache, this number is going
> to drop substantially, even if your hit rate was 99%.
> Anyone with a serious data set that is greater than the total page cache of
> the cluster is going to be a victim of major slowdowns as soon as any
> requests come in needing to fetch data not in cache. If you run without the
> Direct I/O patch, and you actually have a pretty good write load, you can
> expect your cluster to fall victim even more to page cache thrashing as new
> SSTables are read/written during compaction.
>
> All of the scenarios mentioned above were seen at Digg on a 45-node
> cluster of 16-core machines with a dataset larger than the total page cache.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.