[ https://issues.apache.org/jira/browse/CASSANDRA-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920684#action_12920684 ]
Peter Schuller commented on CASSANDRA-1576:
-------------------------------------------

We spoke on IRC a bit and I just wanted to summarize the results here (Chris, please correct me if I am misrepresenting anything):

* We agree mmap() should be the most efficient approach for cached data.
* The synchronization concern was about the synchronization implicit in filling up the ROW-READ stage (RRS) rather than explicit synchronization in the read path.
* With respect to categorizing in-memory queries vs. queries that will go down to disk, Chris pointed out mincore(2), which allows a userland application to test whether a range is resident in memory (though it is not immediately obvious whether a mincore() call is cheap enough to be used in this context). A rough sketch of such a check is attached at the end of this message.
* The intent with libaio is not that a libaio syscall is necessarily cheaper than a synchronous I/O call; rather, the goal is to sustain an I/O concurrency/queue depth significantly higher than the RRS concurrency.
* While libaio is probably not worth it just to saturate the disk subsystem (where the required concurrency is reasonably limited, assuming non-humongous nodes), some workloads may benefit considerably from a deep queue if the reads are particularly susceptible to optimization by the underlying disks (due to platter location/relative positioning, etc.). A second sketch at the end of this message illustrates the queue-depth idea.

> Improve the I/O subsystem for ROW-READ stage
> --------------------------------------------
>
>                 Key: CASSANDRA-1576
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1576
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.5, 0.7 beta 2
>            Reporter: Chris Goffinet
>
> I did some profiling a while ago and noticed that there is quite a bit of overhead in the ROW-READ stage of Cassandra. My testing was on the 0.6 branch. Jonathan mentioned there is endpoint snitch caching in 0.7.
> One of the pain points is that we do synchronous I/O in our threads. I have observed through profiling and other benchmarks that, even on a very powerful machine (16-core Nehalem, 32GB of RAM), the overhead of going through the page cache can still be 2-3ms (with mmap). I observed at least 800 microseconds of additional overhead when not using mmap. There is definitely overhead in this stage. I propose we seriously consider moving to asynchronous I/O in each of these threads instead.
> Imagine the following scenario:
> 3ms with mmap to read from page cache + 1.1ms of function call overhead (observed Google iterators in 0.6; could be much better in 0.7).
> That's 4.1ms per message. With 32 threads, at best the machine is only going to be able to serve:
> 7,804 messages/s (32 threads / 0.0041s per message).
> This number also assumes that all your data is in the page cache. If you start to dip into any set of data that isn't in cache, this number is going to drop substantially, even if your hit rate is 99%.
> Anyone with a serious data set larger than the total page cache of the cluster is going to suffer major slowdowns as soon as requests come in that need to fetch data not in cache. If you run without the Direct I/O patch and you actually have a fairly heavy write load, you can expect your cluster to suffer even more from page cache thrashing as new SSTables are read/written during compaction.
> All of the scenarios mentioned above were seen at Digg on a 45-node cluster of 16-core machines with a dataset larger than the total page cache.
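
To make the mincore(2) point from the comment above concrete, here is a minimal, hypothetical C sketch (not code from the Cassandra tree): it mmap()s a data file and uses mincore(2) to ask whether a given byte range is resident in the page cache, which is the test a read path could use to decide between serving from the mapping and handing the read to an asynchronous I/O queue. The file argument, the 64KB probe size, and the range_in_core() helper are illustrative assumptions only; whether the mincore() call itself is cheap enough to run per read is exactly the open question noted above.

{code:c}
/* Hypothetical sketch, not Cassandra code: mmap() a data file and use
 * mincore(2) to test whether a byte range is resident in the page cache.
 * Build: gcc -O2 -o incore incore.c */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Returns 1 if every page covering [offset, offset+len) of the mapping is
 * resident, 0 if any page would fault in from disk, -1 on error. */
static int range_in_core(void *map, size_t offset, size_t len)
{
    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = (offset / page) * page;                     /* round down */
    size_t end   = ((offset + len + page - 1) / page) * page;  /* round up */
    size_t pages = (end - start) / page;
    unsigned char *vec = malloc(pages);                 /* one byte per page */

    if (vec == NULL)
        return -1;
    if (mincore((char *)map + start, end - start, vec) != 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < pages; i++) {
        if (!(vec[i] & 1)) {          /* low bit clear => page not resident */
            free(vec);
            return 0;
        }
    }
    free(vec);
    return 1;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <data-file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) {
        perror("open/fstat");
        return 1;
    }
    if (st.st_size == 0) {
        fprintf(stderr, "file is empty\n");
        return 1;
    }
    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Probe a row-sized range at the start of the file (clamped to EOF). */
    size_t len = st.st_size < 64 * 1024 ? (size_t)st.st_size : 64 * 1024;
    int r = range_in_core(map, 0, len);
    printf("range resident in page cache: %s\n",
           r == 1 ? "yes" : r == 0 ? "no" : "error");
    munmap(map, (size_t)st.st_size);
    close(fd);
    return 0;
}
{code}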
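
In the same spirit, a hypothetical libaio sketch of the queue-depth point: a batch of reads is submitted at once so the device queue depth is decoupled from the number of reader threads. QUEUE_DEPTH, READ_SIZE, and the single-file layout are assumptions for illustration; note also that on Linux io_submit() only stays truly asynchronous for files opened with O_DIRECT, so buffered submissions can still block in the kernel.

{code:c}
/* Hypothetical sketch, not Cassandra code: submit a batch of reads with
 * libaio so the I/O queue depth is independent of the thread count.
 * Build: gcc -O2 -o aio_reads aio_reads.c -laio */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <libaio.h>

#define QUEUE_DEPTH 64            /* far higher than the reader thread count */
#define READ_SIZE   (64 * 1024)   /* O_DIRECT needs sector-aligned sizes */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <data-file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    io_context_t ctx;
    memset(&ctx, 0, sizeof(ctx));
    int rc = io_setup(QUEUE_DEPTH, &ctx);
    if (rc != 0) {
        fprintf(stderr, "io_setup: %s\n", strerror(-rc));
        return 1;
    }

    /* Prepare QUEUE_DEPTH reads at consecutive offsets; reads past EOF
     * simply complete short, so a small test file is fine. */
    struct iocb  cbs[QUEUE_DEPTH];
    struct iocb *cbp[QUEUE_DEPTH];
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        void *buf;
        if (posix_memalign(&buf, 4096, READ_SIZE) != 0) { /* aligned buffers */
            perror("posix_memalign");
            return 1;
        }
        io_prep_pread(&cbs[i], fd, buf, READ_SIZE, (long long)i * READ_SIZE);
        cbp[i] = &cbs[i];
    }

    /* One syscall puts the whole batch on the device queue, where the
     * kernel and the disks are free to reorder and merge the requests. */
    int submitted = io_submit(ctx, QUEUE_DEPTH, cbp);
    if (submitted < 0) {
        fprintf(stderr, "io_submit: %s\n", strerror(-submitted));
        return 1;
    }

    /* Reap completions as they arrive. */
    struct io_event events[QUEUE_DEPTH];
    int done = 0;
    while (done < submitted) {
        int n = io_getevents(ctx, 1, QUEUE_DEPTH, events, NULL);
        if (n < 0) {
            fprintf(stderr, "io_getevents: %s\n", strerror(-n));
            return 1;
        }
        done += n;
    }
    printf("completed %d reads at queue depth %d\n", done, QUEUE_DEPTH);

    io_destroy(ctx);
    close(fd);
    return 0;
}
{code}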