[ https://issues.apache.org/jira/browse/CASSANDRA-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920684#action_12920684 ]

Peter Schuller commented on CASSANDRA-1576:
-------------------------------------------

We spoke on IRC a bit and I just wanted to summarize the results here (Chris, 
please correct me if I am misrepresenting anything):

* We agree mmap() should be the most efficient approach for cached data.

* The synchronization concern was not explicit locking in the read path, but 
rather the synchronization implicit in filling up the ROW-READ stage (RRS) 
thread pool.

* With respect to distinguishing in-memory queries from queries that will go 
down to disk, Chris pointed out mincore(2), which lets a userland application 
test whether a mapped range is resident in memory (though it's not immediately 
obvious whether a mincore() call is cheap enough to be used in this context).

* The intent with libaio isn't that a libaio syscall is necessarily cheaper 
than a synchronous I/O call; rather, the goal is to sustain an I/O 
concurrency/queue depth significantly higher than the RRS concurrency.

* While libaio is probably not worth it just to saturate the disk subsystem 
(where concurrency is reasonably limited, assuming non-humongous nodes), some 
workloads may benefit quite a lot from significant queue depth if the reads 
are particularly susceptible to optimization by the underlying disks (due to 
platter location/relative positioning, etc.).


> Improve the I/O subsystem for ROW-READ stage
> --------------------------------------------
>
>                 Key: CASSANDRA-1576
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1576
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.5, 0.7 beta 2
>            Reporter: Chris Goffinet
>
> I did some profiling a while ago, and noticed that there is quite a bit of 
> overhead that is happening in the ROW-READ stage of Cassandra. My testing was 
> on 0.6 branch. Jonathan mentioned there is endpoint snitch caching in 0.7. 
> One of the pain points is that we do synchronous I/O in our threads. I have 
> observed through profiling and other benchmarks, that even having a very 
> powerful machine (16-core Nehalem, 32GB of RAM), the overhead of going 
> through the page cache can still be 2-3ms (with mmap). I 
> observed at least 800 microseconds more overhead if not using mmap. There is 
> definitely overhead in this stage. I propose we seriously consider moving to 
> doing Asynchronous I/O in each of these threads instead. 
> Imagine the following scenario:
> 3ms with mmap to read from page cache + 1.1ms of function call overhead 
> (observed with Google iterators in 0.6; could be much better in 0.7).
> That's 4.1ms per message. With 32 threads, at best the machine is only going 
> to be able to serve:
> 7,804 messages/s. 
> This number also means that all your data has to be in page cache. If you 
> start to dip into any set of data that isn't in cache, this number is going 
> to drop substantially, even if your hit rate was 99%.
> Anyone with a serious data set greater than the total page cache of the 
> cluster is going to fall victim to major slowdowns as soon as any requests 
> come in needing to fetch data not in cache. If you run without the Direct I/O 
> patch, and you actually have a pretty good write load, you can expect your 
> cluster to fall victim even more with page cache thrashing as new SSTables 
> are read/written during compaction.
> All of these scenarios mentioned above were seen at Digg with a 45-node 
> cluster of 16-core machines and a dataset larger than total page cache.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
