[ https://issues.apache.org/jira/browse/CASSANDRA-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922697#action_12922697 ]
Chris Goffinet commented on CASSANDRA-1576:
-------------------------------------------

One note in regards to libaio: the benefit of using libaio would show up in workloads such as running map/reduce jobs against Cassandra nodes, which require a large amount of I/O to the disk. If we can batch more requests into a single io_submit() call, the I/O scheduler can better coalesce events.

> Improve the I/O subsystem for ROW-READ stage
> --------------------------------------------
>
>                 Key: CASSANDRA-1576
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1576
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.5, 0.7 beta 2
>            Reporter: Chris Goffinet
>
> I did some profiling a while ago and noticed that there is quite a bit of
> overhead in the ROW-READ stage of Cassandra. My testing was on the 0.6
> branch. Jonathan mentioned there is endpoint snitch caching in 0.7.
>
> One of the pain points is that we do synchronous I/O in our threads. I have
> observed through profiling and other benchmarks that, even on a very
> powerful machine (16-core Nehalem, 32GB of RAM), the overhead of going
> through the page cache can still be between 2-3ms (with mmap). I observed
> at least 800 microseconds more overhead when not using mmap. There is
> definite overhead in this stage. I propose we seriously consider moving to
> asynchronous I/O in each of these threads instead.
>
> Imagine the following scenario:
>
> 3ms with mmap to read from the page cache + 1.1ms of function call overhead
> (observed with Google iterators in 0.6; could be much better in 0.7)
>
> That's 4.1ms per message. With 32 threads, at best the machine is only going
> to be able to serve 7,804 messages/s.
>
> This number also assumes that all your data is in the page cache. If you
> start to dip into any set of data that isn't in cache, this number is going
> to drop substantially, even if your hit rate was 99%.
> Anyone with a serious data set that is greater than the total page cache of
> the cluster is going to be a victim of major slowdowns as soon as any
> requests come in needing to fetch data not in cache. If you run without the
> Direct I/O patch, and you actually have a pretty good write load, you can
> expect your cluster to fall victim even more to page cache thrashing as new
> SSTables are read/written during compaction.
>
> All of the scenarios mentioned above were seen at Digg on a 45-node
> cluster of 16-core machines with a dataset larger than the total page cache.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.