[ 
https://issues.apache.org/jira/browse/CASSANDRA-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919046#action_12919046
 ] 

Peter Schuller commented on CASSANDRA-1576:
-------------------------------------------

Could you clarify where/how I/O is synchronized? It was unexpected to me that 
there would be any need to synchronize read-only I/O on mmap():ed files, and I 
did not easily find any synchronization points in the I/O path of SSTableReader 
and the memory-mapped segmented files below it.
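
For reference, the access pattern I would have expected is roughly the 
following (a minimal sketch with a made-up file name, not Cassandra's actual 
code): each reader thread works on its own duplicate() of a shared read-only 
MappedByteBuffer, so the reads themselves should not require any locking.

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MmapReadSketch {
        public static void main(String[] args) throws Exception {
            RandomAccessFile raf = new RandomAccessFile("/tmp/example-data.db", "r");
            FileChannel channel = raf.getChannel();
            // One shared, read-only mapping of the whole file.
            final MappedByteBuffer mapped =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            Runnable reader = new Runnable() {
                public void run() {
                    // duplicate() gives each thread its own position/limit over
                    // the same mapped memory; read-only access needs no locking.
                    ByteBuffer view = mapped.duplicate();
                    byte[] chunk = new byte[Math.min(64, view.remaining())];
                    view.get(chunk);
                }
            };

            Thread t1 = new Thread(reader);
            Thread t2 = new Thread(reader);
            t1.start(); t2.start();
            t1.join(); t2.join();
            channel.close();
            raf.close();
        }
    }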

Also, you mention asynchronous I/O, which makes me wonder: are you referring to 
synchronized vs. unsynchronized I/O in the concurrent, multi-threaded sense, or 
to using synchronous vs. asynchronous I/O APIs? If the latter, I am not sure 
how that applies to reading from mmap():ed memory regions.
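
To make that distinction concrete (purely illustrative, using the JDK 7 NIO.2 
AsynchronousFileChannel API and a made-up file name, not something I am 
proposing we adopt): with an asynchronous I/O API the read is submitted and 
completes later, whereas a read from an mmap():ed region is just a memory 
access that may block on a page fault, so the synchronous/asynchronous API 
distinction does not obviously apply to it.

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.Future;

    public class AsyncReadSketch {
        public static void main(String[] args) throws Exception {
            AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                Paths.get("/tmp/example-data.db"), StandardOpenOption.READ);
            ByteBuffer buf = ByteBuffer.allocate(64);
            // The read is submitted here and completes in the background; the
            // calling thread is free until it actually needs the result.
            Future<Integer> pending = ch.read(buf, 0);
            int bytesRead = pending.get();
            System.out.println("read " + bytesRead + " bytes");
            ch.close();
        }
    }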

Are the millisecond timings you mentioned measured specifically *within* the 
read stage, or might they include the context switching overhead, scheduling 
delay, etc. associated with submitting a job to the read stage?

And finally, if I read the end of your description correctly, you are saying 
(1) that there is some synchronization going on that effectively serializes 
parts of the read stage, and (2) that this applies to the mmap():ed access 
itself, such that disk I/O would be part of the serialized path? If that is 
true, I wholeheartedly agree that this is a major issue.
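
To illustrate why (2) would be so bad (a contrived sketch, not taken from the 
Cassandra source): if access to the mapped region happens under a shared lock, 
a page fault turns into disk I/O inside the critical section, and every other 
reader thread stalls behind it.

    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;

    public class SerializedReadSketch {
        private final Object ioLock = new Object();
        private final MappedByteBuffer mapped;

        public SerializedReadSketch(MappedByteBuffer mapped) {
            this.mapped = mapped;
        }

        public byte[] read(int offset, int length) {
            byte[] out = new byte[length];
            synchronized (ioLock) {
                // If the requested page is not resident, the get() below page
                // faults and waits on the disk while holding the lock, so every
                // other reader thread is serialized behind that disk access.
                ByteBuffer view = mapped.duplicate();
                view.position(offset);
                view.get(out);
            }
            return out;
        }
    }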


> Improve the I/O subsystem for ROW-READ stage
> --------------------------------------------
>
>                 Key: CASSANDRA-1576
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1576
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.5, 0.7 beta 2
>            Reporter: Chris Goffinet
>
> I did some profiling a while ago and noticed that there is quite a bit of 
> overhead happening in the ROW-READ stage of Cassandra. My testing was on the 
> 0.6 branch; Jonathan mentioned there is endpoint snitch caching in 0.7. 
> One of the pain points is that we do synchronize I/O in our threads. I have 
> observed through profiling and other benchmarks that, even on a very 
> powerful machine (16-core Nehalem, 32GB of RAM), the overhead of going 
> through the page cache can still be 2-3ms (with mmap). I observed at least 
> 800 microseconds more overhead when not using mmap. There is definitely 
> overhead in this stage. I propose we seriously consider moving to 
> asynchronous I/O in each of these threads instead. 
> Imagine the following scenario:
> 3ms with mmap to read from the page cache + 1.1ms of function call overhead 
> (observed Google iterators in 0.6, could be much better in 0.7).
> That's 4.1ms per message. With 32 threads each busy for 4.1ms per message, 
> the best the machine can serve is 32 / 0.0041s, or about
> 7,804 messages/s. 
> This number also assumes that all your data is in the page cache. If you 
> start to dip into any set of data that isn't in cache, this number is going 
> to drop substantially, even if your hit rate is 99%.
> Anyone with a serious data set larger than the total page cache of the 
> cluster is going to be a victim of major slowdowns as soon as any requests 
> come in that need to fetch data not in cache. If you run without the Direct 
> I/O patch and you have a reasonably heavy write load, you can expect your 
> cluster to suffer even more from page cache thrashing as new SSTables are 
> read/written during compaction.
> All of the scenarios mentioned above were seen at Digg on a 45-node cluster 
> of 16-core machines with a dataset larger than the total page cache.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
