[ https://issues.apache.org/jira/browse/CASSANDRA-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920323#action_12920323 ]

Peter Schuller commented on CASSANDRA-1576:
-------------------------------------------

Ok. I am a bit skeptical, but perhaps I am misunderstanding what you're 
proposing. Also, for lack of time, I have not attempted to replicate your 
benchmarking results, so I am speaking here without numbers to back it up... I 
hope you don't feel I'm being out of place.

If I understand you correctly, you are proposing that reads would still be 
served by multiple threads in the row read stage, each one of which would be 
using aio instead of mmap():ed access for reads?

One reason I am skeptical is that mmap() *really* should be the fastest option 
for the cached case (i.e., when you don't take a page fault). It should 
literally just be memory access, and the overhead associated with the mmap() 
itself should be effectively zero, given that the same CPU virtual memory 
machinery is exercised as with any other memory access within the process. 
Using aio or regular synchronous I/O calls should at the very least imply 
additional syscall overhead, even if zero-copy transfer between kernel space 
and user space were in effect (I don't think it is?).
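
To make the comparison concrete, here is a rough Java sketch of the two read 
paths being discussed (the file name, offset and length are made up for 
illustration): a read through a MappedByteBuffer is plain memory access once 
the page is resident, whereas a positional FileChannel.read() is a syscall 
plus a copy into the user-space buffer on every call, cached page or not:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class CachedReadPaths
    {
        public static void main(String[] args) throws IOException
        {
            // Hypothetical data file; path, offset and length are illustrative
            // only, and the file is assumed to be at least a few KB long.
            RandomAccessFile file = new RandomAccessFile("/tmp/example-data.db", "r");
            FileChannel channel = file.getChannel();
            try
            {
                long offset = 4096;
                int length = 1024;

                // mmap() path: after the initial map, get() on a resident page is
                // plain memory access through the process page tables - no syscall.
                MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                byte first = mapped.get((int) offset);

                // Syscall path: every read() crosses into the kernel and copies
                // bytes into the user-space buffer, even when the page is cached.
                ByteBuffer copy = ByteBuffer.allocate(length);
                channel.read(copy, offset);

                System.out.println("mmap byte: " + first
                                   + ", bytes read via syscall: " + copy.position());
            }
            finally
            {
                file.close();
            }
        }
    }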

If mmap():ed I/O is truly slower in the cached, in-memory case than doing a 
syscall for I/O (synchronous or aio alike), my feeling is that something is 
outright wrong - either in the Java code or in the JVM's handling of access to 
mmap():ed regions. Of course, if there is a JVM issue, then that just means 
working around it in Cassandra, since we cannot rely on JVM patching...

For the non-cached case (i.e., you take the page fault and go down to disk), I 
find it a bit more plausible that perhaps, as an implementation detail but not 
fundamentally implied, the page faulting path in the kernel is slower than a 
direct I/O system call (whether synchronous or asynchronous). Even then, 
however, I am unclear on why asynchronous I/O would be expected to be faster, 
except as an artifact of a kernel implementation detail, if my interpretation 
is correct that the intention is still to have each thread perform its own I/O 
(so there is no coalescing of I/O vectors going on).

Again I want to stress that I have made zero attempts to actually benchmark 
this and maybe I'm missing something obvious.

Also, I am still not clear on how non-cached disk access affects cached disk 
access. In the absence of explicit synchronization by Cassandra, why would 
page-faulting memory access block other readers that only touch cached pages? 

Unless you mean that an overloaded system whose disks are completely saturated 
will, due to the fixed thread pool size design, block requests that only touch 
cached data from being processed. If this is to be addressed, I definitely see 
the value of using aio *if* the I/O is done by one (or a few) dedicated threads 
that take read requests. However, even then *some* limit would presumably have 
to be imposed, as long as Cassandra is unable to classify requests as "won't 
touch anything outside of cache" vs. "will touch disk" before attempting the 
read - which seems impossible unless Cassandra were to implement its own page 
cache.
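
To illustrate the kind of design I have in mind - this is only a rough sketch 
of the idea, not a claim about how Cassandra would actually implement it, and 
the class name, queue size and use of the JDK's AsynchronousFileChannel are 
invented here purely for illustration - one dedicated thread could drain a 
bounded queue of read requests and issue them as asynchronous reads, so that a 
slow disk read does not tie up a pooled reader thread:

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.channels.CompletionHandler;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CompletableFuture;

    // Hypothetical sketch: one dedicated thread drains read requests and issues
    // them asynchronously, instead of each pooled reader thread blocking on disk.
    public class AsyncReadDispatcher implements Runnable
    {
        public static final class ReadRequest
        {
            final AsynchronousFileChannel channel;
            final long offset;
            final int length;
            final CompletableFuture<ByteBuffer> result = new CompletableFuture<ByteBuffer>();

            ReadRequest(AsynchronousFileChannel channel, long offset, int length)
            {
                this.channel = channel;
                this.offset = offset;
                this.length = length;
            }
        }

        // The bounded queue is the "some limit" mentioned above: when the disks
        // are saturated, submitters block here instead of piling up I/O work.
        private final BlockingQueue<ReadRequest> queue =
            new ArrayBlockingQueue<ReadRequest>(128);

        public CompletableFuture<ByteBuffer> submit(AsynchronousFileChannel channel,
                                                    long offset, int length)
            throws InterruptedException
        {
            ReadRequest request = new ReadRequest(channel, offset, length);
            queue.put(request);
            return request.result;
        }

        public void run()
        {
            while (!Thread.currentThread().isInterrupted())
            {
                try
                {
                    ReadRequest request = queue.take();
                    final ByteBuffer buffer = ByteBuffer.allocate(request.length);
                    // Hand the read to the async machinery and immediately move
                    // on to the next request without waiting for the disk.
                    request.channel.read(buffer, request.offset, request,
                        new CompletionHandler<Integer, ReadRequest>()
                        {
                            public void completed(Integer bytesRead, ReadRequest req)
                            {
                                buffer.flip();
                                req.result.complete(buffer);
                            }

                            public void failed(Throwable cause, ReadRequest req)
                            {
                                req.result.completeExceptionally(cause);
                            }
                        });
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

Row-read threads would then just submit() and wait on (or chain work off) the 
returned future, and the bounded queue provides the back-pressure limit I 
mentioned, since we cannot tell in advance whether a read will stay in cache.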


> Improve the I/O subsystem for ROW-READ stage
> --------------------------------------------
>
>                 Key: CASSANDRA-1576
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1576
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.5, 0.7 beta 2
>            Reporter: Chris Goffinet
>
> I did some profiling a while ago, and noticed that there is quite a bit of 
> overhead happening in the ROW-READ stage of Cassandra. My testing was on the 
> 0.6 branch. Jonathan mentioned there is endpoint snitch caching in 0.7. 
> One of the pain points is that we do synchronous I/O in our threads. I have 
> observed through profiling and other benchmarks that, even on a very 
> powerful machine (16-core Nehalem, 32GB of RAM), the overhead of going 
> through the page cache can still be between 2-3ms (with mmap). I observed 
> at least 800 microseconds more overhead if not using mmap. There is 
> definitely overhead in this stage. I propose we seriously consider moving to 
> doing asynchronous I/O in each of these threads instead. 
> Imagine the following scenario:
> 3ms with mmap to read from page cache + 1.1ms of function call overhead 
> (observed google iterators in 0.6, could be much better in 0.7)
> That's 4.1ms per message. With 32 threads, at best the machine is only going 
> to be able to serve:
> 7,804 messages/s. 
> This number also means that all your data has to be in page cache. If you 
> start to dip into any set of data that isn't in cache, this number is going 
> to drop substantially, even if your hit rate was 99%.
> Anyone with a serious data set that is greater than the total page cache of 
> the cluster is going to be a victim of major slowdowns as soon as any requests 
> come in needing to fetch data not in cache. If you run without the Direct I/O 
> patch, and you actually have a pretty good write load, you can expect your 
> cluster to fall victim even more to page cache thrashing as new SSTables 
> are read/written during compaction.
> All of the scenarios mentioned above were seen at Digg on a 45-node cluster 
> of 16-core machines, with a dataset larger than the total page cache.
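
For reference, the throughput figure in the quoted description follows 
directly from the reporter's own numbers; nothing new is measured here, this 
just restates the arithmetic:

    3 ms (mmap read from page cache) + 1.1 ms (function call overhead) = 4.1 ms per message
    32 threads / 0.0041 s per message ~= 7,804 messages/s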

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
