Yes, as the size of the data on disk increases and the OS can no longer avoid disk seeks, read performance degrades.  You can see this in the results from the original post: as the number of keys in the test goes from 10M to 100M, reads drop from 4,600/s to 200/s.  10M keys in the stress.py test corresponds to roughly 5GB on disk.  If the frequently used records are small enough to fit into cache, you can restore read performance by lifting them into the row cache and largely avoiding seeks during reads.
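
As a rough back-of-envelope sketch (the 1GB cache budget below is an assumed figure for illustration, not something from the test), the ~5GB per 10M keys works out to roughly 500 bytes per row, so even a modest row cache can hold a couple of million of the hottest rows:

    # Rough sizing of the row cache from the figures above:
    # 10M keys corresponded to roughly 5GB on disk.
    keys_on_disk = 10 * 1000 * 1000
    bytes_on_disk = 5 * 1024 ** 3

    avg_row_bytes = bytes_on_disk / float(keys_on_disk)   # ~540 bytes per row
    print("average row size: ~%d bytes" % avg_row_bytes)

    # Assumed (hypothetical) budget of 1GB of heap for the row cache:
    cache_budget_bytes = 1 * 1024 ** 3
    print("rows that fit in a 1GB row cache: ~%d"
          % (cache_budget_bytes / avg_row_bytes))

If the hot rows fit within that budget, reads are served from memory and the seek penalty largely disappears.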

On Jul 17, 2010, at 1:05 AM, Schubert Zhang wrote:

I think your read throughput is very high, and it may not be realistic.

For random reads, disk seeks will always be the bottleneck (100% util).
There will be about 3 random disk seeks per random read, at about 10ms per seek, so roughly 30ms per random read.

If you have only one disk, the read throughput will be only about 40 reads/s.
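
That is just the seek arithmetic; as a quick sketch (the 3-seek and 10ms figures are the rough numbers above, not measurements, and 3 x 10ms gives a few tens of reads per second):

    # Seek-bound estimate for a single spinning disk, using the figures above:
    seeks_per_read = 3            # ~3 random seeks per random read
    seek_time_s = 0.010           # ~10ms per seek

    time_per_read_s = seeks_per_read * seek_time_s    # ~0.030s per read
    print("seek-bound throughput: ~%d reads/s per disk"
          % (1.0 / time_per_read_s))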

The high throughput you measured may be due to the Linux FS cache, if your dataset is small (for example, only 1GB).
Try testing random read throughput on 100GB or 1TB and you may get a different result.


On Sat, Jul 17, 2010 at 7:06 AM, Oren Benjamin <o...@clearspring.com> wrote:
I've been doing quite a bit of benchmarking of Cassandra in the cloud using stress.py.  I'm working on a comprehensive spreadsheet of results with a template that others can add to, but for now I thought I'd post some of the basic results here to get some feedback from others.

The first goal was to reproduce the test described on spyced here: 
http://spyced.blogspot.com/2010/01/cassandra-05.html

Using Cassandra 0.6.3, a 4GB/160GB cloud server (http://www.rackspacecloud.com/cloud_hosting_products/servers/pricing) with default storage-conf.xml and cassandra.in.sh, here's what I got:

Reads: 4,800/s
Writes: 9,000/s

Pretty close to the result posted on the blog, with a slightly lower write 
performance (perhaps due to the availability of only a single disk for both 
commitlog and data).

That was with 1M keys (the blog used 700K).

As the number of keys scales, read performance degrades as would be expected with no caching:
  1M   4,800 reads/s
 10M   4,600 reads/s
 25M     700 reads/s
100M     200 reads/s

Using the row cache and an appropriate choice of --stdev to achieve a cache hit rate of >90% restores read performance to the 4,800 reads/s level in all cases.  Also as expected, write performance is unaffected by the amount of data already written.
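
For what it's worth, here's the rough model I've been using to pick --stdev for a target hit rate.  It's only a sketch: it assumes stress.py draws keys from a Gaussian centred on the middle of the key range (my reading of the option), and the example numbers below are illustrative rather than taken from the runs above:

    import math

    def gaussian_hit_rate(cached_rows, total_keys, stdev_fraction):
        """Estimated fraction of reads that land in the hottest
        `cached_rows` rows, assuming keys are drawn from a Gaussian centred
        on the key range with stdev = stdev_fraction * total_keys.
        (An approximation of stress.py's --stdev behaviour, not a spec.)"""
        sigma = stdev_fraction * total_keys
        return math.erf((cached_rows / 2.0) / (sigma * math.sqrt(2)))

    # e.g. 100M keys, --stdev 0.01 (sigma = 1M keys), 3.5M rows cached:
    print("estimated hit rate: %.0f%%"
          % (100 * gaussian_hit_rate(3500000, 100000000, 0.01)))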

Scaling:
The above was single node testing.  I'd expect to be able to add nodes and 
scale throughput.  Unfortunately, I seem to be running into a cap of 21,000 
reads/s regardless of the number of nodes in the cluster.  In order to better 
understand this, I eliminated the factors of data size, caching, replication, etc. and ran read tests on empty clusters (every read a miss, bouncing off the bloom filter and straight back).  One node gives 24,000 reads/s while 2, 3, 4... nodes give 21,000 (presumably the bump in single-node performance is due to the absence of the extra hop).  With CPU, disk, and RAM all largely unused, I'm at a loss
to explain the lack of additional throughput.  I tried increasing the number of 
clients but that just split the throughput down the middle with each stress.py 
achieving roughly 10,000 reads/s.  I'm running the clients (stress.py) on 
separate cloud servers.

I checked the ulimit file count and I'm not limiting connections there.  It 
seems like there's a problem with my test setup - a clear bottleneck somewhere, 
but I just don't see what it is.  Any ideas?
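
For reference, the ulimit check amounts to nothing more than this on each box (a small Python sketch using the standard resource module; the 8192 threshold is just an arbitrary sanity value, not a Cassandra requirement):

    import resource

    # Open-file-descriptor limits for the current process; client sockets
    # and SSTable files all count against this on a Cassandra node.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("nofile limit: soft=%d hard=%d" % (soft, hard))
    if soft < 8192:   # arbitrary sanity threshold
        print("warning: soft limit looks low for a busy node")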

Also:
The disk performance of the cloud servers has been extremely spotty.  The results I posted above were reproducible whenever the servers were in their "normal" state.  But for periods of as much as several consecutive hours, single servers or groups of servers in the cloud would suddenly have horrendous disk performance as measured by dstat and iostat.  The "% steal" by the hypervisor on these nodes is also quite high (> 30%).  The performance during these "bad" periods drops from 4,800 reads/s in the single-node benchmark to just 200 reads/s.  The node is effectively useless.  Is this normal for the cloud?  And if so, what's the solution as far as Cassandra is concerned?  Obviously you can just keep adding more nodes until the likelihood that there is at least one good server with every piece of the data is reasonable.  However, Cassandra routes to the topologically nearest node and not to the best-performing one, so "bad" nodes will always result in high-latency reads.  How are you guys that are running in the cloud dealing with this?  Are you seeing this at all?
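
In case anyone wants to reproduce the measurement without dstat/iostat, %steal can be sampled directly by diffing /proc/stat (a Linux-only sketch; it assumes a kernel that reports the steal counter as the 8th field of the cpu line):

    import time

    def cpu_counters():
        # Aggregate "cpu" line from /proc/stat: user nice system idle
        # iowait irq softirq steal ...
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]

    def steal_percent(interval=5.0):
        before = cpu_counters()
        time.sleep(interval)
        after = cpu_counters()
        deltas = [a - b for a, b in zip(after, before)]
        total = sum(deltas)
        return 100.0 * deltas[7] / total if total else 0.0   # index 7 = steal

    print("steal over 5s: %.1f%%" % steal_percent())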

Thanks in advance for your feedback and advice,

  -- Oren

