Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
I didn't know you use the actual key instead of its md5 (for the random partitioner) in KCF. It's a good point that I'll watch the hit ratio of KCF to determine whether it needs to be increased.

Thanks,
-Weijun

On Tue, Feb 16, 2010 at 5:34 PM, Jonathan Ellis wrote:
> On Tue, Feb 16, 2010 at 7:27 PM, Weijun Li wrote:
> > Yes my KeysCachedFraction is already 0.3 but it doesn't relieve the disk
> > i/o. I compacted the data into a single 60GB file (took quite a while to
> > finish and increased latency while running, as expected) but that
> > doesn't help much either.
> >
> > If I set KCF to 1 (meaning cache the entire sstable index), how much
> > memory will it take for 50mil keys?
>
> 10/3 what 0.3 takes :)
>
> > Is the index a straight key-offset map? I guess key is 16 bytes and
> > offset is 8 bytes.
>
> key length depends on your data, of course.
>
> > Will KCF=1 help to reduce disk i/o?
>
> depends. w/ trunk you can look at your cache hit rate w/ jconsole to see
> if increasing it more would help.
>
> -Jonathan
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
On Tue, Feb 16, 2010 at 7:27 PM, Weijun Li wrote:
> Yes my KeysCachedFraction is already 0.3 but it doesn't relieve the disk
> i/o. I compacted the data into a single 60GB file (took quite a while to
> finish and increased latency while running, as expected) but that doesn't
> help much either.
>
> If I set KCF to 1 (meaning cache the entire sstable index), how much
> memory will it take for 50mil keys?

10/3 what 0.3 takes :)

> Is the index a straight key-offset map? I guess key is 16 bytes and
> offset is 8 bytes.

key length depends on your data, of course.

> Will KCF=1 help to reduce disk i/o?

depends. w/ trunk you can look at your cache hit rate w/ jconsole to see if
increasing it more would help.

-Jonathan
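[Editor's note: Jonathan's "10/3 what 0.3 takes" and the 16-byte-key / 8-byte-offset guesses above make the index-memory question easy to sanity-check. A rough back-of-envelope sketch; the per-entry JVM overhead figure is an assumed round number, not measured:

```python
# Back-of-envelope estimate of key-cache memory at KCF=1, using the
# thread's assumed sizes (16-byte key, 8-byte offset). Real usage is
# higher: Java object headers and map-entry overhead add tens of
# bytes per entry (the 64 below is a guess).
keys = 50_000_000
key_bytes = 16
offset_bytes = 8

raw = keys * (key_bytes + offset_bytes)        # payload only
print(f"raw payload: {raw / 2**30:.2f} GiB")   # ~1.12 GiB

with_overhead = keys * (key_bytes + offset_bytes + 64)
print(f"with ~64 B/entry JVM overhead: {with_overhead / 2**30:.2f} GiB")
```

Even the raw payload is over a gigabyte, which lines up with the heap-sizing discussion later in the thread.]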
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
Yes my KeysCachedFraction is already 0.3 but it doesn't relieve the disk
i/o. I compacted the data into a single 60GB file (took quite a while to
finish and increased latency while running, as expected) but that doesn't
help much either.

If I set KCF to 1 (meaning cache the entire sstable index), how much memory
will it take for 50mil keys? Is the index a straight key-offset map? I
guess key is 16 bytes and offset is 8 bytes. Will KCF=1 help to reduce disk
i/o?

-Weijun

On Tue, Feb 16, 2010 at 5:18 PM, Jonathan Ellis wrote:
> Have you tried increasing KeysCachedFraction?
>
> On Tue, Feb 16, 2010 at 6:15 PM, Weijun Li wrote:
> > Still have high read latency with 50mil records in the 2-node cluster
> > (replica 2). I restarted both nodes but read latency is still above
> > 60ms and disk i/o saturation is high. Tried compact and repair but they
> > don't help much. When I reduced the client threads from 15 to 5 it
> > looks a lot better but throughput is kind of low. I changed to 16
> > flushing threads instead of the default 8; could that cause the disk
> > saturation issue?
> >
> > For benchmarks with decent throughput and latency, how many client
> > threads do they use? Can anyone share the storage-conf.xml from a
> > well-tuned high-volume cluster?
> >
> > -Weijun
> >
> > On Tue, Feb 16, 2010 at 10:31 AM, Stu Hood wrote:
> > > > After I ran "nodeprobe compact" on node B its read latency went up
> > > > to 150ms.
> > > The compaction process can take a while to finish... in 0.5 you need
> > > to watch the logs to figure out when it has actually finished, and
> > > then you should start seeing the improvement in read latency.
> > >
> > > > Is there any way to utilize all of the heap space to decrease the
> > > > read latency?
> > > In 0.5 you can adjust the number of keys that are cached by changing
> > > the 'KeysCachedFraction' parameter in your config file. In 0.6 you
> > > can additionally cache rows. You don't want to use up all of the
> > > memory on your box for those caches though: you'll want to leave at
> > > least 50% for your OS's disk cache, which will store the full row
> > > content.
> > >
> > > -----Original Message-----
> > > From: "Weijun Li"
> > > Sent: Tuesday, February 16, 2010 12:16pm
> > > To: cassandra-user@incubator.apache.org
> > > Subject: Re: Cassandra benchmark shows OK throughput but high read
> > > latency (> 100ms)?
> > >
> > > Thanks for the DataFileDirectory trick; I'll give it a try.
> > >
> > > Just noticed the impact of the number of data files: node A has 13
> > > data files with read latency of 20ms and node B has 27 files with
> > > read latency of 60ms. After I ran "nodeprobe compact" on node B its
> > > read latency went up to 150ms, while the read latency of node A
> > > became as low as 10ms. Is this normal behavior? I'm using the random
> > > partitioner and the hardware/JVM settings are exactly the same for
> > > these two nodes.
> > >
> > > Another problem is that Java heap usage is always 900mb out of 6GB.
> > > Is there any way to utilize all of the heap space to decrease the
> > > read latency?
> > >
> > > -Weijun
> > >
> > > On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams wrote:
> > > > On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li wrote:
> > > > > One more thought about Martin's suggestion: is it possible to put
> > > > > the data files into multiple directories that are located on
> > > > > different physical disks? This should help with the i/o
> > > > > bottleneck issue.
> > > > Yes, you can already do this, just add more DataFileDirectory
> > > > directives pointed at multiple drives.
> > > >
> > > > > Has anybody tested the row-caching feature in trunk (shoot for
> > > > > 0.6?)?
> > > > Row cache and key cache both help tremendously if your read pattern
> > > > has a decent repeat rate. Completely random io can only be so fast,
> > > > however.
> > > >
> > > > -Brandon
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
Have you tried increasing KeysCachedFraction?

On Tue, Feb 16, 2010 at 6:15 PM, Weijun Li wrote:
> Still have high read latency with 50mil records in the 2-node cluster
> (replica 2). I restarted both nodes but read latency is still above 60ms
> and disk i/o saturation is high. Tried compact and repair but they don't
> help much. When I reduced the client threads from 15 to 5 it looks a lot
> better but throughput is kind of low. I changed to 16 flushing threads
> instead of the default 8; could that cause the disk saturation issue?
>
> For benchmarks with decent throughput and latency, how many client
> threads do they use? Can anyone share the storage-conf.xml from a
> well-tuned high-volume cluster?
>
> -Weijun
>
> On Tue, Feb 16, 2010 at 10:31 AM, Stu Hood wrote:
> > > After I ran "nodeprobe compact" on node B its read latency went up to
> > > 150ms.
> > The compaction process can take a while to finish... in 0.5 you need to
> > watch the logs to figure out when it has actually finished, and then
> > you should start seeing the improvement in read latency.
> >
> > > Is there any way to utilize all of the heap space to decrease the
> > > read latency?
> > In 0.5 you can adjust the number of keys that are cached by changing
> > the 'KeysCachedFraction' parameter in your config file. In 0.6 you can
> > additionally cache rows. You don't want to use up all of the memory on
> > your box for those caches though: you'll want to leave at least 50% for
> > your OS's disk cache, which will store the full row content.
> >
> > -----Original Message-----
> > From: "Weijun Li"
> > Sent: Tuesday, February 16, 2010 12:16pm
> > To: cassandra-user@incubator.apache.org
> > Subject: Re: Cassandra benchmark shows OK throughput but high read
> > latency (> 100ms)?
> >
> > Thanks for the DataFileDirectory trick; I'll give it a try.
> >
> > Just noticed the impact of the number of data files: node A has 13 data
> > files with read latency of 20ms and node B has 27 files with read
> > latency of 60ms. After I ran "nodeprobe compact" on node B its read
> > latency went up to 150ms, while the read latency of node A became as
> > low as 10ms. Is this normal behavior? I'm using the random partitioner
> > and the hardware/JVM settings are exactly the same for these two nodes.
> >
> > Another problem is that Java heap usage is always 900mb out of 6GB. Is
> > there any way to utilize all of the heap space to decrease the read
> > latency?
> >
> > -Weijun
> >
> > On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams wrote:
> > > On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li wrote:
> > > > One more thought about Martin's suggestion: is it possible to put
> > > > the data files into multiple directories that are located on
> > > > different physical disks? This should help with the i/o bottleneck
> > > > issue.
> > > Yes, you can already do this, just add more DataFileDirectory
> > > directives pointed at multiple drives.
> > >
> > > > Has anybody tested the row-caching feature in trunk (shoot for
> > > > 0.6?)?
> > > Row cache and key cache both help tremendously if your read pattern
> > > has a decent repeat rate. Completely random io can only be so fast,
> > > however.
> > >
> > > -Brandon
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
Still have high read latency with 50mil records in the 2-node cluster
(replica 2). I restarted both nodes but read latency is still above 60ms
and disk i/o saturation is high. Tried compact and repair but they don't
help much. When I reduced the client threads from 15 to 5 it looks a lot
better but throughput is kind of low. I changed to 16 flushing threads
instead of the default 8; could that cause the disk saturation issue?

For benchmarks with decent throughput and latency, how many client threads
do they use? Can anyone share the storage-conf.xml from a well-tuned
high-volume cluster?

-Weijun

On Tue, Feb 16, 2010 at 10:31 AM, Stu Hood wrote:
> > After I ran "nodeprobe compact" on node B its read latency went up to
> > 150ms.
> The compaction process can take a while to finish... in 0.5 you need to
> watch the logs to figure out when it has actually finished, and then you
> should start seeing the improvement in read latency.
>
> > Is there any way to utilize all of the heap space to decrease the read
> > latency?
> In 0.5 you can adjust the number of keys that are cached by changing the
> 'KeysCachedFraction' parameter in your config file. In 0.6 you can
> additionally cache rows. You don't want to use up all of the memory on
> your box for those caches though: you'll want to leave at least 50% for
> your OS's disk cache, which will store the full row content.
>
> -----Original Message-----
> From: "Weijun Li"
> Sent: Tuesday, February 16, 2010 12:16pm
> To: cassandra-user@incubator.apache.org
> Subject: Re: Cassandra benchmark shows OK throughput but high read
> latency (> 100ms)?
>
> Thanks for the DataFileDirectory trick; I'll give it a try.
>
> Just noticed the impact of the number of data files: node A has 13 data
> files with read latency of 20ms and node B has 27 files with read latency
> of 60ms. After I ran "nodeprobe compact" on node B its read latency went
> up to 150ms, while the read latency of node A became as low as 10ms. Is
> this normal behavior? I'm using the random partitioner and the
> hardware/JVM settings are exactly the same for these two nodes.
>
> Another problem is that Java heap usage is always 900mb out of 6GB. Is
> there any way to utilize all of the heap space to decrease the read
> latency?
>
> -Weijun
>
> On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams wrote:
> > On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li wrote:
> > > One more thought about Martin's suggestion: is it possible to put the
> > > data files into multiple directories that are located on different
> > > physical disks? This should help with the i/o bottleneck issue.
> > Yes, you can already do this, just add more DataFileDirectory
> > directives pointed at multiple drives.
> >
> > > Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?
> > Row cache and key cache both help tremendously if your read pattern has
> > a decent repeat rate. Completely random io can only be so fast,
> > however.
> >
> > -Brandon
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
> After I ran "nodeprobe compact" on node B its read latency went up to
> 150ms.
The compaction process can take a while to finish... in 0.5 you need to
watch the logs to figure out when it has actually finished, and then you
should start seeing the improvement in read latency.

> Is there any way to utilize all of the heap space to decrease the read
> latency?
In 0.5 you can adjust the number of keys that are cached by changing the
'KeysCachedFraction' parameter in your config file. In 0.6 you can
additionally cache rows. You don't want to use up all of the memory on your
box for those caches though: you'll want to leave at least 50% for your
OS's disk cache, which will store the full row content.

-----Original Message-----
From: "Weijun Li"
Sent: Tuesday, February 16, 2010 12:16pm
To: cassandra-user@incubator.apache.org
Subject: Re: Cassandra benchmark shows OK throughput but high read latency
(> 100ms)?

Thanks for the DataFileDirectory trick; I'll give it a try.

Just noticed the impact of the number of data files: node A has 13 data
files with read latency of 20ms and node B has 27 files with read latency
of 60ms. After I ran "nodeprobe compact" on node B its read latency went up
to 150ms, while the read latency of node A became as low as 10ms. Is this
normal behavior? I'm using the random partitioner and the hardware/JVM
settings are exactly the same for these two nodes.

Another problem is that Java heap usage is always 900mb out of 6GB. Is
there any way to utilize all of the heap space to decrease the read
latency?

-Weijun

On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams wrote:
> On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li wrote:
> > One more thought about Martin's suggestion: is it possible to put the
> > data files into multiple directories that are located on different
> > physical disks? This should help with the i/o bottleneck issue.
> Yes, you can already do this, just add more DataFileDirectory directives
> pointed at multiple drives.
>
> > Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?
> Row cache and key cache both help tremendously if your read pattern has a
> decent repeat rate. Completely random io can only be so fast, however.
>
> -Brandon
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
On Tue, Feb 16, 2010 at 12:16 PM, Weijun Li wrote:
> Thanks for the DataFileDirectory trick; I'll give it a try.
>
> Just noticed the impact of the number of data files: node A has 13 data
> files with read latency of 20ms and node B has 27 files with read latency
> of 60ms. After I ran "nodeprobe compact" on node B its read latency went
> up to 150ms, while the read latency of node A became as low as 10ms. Is
> this normal behavior? I'm using the random partitioner and the
> hardware/JVM settings are exactly the same for these two nodes.

It sounds like the latency jumped to 150ms because the newly written file
was not in the OS cache.

> Another problem is that Java heap usage is always 900mb out of 6GB. Is
> there any way to utilize all of the heap space to decrease the read
> latency?

By default, Cassandra will use a 1GB heap, as set in bin/cassandra.in.sh.
You can adjust the jvm heap there via the -Xmx option, but generally you
want to balance the jvm against the OS cache. With 6GB, I would probably
give 2GB to the jvm; if you aren't having issues now, though, increasing
the jvm's memory probably won't provide any performance gains. It's worth
noting that with the row cache in 0.6 this may change.

-Brandon
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
Thanks for the DataFileDirectory trick; I'll give it a try.

Just noticed the impact of the number of data files: node A has 13 data
files with read latency of 20ms and node B has 27 files with read latency
of 60ms. After I ran "nodeprobe compact" on node B its read latency went up
to 150ms, while the read latency of node A became as low as 10ms. Is this
normal behavior? I'm using the random partitioner and the hardware/JVM
settings are exactly the same for these two nodes.

Another problem is that Java heap usage is always 900mb out of 6GB. Is
there any way to utilize all of the heap space to decrease the read
latency?

-Weijun

On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams wrote:
> On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li wrote:
> > One more thought about Martin's suggestion: is it possible to put the
> > data files into multiple directories that are located on different
> > physical disks? This should help with the i/o bottleneck issue.
> Yes, you can already do this, just add more DataFileDirectory directives
> pointed at multiple drives.
>
> > Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?
> Row cache and key cache both help tremendously if your read pattern has a
> decent repeat rate. Completely random io can only be so fast, however.
>
> -Brandon
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li wrote:
> One more thought about Martin's suggestion: is it possible to put the
> data files into multiple directories that are located on different
> physical disks? This should help with the i/o bottleneck issue.

Yes, you can already do this, just add more DataFileDirectory directives
pointed at multiple drives.

> Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?

Row cache and key cache both help tremendously if your read pattern has a
decent repeat rate. Completely random io can only be so fast, however.

-Brandon
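[Editor's note: the HTML archiver has eaten the angle-bracketed tag name in Brandon's reply; the thread's earlier mention of the "DataFileDirectory trick" indicates the directive he means. A sketch of what the relevant storage-conf.xml section might look like in 0.5 — the paths are made up, and the exact element names should be checked against your own config file:

```xml
<!-- Hypothetical storage-conf.xml fragment: one DataFileDirectory
     per physical disk, so Cassandra alternates sstable i/o
     (memtable flushes, compaction) across spindles. -->
<DataFileDirectories>
    <DataFileDirectory>/disk1/cassandra/data</DataFileDirectory>
    <DataFileDirectory>/disk2/cassandra/data</DataFileDirectory>
</DataFileDirectories>
```

This matches Martin's later advice in the thread to configure multiple logical drives as different data directories.]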
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
On Tue, Feb 16, 2010 at 11:50 AM, Weijun Li wrote:
> Dumped 50mil records into my 2-node cluster overnight and made sure that
> there aren't many data files (around 30 only), per Martin's suggestion.
> The size of the data directory is 63GB. Now when I read records from the
> cluster the read latency is still ~44ms, with no write happening during
> the read. And iostat shows that the disk (RAID10, 4x 250GB 15k SAS) is
> saturated:
>
> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
> sda       47.67   67.67 190.33  17.00  23933.33  677.33   118.70     5.24  25.25   4.64  96.17
> sda1       0.00    0.00   0.00   0.00      0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sda2      47.67   67.67 190.33  17.00  23933.33  677.33   118.70     5.24  25.25   4.64  96.17
> sda3       0.00    0.00   0.00   0.00      0.00    0.00     0.00     0.00   0.00   0.00   0.00
>
> CPU usage is low.
>
> Does this mean disk i/o is the bottleneck in my case? Will it help if I
> increase KCF to cache the entire sstable index?

That's exactly what this means. Disk is slow :(

> Also, this is almost a read-only test, and in reality our write/read
> ratio is close to 1:1, so I'm guessing read latency will go even higher
> in that case because it will be difficult for Cassandra to find a good
> moment to compact the data files that are busy being written.

Reads that cause disk seeks are always going to slow things down, since
disk seeks are inherently the slowest operation in a machine. Writes in
Cassandra should always be fast, as they do not cause any disk seeks.

-Brandon
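[Editor's note: a quick parse of the quoted `iostat -x` device line turns the column soup into the numbers that matter for Brandon's diagnosis. The column positions match the 11-column layout quoted in this thread; newer sysstat versions print different columns, so adjust the indices if reusing:

```python
# Parse the iostat -x line for sda quoted in the message above.
line = ("sda 47.67 67.67 190.33 17.00 23933.33 677.33 "
        "118.70 5.24 25.25 4.64 96.17")
f = line.split()
device = f[0]
reads_s, writes_s = float(f[3]), float(f[4])            # r/s, w/s
await_ms, svctm_ms, util = float(f[9]), float(f[10]), float(f[11])

print(f"{device}: {reads_s + writes_s:.0f} IOPS, "
      f"await {await_ms} ms, util {util}%")
# %util near 100 with await much larger than svctm means requests are
# queueing at the device: the disk, not Cassandra, is the bottleneck.
saturated = util > 90
print("saturated:", saturated)
```

~207 IOPS at 96% utilization with 25ms average wait is consistent with the ~44ms read latency reported: each read is paying for disk queueing.]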
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
One more thought about Martin's suggestion: is it possible to put the data
files into multiple directories that are located on different physical
disks? This should help with the i/o bottleneck issue.

Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?

-Weijun

On Tue, Feb 16, 2010 at 9:50 AM, Weijun Li wrote:
> Dumped 50mil records into my 2-node cluster overnight and made sure that
> there aren't many data files (around 30 only), per Martin's suggestion.
> The size of the data directory is 63GB. Now when I read records from the
> cluster the read latency is still ~44ms, with no write happening during
> the read. And iostat shows that the disk (RAID10, 4x 250GB 15k SAS) is
> saturated:
>
> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
> sda       47.67   67.67 190.33  17.00  23933.33  677.33   118.70     5.24  25.25   4.64  96.17
> sda1       0.00    0.00   0.00   0.00      0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sda2      47.67   67.67 190.33  17.00  23933.33  677.33   118.70     5.24  25.25   4.64  96.17
> sda3       0.00    0.00   0.00   0.00      0.00    0.00     0.00     0.00   0.00   0.00   0.00
>
> CPU usage is low.
>
> Does this mean disk i/o is the bottleneck in my case? Will it help if I
> increase KCF to cache the entire sstable index?
>
> Also, this is almost a read-only test, and in reality our write/read
> ratio is close to 1:1, so I'm guessing read latency will go even higher
> in that case because it will be difficult for Cassandra to find a good
> moment to compact the data files that are busy being written.
>
> Thanks,
> -Weijun
>
> On Tue, Feb 16, 2010 at 6:06 AM, Brandon Williams wrote:
> > On Tue, Feb 16, 2010 at 2:32 AM, Dr. Martin Grabmüller <
> > martin.grabmuel...@eleven.de> wrote:
> > > In my tests I have observed that good read latency depends on keeping
> > > the number of data files low. In my current test setup, I have stored
> > > 1.9 TB of data on a single node, which is in 21 data files, and read
> > > latency is between 10 and 60ms (for small reads; larger reads of
> > > course take more time). In earlier stages of my test, I had up to
> > > 5000 data files, and read performance was quite bad: my configured
> > > 10-second RPC timeout was regularly encountered.
> >
> > I believe it is known that crossing sstables is O(NlogN) but I'm unable
> > to find the ticket on this at the moment. Perhaps Stu Hood will jump in
> > and enlighten me, but in any case I believe
> > https://issues.apache.org/jira/browse/CASSANDRA-674 will eventually
> > solve it.
> >
> > Keeping write volume low enough that compaction can keep up is one
> > solution, and throwing hardware at the problem is another, if
> > necessary. Also, the row caching in trunk (soon to be 0.6 we hope)
> > helps greatly for repeat hits.
> >
> > -Brandon
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
Dumped 50mil records into my 2-node cluster overnight and made sure that
there aren't many data files (around 30 only), per Martin's suggestion. The
size of the data directory is 63GB. Now when I read records from the
cluster the read latency is still ~44ms, with no write happening during the
read. And iostat shows that the disk (RAID10, 4x 250GB 15k SAS) is
saturated:

Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda       47.67   67.67 190.33  17.00  23933.33  677.33   118.70     5.24  25.25   4.64  96.17
sda1       0.00    0.00   0.00   0.00      0.00    0.00     0.00     0.00   0.00   0.00   0.00
sda2      47.67   67.67 190.33  17.00  23933.33  677.33   118.70     5.24  25.25   4.64  96.17
sda3       0.00    0.00   0.00   0.00      0.00    0.00     0.00     0.00   0.00   0.00   0.00

CPU usage is low.

Does this mean disk i/o is the bottleneck in my case? Will it help if I
increase KCF to cache the entire sstable index?

Also, this is almost a read-only test, and in reality our write/read ratio
is close to 1:1, so I'm guessing read latency will go even higher in that
case because it will be difficult for Cassandra to find a good moment to
compact the data files that are busy being written.

Thanks,
-Weijun

On Tue, Feb 16, 2010 at 6:06 AM, Brandon Williams wrote:
> On Tue, Feb 16, 2010 at 2:32 AM, Dr. Martin Grabmüller <
> martin.grabmuel...@eleven.de> wrote:
> > In my tests I have observed that good read latency depends on keeping
> > the number of data files low. In my current test setup, I have stored
> > 1.9 TB of data on a single node, which is in 21 data files, and read
> > latency is between 10 and 60ms (for small reads; larger reads of course
> > take more time). In earlier stages of my test, I had up to 5000 data
> > files, and read performance was quite bad: my configured 10-second RPC
> > timeout was regularly encountered.
>
> I believe it is known that crossing sstables is O(NlogN) but I'm unable
> to find the ticket on this at the moment. Perhaps Stu Hood will jump in
> and enlighten me, but in any case I believe
> https://issues.apache.org/jira/browse/CASSANDRA-674 will eventually solve
> it.
>
> Keeping write volume low enough that compaction can keep up is one
> solution, and throwing hardware at the problem is another, if necessary.
> Also, the row caching in trunk (soon to be 0.6 we hope) helps greatly for
> repeat hits.
>
> -Brandon
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
On Tue, Feb 16, 2010 at 2:32 AM, Dr. Martin Grabmüller <
martin.grabmuel...@eleven.de> wrote:
> In my tests I have observed that good read latency depends on keeping the
> number of data files low. In my current test setup, I have stored 1.9 TB
> of data on a single node, which is in 21 data files, and read latency is
> between 10 and 60ms (for small reads; larger reads of course take more
> time). In earlier stages of my test, I had up to 5000 data files, and
> read performance was quite bad: my configured 10-second RPC timeout was
> regularly encountered.

I believe it is known that crossing sstables is O(NlogN) but I'm unable to
find the ticket on this at the moment. Perhaps Stu Hood will jump in and
enlighten me, but in any case I believe
https://issues.apache.org/jira/browse/CASSANDRA-674 will eventually solve
it.

Keeping write volume low enough that compaction can keep up is one
solution, and throwing hardware at the problem is another, if necessary.
Also, the row caching in trunk (soon to be 0.6 we hope) helps greatly for
repeat hits.

-Brandon
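[Editor's note: whatever the exact complexity Brandon is recalling, a toy model already shows why thousands of data files are fatal for reads. The sketch below is an illustrative uniform-placement model, not Cassandra's actual read path: it assumes the key lives in exactly one sstable and files are probed one by one until it is found:

```python
# Toy read-amplification model: a key lives in exactly one of n
# sstables, equally likely in each, and files are probed
# sequentially, so the expected number of index probes per read is
# (n + 1) / 2. Each probe can cost a disk seek when nothing is cached.
def expected_probes(n: int) -> float:
    return (n + 1) / 2

# Martin's compacted node, Weijun's node B, Martin's worst case:
for n in (21, 27, 5000):
    print(f"{n:>5} sstables -> {expected_probes(n):6.1f} probes/read")
```

Going from ~11 probes at 21 files to ~2500 at 5000 files makes the 10-second RPC timeouts Martin saw unsurprising.]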
RE: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
> The other problem is: if I keep mixed write and read (e.g., 8 write
> threads plus 7 read threads) against the 2-node cluster continuously, the
> read latency will go up gradually (along with the size of the Cassandra
> data files), and at the end it will become ~40ms (up from ~20ms) even
> with only 15 threads. During this process the data files grew from 1.6GB
> to over 3GB even though I kept writing the same key/values to Cassandra.
> It seems that Cassandra keeps appending to sstable data files and will
> only clean them up during node cleanup or compaction (please correct me
> if this is incorrect).

In my tests I have observed that good read latency depends on keeping the
number of data files low. In my current test setup, I have stored 1.9 TB of
data on a single node, which is in 21 data files, and read latency is
between 10 and 60ms (for small reads; larger reads of course take more
time). In earlier stages of my test, I had up to 5000 data files, and read
performance was quite bad: my configured 10-second RPC timeout was
regularly encountered.

The number of data files is reduced whenever Cassandra compacts them, which
happens either automatically, when enough data files have been generated by
continuous writing, or when triggered by nodeprobe compact, cleanup etc.

So my advice is to keep the write throughput low enough that Cassandra can
keep up compacting the data files. For high write throughput, you need fast
drives, if possible on different RAIDs, which are configured as different
DataDirectories for Cassandra. On my setup (6 drives in a single RAID-5
configuration), compaction is quite slow: sequential reads/writes are done
at 150 MB/s, whereas during compaction, read/write performance drops to a
few MB/s. You definitely want more than one logical drive, so that
Cassandra can alternate between them when flushing memtables and when
compacting.

I would really be interested in whether my observations are shared by other
people on this list.

Thanks,
Martin
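[Editor's note: Martin's throughput numbers make it easy to see why compaction can fall behind a sustained write load. Purely illustrative arithmetic using his quoted rates, with "a few MB/s" taken as an assumed 5 MB/s and the 60GB data-set size from elsewhere in the thread:

```python
# Time to rewrite a 60 GB data set at healthy sequential throughput
# vs the degraded rate Martin reports during compaction on his
# RAID-5. The 5 MB/s figure is an assumption for "a few MB/s".
data_mb = 60 * 1024
for label, mb_s in (("sequential (150 MB/s)", 150),
                    ("during compaction (5 MB/s)", 5)):
    hours = data_mb / mb_s / 3600
    print(f"{label:>28}: {hours:.1f} h")
```

A rewrite that would take minutes at full sequential speed stretches to hours when compaction i/o competes with reads on the same spindles, which is exactly why he recommends separate drives for the data directories.]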
RE: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
It seems that read latency is sensitive to the number of threads (or thrift
clients): after reducing the number of threads to 15, read latency
decreased to ~20ms.

The other problem is: if I keep mixed write and read (e.g., 8 write threads
plus 7 read threads) against the 2-node cluster continuously, the read
latency will go up gradually (along with the size of the Cassandra data
files), and at the end it will become ~40ms (up from ~20ms) even with only
15 threads. During this process the data files grew from 1.6GB to over 3GB
even though I kept writing the same key/values to Cassandra. It seems that
Cassandra keeps appending to sstable data files and will only clean them up
during node cleanup or compaction (please correct me if this is incorrect).

Here are my test settings:

JVM xmx: 6GB
KCF: 0.3
Memtable: 512MB
Number of records: 1 million (payload is 1000 bytes)

I used JMX and iostat to watch the cluster but can't find any clue for the
increasing read latency issue: JVM memory, GC, CPU usage, tpstats and io
saturation all seem to be clean. One exception is that the wait time in
iostat goes up quickly once in a while, but it is a small number most of
the time.

Another thing I noticed is that the JVM doesn't use more than 1GB of memory
(out of the 6GB I specified for the JVM) even though I set KCF to 0.3 and
increased the memtable size to 512MB.

Did I miss anything here? How can I diagnose this kind of increasing read
latency issue? Is there any performance tuning guide available?

Thanks,
-Weijun

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Sunday, February 14, 2010 6:22 PM
To: cassandra-user@incubator.apache.org
Subject: Re: Cassandra benchmark shows OK throughput but high read latency
(> 100ms)?

are you i/o bound? what is your on-disk data set size? what does iostat
tell you? http://spyced.blogspot.com/2010/01/linux-performance-basics.html

do you have a lot of pending compactions? (tpstats will tell you)

have you increased KeysCachedFraction?

On Sun, Feb 14, 2010 at 8:18 PM, Weijun Li wrote:
> Hello,
>
> I saw some Cassandra benchmark reports mentioning read latency that is
> less than 50ms or even 30ms, but my benchmark with 0.5 doesn't seem to
> support that. Here are my settings:
>
> Nodes: 2 machines, 2x2.5GHZ Xeon Quad Core (thus 8 cores), 8GB RAM
> ReplicationFactor=2, Partitioner=Random
> JVM Xmx: 4GB
> Memory table size: 512MB (haven't figured out how to enable the binary
> memtable so I set both memtable numbers to 512mb)
> Flushing threads: 2-4
> Payload: ~1000 bytes, 3 columns in one CF.
> Read/write time measure: get startTime right before each Java thrift
> call; transport objects are pre-created upon creation of each thread.
>
> The result shows that total write throughput is around 2000/sec (for 2
> nodes in the cluster), which is not bad, but read throughput is just
> around 750/sec. However, for each thread the average read latency is more
> than 100ms. I'm running 100 threads for the test and each thread randomly
> picks a node for each thrift call, so the reads/sec of each thread is
> just around 7.5, meaning the duration of each thrift call is
> 1000/7.5=133ms. Without replication the cluster write throughput is
> around 3300/s and read throughput is around 1400/s, so the read latency
> is still around 70ms without replication.
>
> Is there anything wrong in my benchmark test? How can I achieve a
> reasonable read latency (< 30ms)?
>
> Thanks,
> -Weijun
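[Editor's note: the 133 ms figure in the quoted benchmark follows directly from dividing aggregate throughput across the client threads; a quick check of that arithmetic, with all numbers taken from the message:

```python
# Per-call latency implied by aggregate throughput and thread count:
# each of the 100 closed-loop client threads issues its next request
# only after the previous one returns.
threads = 100
total_reads_per_s = 750
per_thread = total_reads_per_s / threads   # 7.5 reads/s per thread
latency_ms = 1000 / per_thread
print(f"{per_thread} reads/s per thread -> {latency_ms:.0f} ms per call")
```

This also shows why reducing the thread count "improved" latency later in the thread: with fewer closed-loop clients the disk queue is shorter, so each individual call waits less even though total throughput drops.]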
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
are you i/o bound? what is your on-disk data set size? what does iostat
tell you? http://spyced.blogspot.com/2010/01/linux-performance-basics.html

do you have a lot of pending compactions? (tpstats will tell you)

have you increased KeysCachedFraction?

On Sun, Feb 14, 2010 at 8:18 PM, Weijun Li wrote:
> Hello,
>
> I saw some Cassandra benchmark reports mentioning read latency that is
> less than 50ms or even 30ms, but my benchmark with 0.5 doesn't seem to
> support that. Here are my settings:
>
> Nodes: 2 machines, 2x2.5GHZ Xeon Quad Core (thus 8 cores), 8GB RAM
> ReplicationFactor=2, Partitioner=Random
> JVM Xmx: 4GB
> Memory table size: 512MB (haven't figured out how to enable the binary
> memtable so I set both memtable numbers to 512mb)
> Flushing threads: 2-4
> Payload: ~1000 bytes, 3 columns in one CF.
> Read/write time measure: get startTime right before each Java thrift
> call; transport objects are pre-created upon creation of each thread.
>
> The result shows that total write throughput is around 2000/sec (for 2
> nodes in the cluster), which is not bad, but read throughput is just
> around 750/sec. However, for each thread the average read latency is more
> than 100ms. I'm running 100 threads for the test and each thread randomly
> picks a node for each thrift call, so the reads/sec of each thread is
> just around 7.5, meaning the duration of each thrift call is
> 1000/7.5=133ms. Without replication the cluster write throughput is
> around 3300/s and read throughput is around 1400/s, so the read latency
> is still around 70ms without replication.
>
> Is there anything wrong in my benchmark test? How can I achieve a
> reasonable read latency (< 30ms)?
>
> Thanks,
> -Weijun