Re: Using a separate commit log drive was 4x slower
Thanks for all the feedback, I'll be back in 2 weeks and will pick up then.

-JD

On Tue, Aug 10, 2010 at 3:45 PM, Peter Schuller wrote:
> > Yeah, it has a BBU, and it is charged and on..
> > Very odd behavior, I'm stumped.
>
> I advise double-checking RAID volume settings and ensuring that the policy
> is truly such that the write cache is used. This may also be a
> function of kernel driver settings, depending on what RAID
> controller/kernel version you have (for example, make sure that an
> fsync() doesn't result in asking the RAID controller to flush its caches
> regardless of BBU state). In any case, these stats:
>
> Device:    rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
> cciss/c0d0   0.00  908.50  0.00  110.50    0.00  8152.00     73.77      0.57   5.20   4.93  54.50
>
> ... are highly inconsistent with write-back caching, under the
> assumption that the writes are indeed the sequential writes to the
> commit log. Write counts in the ~100/second ballpark, with no reads, an
> average request size of 74 sectors, an average service time of 4.93 ms,
> and a utilization of 54% *really really* sound like the I/O is going all
> the way down to the platter. Either that, or the RAID firmware/driver
> is not doing its job properly.
>
> I've attached a small script (note: an UGLY hack since I just whipped it
> up, not a proper tool, but it does the job for this) that you can run
> to test it:
>
> ./writelat.py /path/to/file_to_write_to  # warning: the file will be
> truncated/destroyed if it exists
>
> If you run this on an idle system and you're truly doing write-back
> caching, you should see numbers BELOW 1 (i.e., sub-millisecond times)
> (though you can probably ignore the first sample).
>
> It writes one 8 KB chunk of data, fsyncs, writes another 8 KB, fsyncs,
> etc. I predict you'll see numbers in the 3-7 millisecond range.
>
> Sample output for me with a plain old SATA drive (and ZFS on FreeBSD)
> follows.
>
> 33.8141918182
> 8.32605361938
> 8.44812393188
> 8.44788551331
> 8.40210914612
> 8.4490776062
> 8.38303565979
> 8.57901573181
> 8.20922851562
> 8.21614265442
> 10.0581645966
> 8.37683677673
> 8.50605964661
> 8.376121521
> 9.86790657043
> 8.43715667725
> 8.45789909363
> 8.40520858765
> 8.4171295166
> 8.46195220947
> 8.41498374939
> 8.46099853516
> 8.44287872314
> 8.43000411987
> 8.455991745
>
> --
> / Peter Schuller
Re: Using a separate commit log drive was 4x slower
> Yeah, it has a BBU, and it is charged and on..
> Very odd behavior, I'm stumped.

I advise double-checking RAID volume settings and ensuring that the policy
is truly such that the write cache is used. This may also be a
function of kernel driver settings, depending on what RAID
controller/kernel version you have (for example, make sure that an
fsync() doesn't result in asking the RAID controller to flush its caches
regardless of BBU state). In any case, these stats:

Device:    rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
cciss/c0d0   0.00  908.50  0.00  110.50    0.00  8152.00     73.77      0.57   5.20   4.93  54.50

... are highly inconsistent with write-back caching, under the assumption
that the writes are indeed the sequential writes to the commit log. Write
counts in the ~100/second ballpark, with no reads, an average request size
of 74 sectors, an average service time of 4.93 ms, and a utilization of 54%
*really really* sound like the I/O is going all the way down to the
platter. Either that, or the RAID firmware/driver is not doing its job
properly.

I've attached a small script (note: an UGLY hack since I just whipped it
up, not a proper tool, but it does the job for this) that you can run to
test it:

./writelat.py /path/to/file_to_write_to  # warning: the file will be
truncated/destroyed if it exists

If you run this on an idle system and you're truly doing write-back
caching, you should see numbers BELOW 1 (i.e., sub-millisecond times)
(though you can probably ignore the first sample).

It writes one 8 KB chunk of data, fsyncs, writes another 8 KB, fsyncs, etc.
I predict you'll see numbers in the 3-7 millisecond range.

Sample output for me with a plain old SATA drive (and ZFS on FreeBSD)
follows.
33.8141918182
8.32605361938
8.44812393188
8.44788551331
8.40210914612
8.4490776062
8.38303565979
8.57901573181
8.20922851562
8.21614265442
10.0581645966
8.37683677673
8.50605964661
8.376121521
9.86790657043
8.43715667725
8.45789909363
8.40520858765
8.4171295166
8.46195220947
8.41498374939
8.46099853516
8.44287872314
8.43000411987
8.455991745

--
/ Peter Schuller

#!/usr/bin/env python
# ugly hack, reader beware (Python 2)
import os
import sys
import time

COUNT = 25
BLOCK = ''.join('x' for _ in xrange(8 * 1024))

f = file(sys.argv[1], 'w')
for _ in xrange(COUNT):
    start_time = time.time()
    f.write(BLOCK)
    f.flush()
    os.fsync(f.fileno())
    stop_time = time.time()
    # uncomment to test re-writing the same block repeatedly
    # f.seek(0)
    print (stop_time - start_time) * 1000
f.close()
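[Editor's note] The attached script is Python 2 (xrange, the file builtin, the print statement). A rough Python 3 equivalent of the same measurement — write an 8 KiB block, fsync, time each round trip — might look like the sketch below; fsync_latencies is my own name, not part of the original thread.

```python
#!/usr/bin/env python3
# Sketch of a Python 3 port of writelat.py: time write+fsync round trips.
import os
import sys
import time

BLOCK = b"x" * 8 * 1024  # one 8 KiB chunk per write, as in the original


def fsync_latencies(path, count=25):
    """Write `count` blocks, fsync after each one, return latencies in ms."""
    latencies = []
    with open(path, "wb") as f:  # truncates the file, like the original
        for _ in range(count):
            start = time.perf_counter()
            f.write(BLOCK)
            f.flush()
            os.fsync(f.fileno())
            latencies.append((time.perf_counter() - start) * 1000)
    return latencies


if __name__ == "__main__" and len(sys.argv) > 1:
    for ms in fsync_latencies(sys.argv[1]):
        print(ms)
```

As with the original, sub-millisecond numbers on an idle volume suggest a working write-back cache; a few milliseconds per sample suggests each fsync() is waiting for the platter.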
Re: Using a separate commit log drive was 4x slower
Other activity, e.g. syslog? Journaling at the FS level? You could try
making a small partition formatted as ext2.

On Tue, Aug 10, 2010 at 5:10 PM, Jeremy Davis wrote:
> Yeah, it has a BBU, and it is charged and on..
> Very odd behavior, I'm stumped.
>
> -JD
>
> On Tue, Aug 10, 2010 at 12:28 AM, Peter Schuller
> wrote:
>>
>> I have no explanation for the slower reads, but I have a hypothesis
>> about the writes.
>>
>> Your iostat shows:
>>
>> > Device:    rrqm/s  wrqm/s    r/s      w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
>> > cciss/c0d0   0.00  908.50   0.00   110.50     0.00  8152.00     73.77      0.57   5.20   4.93  54.50
>> > cciss/c0d1   0.00    0.00  16.50     0.00  1424.00     0.00     86.30      0.10   6.06   2.73   4.50
>> > dm-0         0.00    0.00   0.00  1019.00     0.00  8152.00      8.00      6.25   6.13   0.53  54.50
>> > dm-1         0.00    0.00   0.00     0.00     0.00     0.00      0.00      0.00   0.00   0.00   0.00
>>
>> So that's keeping the disk busy around 50-60% of the time. This seems
>> roughly consistent with your commit batch window being set to 5 ms and
>> the system drive NOT being backed by a battery-backed cache (such
>> that an fsync() actually does have to wait for the data to be on the
>> platter).
>>
>> Is your non-system drive backed by the BBU? (I'm not sure whether the
>> controllers support having some volumes backed by a BBU-protected
>> cache and not others.)
>>
>> If the other volume is BBU-backed, then maybe the slowdown on writes
>> is due to this. In any case, whatever the situation was before, the
>> above stats do seem roughly consistent with a write-through fsync()
>> and a batch window of 5 ms, given sufficient concurrency to achieve
>> the throughput you're seeing. On the other hand, "roughly consistent"
>> is not very precise, and the original performance on the RAIDed device
>> is probably also roughly consistent with this ;)
>>
>> --
>> / Peter Schuller

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
Re: Using a separate commit log drive was 4x slower
Yeah, it has a BBU, and it is charged and on..
Very odd behavior, I'm stumped.

-JD

On Tue, Aug 10, 2010 at 12:28 AM, Peter Schuller <
peter.schul...@infidyne.com> wrote:

> I have no explanation for the slower reads, but I have a hypothesis
> about the writes.
>
> Your iostat shows:
>
> > Device:    rrqm/s  wrqm/s    r/s      w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
> > cciss/c0d0   0.00  908.50   0.00   110.50     0.00  8152.00     73.77      0.57   5.20   4.93  54.50
> > cciss/c0d1   0.00    0.00  16.50     0.00  1424.00     0.00     86.30      0.10   6.06   2.73   4.50
> > dm-0         0.00    0.00   0.00  1019.00     0.00  8152.00      8.00      6.25   6.13   0.53  54.50
> > dm-1         0.00    0.00   0.00     0.00     0.00     0.00      0.00      0.00   0.00   0.00   0.00
>
> So that's keeping the disk busy around 50-60% of the time. This seems
> roughly consistent with your commit batch window being set to 5 ms and
> the system drive NOT being backed by a battery-backed cache (such
> that an fsync() actually does have to wait for the data to be on the
> platter).
>
> Is your non-system drive backed by the BBU? (I'm not sure whether the
> controllers support having some volumes backed by a BBU-protected
> cache and not others.)
>
> If the other volume is BBU-backed, then maybe the slowdown on writes
> is due to this. In any case, whatever the situation was before, the
> above stats do seem roughly consistent with a write-through fsync()
> and a batch window of 5 ms, given sufficient concurrency to achieve
> the throughput you're seeing. On the other hand, "roughly consistent"
> is not very precise, and the original performance on the RAIDed device
> is probably also roughly consistent with this ;)
>
> --
> / Peter Schuller
Re: Using a separate commit log drive was 4x slower
I have no explanation for the slower reads, but I have a hypothesis
about the writes.

Your iostat shows:

> Device:    rrqm/s  wrqm/s    r/s      w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
> cciss/c0d0   0.00  908.50   0.00   110.50     0.00  8152.00     73.77      0.57   5.20   4.93  54.50
> cciss/c0d1   0.00    0.00  16.50     0.00  1424.00     0.00     86.30      0.10   6.06   2.73   4.50
> dm-0         0.00    0.00   0.00  1019.00     0.00  8152.00      8.00      6.25   6.13   0.53  54.50
> dm-1         0.00    0.00   0.00     0.00     0.00     0.00      0.00      0.00   0.00   0.00   0.00

So that's keeping the disk busy around 50-60% of the time. This seems
roughly consistent with your commit batch window being set to 5 ms and
the system drive NOT being backed by a battery-backed cache (such that
an fsync() actually does have to wait for the data to be on the platter).

Is your non-system drive backed by the BBU? (I'm not sure whether the
controllers support having some volumes backed by a BBU-protected cache
and not others.)

If the other volume is BBU-backed, then maybe the slowdown on writes is
due to this. In any case, whatever the situation was before, the above
stats do seem roughly consistent with a write-through fsync() and a
batch window of 5 ms, given sufficient concurrency to achieve the
throughput you're seeing. On the other hand, "roughly consistent" is not
very precise, and the original performance on the RAIDed device is
probably also roughly consistent with this ;)

--
/ Peter Schuller
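[Editor's note] Peter's "roughly consistent" reading of the cciss/c0d0 line can be checked with a bit of arithmetic (my own sanity check, not from the thread): if every write really waits for the platter, utilization should be roughly requests/sec times per-request service time.

```python
# Back-of-the-envelope check of the cciss/c0d0 line from the iostat above.
writes_per_sec = 110.5    # w/s
svctm_ms = 4.93           # average service time per request, ms
util_pct = 54.5           # %util reported by iostat
avgrq_sectors = 73.77     # average request size in 512-byte sectors

# Expected utilization if each request occupies the device for svctm_ms:
expected_util_pct = writes_per_sec * svctm_ms / 1000 * 100
print(f"expected util: {expected_util_pct:.1f}%  (iostat reports {util_pct}%)")

# Average request size in KiB (one sector = 512 bytes):
print(f"avg request: {avgrq_sectors * 512 / 1024:.1f} KiB")
```

The expected utilization lands almost exactly on the reported 54.5%, which is what you would see if fsync()s were going all the way to disk at ~5 ms each rather than returning from a battery-backed write cache.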
Using a separate commit log drive was 4x slower
I have a weird one to share with the list: using a separate commit log drive
dropped my performance a lot more than I would expect...

I'm doing perf tests on 3 identical machines but with 3 different drive sets
(SAS 15K, SAS 10K, and SATA 7.5K). Each system has a single system disk (the
same drive type as the data set) and the data set itself (a 5-drive RAID-0).
I'm using Cassandra 0.6.4 with Java 1.6_20. This is all RF=1, CL=1.

I inserted an initial data set of 100K keys (each with 1000 columns of
random data (1000 bytes)), compacted, and restarted Cassandra. Then I did a
write baseline where I have 500 threads inserting a random 1000 bytes on a
random key/column combination (always 1 column per request). If my commit
log is on my RAID'd data drive I get about 19K columns/inserts per second.
If I then add some random reads (30 threads doing a random key/column read,
always 1 column per read) I get ~8K reads/writes per second:

Host     Write baseline (cols/sec)   Write (cols/sec)   Read (cols/sec)
SAS15K   18800                       8700               8100
SAS10K   15800                       7600               7300
SATA     13200                       7300               8000

Now, if I do the same thing but with the commit log on the system disk, I
get:

Host     Write baseline (cols/sec)   Write (cols/sec)   Read (cols/sec)
SAS15K   12600                       2200               1600
SAS10K   11400                       3000               1900
SATA      9900                       3100               1800

I would think that the write level would stay at about the baseline, and I
have no idea why the read level would be so low. Any thoughts?
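[Editor's note] The slowdown factors implied by the two mixed-workload tables (simple arithmetic added for reference, not part of the original mail) bear out the "4x slower" in the subject line:

```python
# Mixed-workload throughput (columns/sec) with the commit log on the RAID
# data volume vs. on the single system disk, from the tables above.
raid = {"SAS15K": (8700, 8100), "SAS10K": (7600, 7300), "SATA": (7300, 8000)}
system_disk = {"SAS15K": (2200, 1600), "SAS10K": (3000, 1900), "SATA": (3100, 1800)}

for host in raid:
    w_ratio = raid[host][0] / system_disk[host][0]
    r_ratio = raid[host][1] / system_disk[host][1]
    print(f"{host}: writes {w_ratio:.1f}x slower, reads {r_ratio:.1f}x slower")
```

The SAS 15K machine shows roughly a 4x drop on writes and an even larger drop on reads, which is the part nobody on the thread fully explains.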
Some iostat (while using the separate commit log):

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          38.33   0.00     4.72     2.48    0.00  54.47

Device:    rrqm/s  wrqm/s    r/s      w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
cciss/c0d0   0.00  908.50   0.00   110.50     0.00  8152.00     73.77      0.57   5.20   4.93  54.50
cciss/c0d1   0.00    0.00  16.50     0.00  1424.00     0.00     86.30      0.10   6.06   2.73   4.50
dm-0         0.00    0.00   0.00  1019.00     0.00  8152.00      8.00      6.25   6.13   0.53  54.50
dm-1         0.00    0.00   0.00     0.00     0.00     0.00      0.00      0.00   0.00   0.00   0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          37.42   0.00     2.37     3.54    0.00  56.68

Device:    rrqm/s  wrqm/s    r/s      w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
cciss/c0d0   0.00  854.50   0.00   124.50     0.00  7816.00     62.78      0.61   4.94   4.82  60.00
cciss/c0d1   0.00    0.00  32.00     0.00  4032.00     0.00    126.00      0.21   6.72   3.12  10.00
dm-0         0.00    0.00   0.00   979.50     0.00  7836.00      8.00      5.57   5.69   0.61  60.00
dm-1         0.00    0.00   0.00     0.00     0.00     0.00      0.00      0.00   0.00   0.00   0.00

Some top (while using the separate commit log):

top - 15:56:38 up 6 days, 21:26, 9 users, load average: 17.09, 7.92, 6.87
Tasks: 358 total, 1 running, 357 sleeping, 0 stopped, 0 zombie
Cpu(s): 35.4%us, 1.6%sy, 0.0%ni, 59.4%id, 3.2%wa, 0.0%hi, 0.5%si, 0.0%st
Mem:  24729068k total, 19789732k used,  4939336k free,   132056k buffers
Swap:  5849080k total,    54976k used,  5794104k free, 14839884k cached

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
19411 root  20   0  142g  8.2g 5.0g S   599 34.9 423:25.42 java

[A storage-conf.xml was included, but its element tags were stripped in the
archive. The surviving values appear to include: cluster name "Test
Cluster", RackUnawareStrategy with a replication factor of 1,
EndPointSnitch, AllowAllAuthenticator, OrderPreservingPartitioner, commit
log directory /data/commitlog, data directory /data/data, seed 127.0.0.1,
listen/Thrift address 10.2.60.20 (ports 7000/9160), DiskAccessMode auto,
CommitLogSync "batch" with a batch window of 5 ms, and 864000 as the final
value.]
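[Editor's note] The configuration dump ends with the commit log sync settings Peter's analysis hinges on ("batch" with a 5 ms window). In a Cassandra 0.6 storage-conf.xml those settings would look roughly like the fragment below (element names taken from the 0.6 sample config; a reconstructed fragment, not the poster's actual file):

```xml
<!-- Commit log on its own directory/drive -->
<CommitLogDirectory>/data/commitlog</CommitLogDirectory>

<!-- In batch mode, each mutation waits for the commit log fsync(); writes
     arriving within the window are group-committed together. With no
     write-back cache on the drive, every batch pays a full disk sync. -->
<CommitLogSync>batch</CommitLogSync>
<CommitLogSyncBatchWindowInMS>5</CommitLogSyncBatchWindowInMS>
```

Switching CommitLogSync to "periodic" (or ensuring the commit log volume is behind the BBU-backed write cache) would be the usual way to avoid paying platter latency on every batch.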