Re: Mysteriously poor write performance
Sorry for the delayed reply... I've been tracking some issues which
cause high latency on our test machines, and they may be responsible
for your problems as well. Could you retry those runs with the same
debugging and 'journal dio' set to false?

Thanks for your patience,
-Sam

On Sat, Mar 24, 2012 at 12:09 PM, Andrey Korolyov wrote:
> http://xdel.ru/downloads/ceph-logs-dbg/
>
> On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just wrote:
>> (CCing the list)
>>
>> Actually, could you re-do the rados bench run with 'debug journal
>> = 20' along with the other debugging? That should give us better
>> information.
>>
>> -Sam
>>
>> On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov wrote:
>>> Hi Sam,
>>>
>>> Can you please suggest where to start profiling the osd? If the
>>> bottleneck were related to something as simple as directio speed,
>>> I'm sure I would have caught it long ago, even just by cross-checking
>>> against the results of other benchmarks on the host system. I've just
>>> tried tmpfs under both journals; it gives a small boost, as expected
>>> because of the near-zero i/o delay. Maybe the chunk distribution
>>> mechanism does not work well on such a small number of nodes, but
>>> right now I don't have enough hardware nodes to prove or disprove
>>> that.
Re: Mysteriously poor write performance
http://xdel.ru/downloads/ceph-logs-dbg/

On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just wrote:
> (CCing the list)
>
> Actually, could you re-do the rados bench run with 'debug journal
> = 20' along with the other debugging? That should give us better
> information.
>
> -Sam
Re: Mysteriously poor write performance
(CCing the list)

Actually, could you re-do the rados bench run with 'debug journal = 20'
along with the other debugging? That should give us better information.

-Sam

On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov wrote:
> Hi Sam,
>
> Can you please suggest where to start profiling the osd? If the
> bottleneck were related to something as simple as directio speed,
> I'm sure I would have caught it long ago, even just by cross-checking
> against the results of other benchmarks on the host system. I've just
> tried tmpfs under both journals; it gives a small boost, as expected
> because of the near-zero i/o delay. Maybe the chunk distribution
> mechanism does not work well on such a small number of nodes, but
> right now I don't have enough hardware nodes to prove or disprove
> that.
Re: Mysteriously poor write performance
random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
random-rw: (groupid=0, jobs=1): err= 0: pid=9647
  write: io=163840KB, bw=37760KB/s, iops=9439, runt= 4339msec
    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
  cpu : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued r/w: total=0/40960, short=0/0
    lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
    lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%

On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just wrote:
> Our journal writes are actually sequential. Could you send FIO
> results for sequential 4k writes to osd.0's journal and osd.1's journal?
> -Sam
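As a rough cross-check of the fio numbers posted in this thread (an editorial sketch, not something the participants ran): with ioengine=sync and a single job, only one write is outstanding at a time (fio reports "IO depths : 1=100.0%"), so the IOPS should be close to the reciprocal of the average completion latency. The function name below is made up for illustration; the input figures are copied from the two fio outputs in the thread.

```python
def iops_from_latency(avg_clat_usec: float) -> float:
    """Expected IOPS when exactly one I/O is outstanding at a time."""
    return 1_000_000 / avg_clat_usec


# (average clat in usec, IOPS reported by fio), copied from the thread
runs = {
    "sequential 4k (rw=write)": (104.19, 9439),
    "random 4k (rw=randwrite)": (1714.52, 581),
}

for name, (clat_usec, reported_iops) in runs.items():
    predicted = iops_from_latency(clat_usec)
    print(f"{name}: predicted ~{predicted:.0f} iops, fio reported {reported_iops}")
```

Both predictions land within a couple of percent of what fio reported, which suggests that in these single-threaded tests the per-write latency, rather than raw disk bandwidth, sets the rate.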
Re: Mysteriously poor write performance
Our journal writes are actually sequential. Could you send FIO
results for sequential 4k writes to osd.0's journal and osd.1's journal?
-Sam

On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov wrote:
> FIO output for the journal partition with directio enabled seems good
> (same results for ext4 on other single SATA disks).
>
> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>   write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>     clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>     bw (KB/s) : min= 552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>   cpu : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     issued r/w: total=0/40960, short=0/0
>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>     lat (msec): 500=0.04%
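For readers who want a quick look at per-write latency for sequential 4k writes without fio, here is a minimal Python sketch. It is a stand-in under stated assumptions, not the method used in the thread: it opens the file with O_DSYNC (each write returns only after the data reaches stable storage) instead of O_DIRECT, because direct I/O needs aligned buffers and is not supported on every filesystem. The function name, file name and parameters are hypothetical. Point it at a scratch file on the journal's filesystem, never at a live journal.

```python
import os
import tempfile
import time


def sync_write_latencies(path, block_size=4096, count=100):
    """Write `count` sequential blocks with O_DSYNC; return per-write latencies in seconds."""
    buf = b"\0" * block_size
    # Assumption: O_DSYNC is used here as an approximation of direct, durable writes.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC, 0o600)
    latencies = []
    try:
        for _ in range(count):
            start = time.perf_counter()
            written = os.write(fd, buf)
            latencies.append(time.perf_counter() - start)
            if written != block_size:
                raise OSError("short write")
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    # Hypothetical scratch location; use a directory on the journal's filesystem instead.
    with tempfile.TemporaryDirectory() as scratch:
        lat = sync_write_latencies(os.path.join(scratch, "latency-test.bin"))
        print(f"avg {sum(lat) / len(lat) * 1e6:.0f} us, max {max(lat) * 1e6:.0f} us")
```

Because the caching and alignment behaviour of O_DSYNC differs from O_DIRECT, the numbers are only an approximation of what fio would report with direct I/O.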
Re: Mysteriously poor write performance
(CCing the list)

So, the problem isn't the bandwidth. Before we respond to the client,
we write the operation to the journal. In this case, that operation
is taking >1s per operation on osd.1. Both rbd and rados bench will
only allow a limited number of ops in flight at a time, so this
latency is killing your throughput. For comparison, the latency for
writing to the journal on osd.0 is < .3s. Can you measure direct io
latency for writes to your osd.1 journal file?
-Sam

On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov wrote:
> Oh, you may be confused by the Zabbix metrics - the y-axis means
> Megabytes/s, not Megabits.
>
> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov wrote:
>> [global]
>>     log dir = /ceph/out
>>     log_file = ""
>>     logger dir = /ceph/log
>>     pid file = /ceph/out/$type$id.pid
>> [mds]
>>     pid file = /ceph/out/$name.pid
>>     lockdep = 1
>>     mds log max segments = 2
>> [osd]
>>     lockdep = 1
>>     filestore_xattr_use_omap = 1
>>     osd data = /ceph/dev/osd$id
>>     osd journal = /ceph/meta/journal
>>     osd journal size = 100
>> [mon]
>>     lockdep = 1
>>     mon data = /ceph/dev/mon$id
>> [mon.0]
>>     host = 172.20.1.32
>>     mon addr = 172.20.1.32:6789
>> [mon.1]
>>     host = 172.20.1.33
>>     mon addr = 172.20.1.33:6789
>> [mon.2]
>>     host = 172.20.1.35
>>     mon addr = 172.20.1.35:6789
>> [osd.0]
>>     host = 172.20.1.32
>> [osd.1]
>>     host = 172.20.1.33
>> [mds.a]
>>     host = 172.20.1.32
>>
>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
>>
>> Simple performance tests on those fs show ~133Mb/s for /ceph and
>> metadata/. Also, neither machine holds anything else which may
>> impact the osd.
>>
>> Also please note the following:
>>
>> http://i.imgur.com/ZgFdO.png
>>
>> The first two peaks are related to running rados bench, then comes
>> cluster recreation, an automated debian install, and the final peaks
>> are the dd test. I can certainly produce more precise graphs, but the
>> current one is probably enough to show the situation - rbd is
>> utilizing about a quarter of the possible bandwidth (if we can count
>> rados bench as 100%).
>>
>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just wrote:
>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>> osd.1... Could you post your ceph.conf? Might there be a problem
>>> with the osd.1 journal disk?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov wrote:
>>>> Oh, sorry - they probably inherited rights from log files, fixed.
>>>>
>>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just wrote:
>>>>> I get 403 Forbidden when I try to download any of the files.
>>>>> -Sam
>>>>>
>>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov wrote:
>>>>>> http://xdel.ru/downloads/ceph-logs/
>>>>>>
>>>>>> 1/ contains logs related to bench initiated at the osd0 machine
>>>>>> and 2/ - at osd1.
>>>>>>
>>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just wrote:
>>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.
>>>>>>> Can you post osd.1's logs?
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov wrote:
>>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>>>>>>>
>>>>>>>> Sometimes 'cur MB/s' shows zero during rados bench, even with
>>>>>>>> all debug output disabled and log_file set to the empty value,
>>>>>>>> hope it's okay.
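The relationship Sam describes between journal latency and throughput can be sketched with Little's law: sustained throughput is roughly (ops in flight x op size) / per-op latency. The short Python sketch below is an editorial illustration, not something from the thread. The function name is hypothetical, and the figure of 16 ops in flight is an assumption (the thread never states the concurrency rados bench was run with). The write size and average latency are the rados bench figures reported elsewhere in this thread.

```python
MIB = 1024 * 1024


def throughput_mib_per_s(ops_in_flight: int, op_size_bytes: int, latency_s: float) -> float:
    """Little's law: throughput = concurrency * op size / per-op latency."""
    return ops_in_flight * op_size_bytes / latency_s / MIB


WRITE_SIZE = 4194304      # "Write size" from the rados bench output
AVG_LATENCY = 0.989608    # "Average Latency" in seconds, same output
ASSUMED_IN_FLIGHT = 16    # assumption: concurrency is not stated in the thread

estimate = throughput_mib_per_s(ASSUMED_IN_FLIGHT, WRITE_SIZE, AVG_LATENCY)
print(f"estimated {estimate:.1f} MB/s; rados bench reported 64.622 MB/s")

# How throughput would scale at the latencies mentioned in the thread
for latency in (0.3, 1.0, 1.5):
    mib = throughput_mib_per_s(ASSUMED_IN_FLIGHT, WRITE_SIZE, latency)
    print(f"latency {latency:.1f} s -> ~{mib:.0f} MB/s")
```

Under that assumption the estimate comes out at about 64.7 MB/s, close to the 64.622 MB/s that rados bench reported, which is consistent with the run being bound by per-op latency rather than by disk or network bandwidth.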
Re: Mysteriously poor write performance
Can you set osd and filestore debugging to 20, restart the osds, run
rados bench as before, and post the logs?
-Sam Just

On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov wrote:
> rados bench 60 write -p data
>
> Total time run: 61.217676
> Total writes made: 989
> Write size: 4194304
> Bandwidth (MB/sec): 64.622
>
> Average Latency: 0.989608
> Max latency: 2.21701
> Min latency: 0.255315
>
> Here is a snippet from the osd log; the write size seems okay.
>
> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
> active+clean] removing repgather(0x31b5360 applying 10'83 rep_tid=597
> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write
> 1220608~4096] 0.17eb9fd8) v4)
> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
> active+clean] q front is repgather(0x31b5360 applying 10'83
> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
> rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)
>
> Sorry for my previous question about rbd chunks, it was really stupid :)
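The osd_op(... [write OFFSET~LENGTH] ...) entries that show up in the debug logs quoted in this thread can be tallied with a few lines of Python to see what write sizes are actually reaching the OSD. This is an illustrative sketch only: the regular expression is inferred from the log lines quoted here and the format described in the thread, so it may not match every Ceph version's output, and the function name is hypothetical. The sample line is an abbreviated copy of one log entry from the thread.

```python
import re
from collections import Counter

# Matches the "[write OFFSET~LENGTH]" part of an osd_op log entry.
WRITE_RE = re.compile(r"osd_op\(.*?\[write\s+(\d+)~(\d+)\]")


def write_size_histogram(lines):
    """Count how many logged writes there are of each length (in bytes)."""
    sizes = Counter()
    for line in lines:
        for _offset, length in WRITE_RE.findall(line):
            sizes[int(length)] += 1
    return sizes


# Abbreviated sample; the leading "..." stands for the omitted log prefix.
sample = [
    "... op=osd_op(client.4599.0:2533 rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)",
]

for length, count in sorted(write_size_histogram(sample).items()):
    print(f"{length} bytes: {count} write(s)")
```

To use it on a real log, pass an open file object instead of `sample`; the function only needs an iterable of lines.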
Re: Mysteriously poor write performance
rados bench 60 write -p data

Total time run:        61.217676
Total writes made:     989
Write size:            4194304
Bandwidth (MB/sec):    64.622
Average Latency:       0.989608
Max latency:           2.21701
Min latency:           0.255315

Here is a snippet from the osd log; the write size seems to be okay.

2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 active+clean] removing repgather(0x31b5360 applying 10'83 rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)
2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 active+clean] q front is repgather(0x31b5360 applying 10'83 rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)

Sorry for my previous question about rbd chunks, it was really stupid :)

On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin wrote:
> One thing to check is the size of the writes that are actually being sent by
> rbd. The guest is probably splitting them into relatively small (128 or
> 256k) writes. Ideally it would be sending 4k writes, and this should be a
> lot faster.
>
> You can see the writes being sent by adding debug_ms=1 to the client or osd.
> The format is osd_op(.*[write OFFSET~LENGTH]).
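As a rough cross-check of the rados bench figures above, the reported bandwidth can be recomputed from the write count, and Little's law gives the average number of writes in flight. This is only a sketch: the function names are invented for illustration, and the remark about 16 concurrent operations rests on my belief that this is the rados bench default, which the thread does not state.

```python
# Figures copied from the rados bench output in this message.
total_time_s = 61.217676
writes_made = 989
write_size_bytes = 4194304
avg_latency_s = 0.989608

MIB = 1024 * 1024


def bandwidth_mib_s(writes, size_bytes, seconds):
    """Average throughput in MiB/s over the whole run."""
    return writes * size_bytes / MIB / seconds


def implied_in_flight(writes, seconds, latency_s):
    """Little's law: mean ops in flight = completion rate * mean latency."""
    return writes / seconds * latency_s


bw = bandwidth_mib_s(writes_made, write_size_bytes, total_time_s)
depth = implied_in_flight(writes_made, total_time_s, avg_latency_s)
print(f"bandwidth: {bw:.3f} MiB/s")
print(f"implied ops in flight: {depth:.1f}")
```

If the bench really was running with about 16 writes outstanding, the result says the run was latency-bound: each 4 MB write took roughly a second to complete, so raising throughput means lowering per-write latency rather than sending more data.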
Re: Mysteriously poor write performance
Thanks to Greg, I have noticed a very strange thing: the data pool is filled with a bunch of objects like rb.0.0.04db with a typical size of 4194304, while the original pool for the guest os has a size of only 112 (created as 40g). It seems that something went wrong, because on 0.42 I had more impressive performance on cheaper hardware. At first I blamed the recent crash and recreated the cluster from scratch about an hour ago, but those objects were created in a bare data/ pool with only one vm.

On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin wrote:
> One thing to check is the size of the writes that are actually being sent by
> rbd. The guest is probably splitting them into relatively small (128 or
> 256k) writes. Ideally it would be sending 4k writes, and this should be a
> lot faster.
>
> You can see the writes being sent by adding debug_ms=1 to the client or osd.
> The format is osd_op(.*[write OFFSET~LENGTH]).
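For what it is worth, a pile of 4194304-byte rb.* objects is what I would expect if rbd stores an image as fixed-size chunks. The sketch below shows the arithmetic under two assumptions that are not confirmed in the thread: that the chunk size is the 4 MiB seen in the pool, and that "40g" means 40 GiB. The helper names are hypothetical.

```python
OBJECT_SIZE = 4194304  # bytes; the object size reported in the message above


def object_count(image_bytes, object_size=OBJECT_SIZE):
    """Upper bound on data objects for a fully written image (ceiling division)."""
    return -(-image_bytes // object_size)


def locate(offset, object_size=OBJECT_SIZE):
    """Map an image byte offset to (object index, offset inside that object)."""
    return divmod(offset, object_size)


image_bytes = 40 * 1024**3  # assumption: "40g" means 40 GiB
print(object_count(image_bytes))
print(locate(5 * 1024**3 + 4096))
```

Under those assumptions a fully written 40g image would map to 10240 objects, so seeing many of them after a dd run would not by itself indicate a problem.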
Re: Mysteriously poor write performance
On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
> Nope, I'm using KVM for rbd guests. I did notice that Sage mentioned
> too small a value, and I changed it to 64M before posting my previous
> message, with no success: both 8M and this value cause a performance
> drop. When I tried to write a small amount of data, comparable to the
> writeback cache size (both on the raw device and on ext3 with the sync
> option), I got the following results:

I just want to clarify that the writeback window isn't a full writeback
cache - it doesn't affect reads, and does not help with request merging etc.
It simply allows a bunch of writes to be in flight while acking the write to
the guest immediately. We're working on a full-fledged writeback cache
to replace the writeback window.

> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct
> (almost the same without oflag here and in the following samples)
> 10+0 records in
> 10+0 records out
> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
> 20+0 records in
> 20+0 records out
> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
> 30+0 records in
> 30+0 records out
> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>
> and so on. The reference test with bs=1M and count=2000 has slightly
> worse results _with_ the writeback cache than without, as I've
> mentioned before. Here are the bench results; they're almost equal on
> both nodes:
>
> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

One thing to check is the size of the writes that are actually being sent by
rbd. The guest is probably splitting them into relatively small (128 or
256k) writes. Ideally it would be sending 4k writes, and this should be a
lot faster.

You can see the writes being sent by adding debug_ms=1 to the client or osd.
The format is osd_op(.*[write OFFSET~LENGTH]).
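To act on the debug_ms=1 suggestion, the write lengths can be tallied straight from the log. The following is a minimal sketch, assuming log entries contain the osd_op(... [write OFFSET~LENGTH] ...) pattern described above; the regular expression, the function name and the abbreviated sample line are mine, not part of Ceph.

```python
import re
from collections import Counter

# Matches the "[write OFFSET~LENGTH]" part of an osd_op log entry.
WRITE_RE = re.compile(r"osd_op\(.*?\[write (\d+)~(\d+)\]")


def write_sizes(lines):
    """Count how many logged writes there are of each length in bytes."""
    sizes = Counter()
    for line in lines:
        for _offset, length in WRITE_RE.findall(line):
            sizes[int(length)] += 1
    return sizes


# Abbreviated from an osd log entry posted elsewhere in this thread.
sample = [
    "... op=osd_op(client.4599.0:2533 rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)",
]
for length, count in sorted(write_sizes(sample).items()):
    print(f"{length:>10} bytes x {count}")
```

Feeding it a real log (for example the lines of an osd log file) gives a quick histogram of write sizes, which is easier to read than scanning individual entries.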
Re: Mysteriously poor write performance
On Monday, March 19, 2012 at 11:13 AM, Andrey Korolyov wrote:
> Nope, I'm using KVM for rbd guests.

Ah, okay - I'm not sure what your reference to dom0 and mon2 meant, then?

> I did notice that Sage mentioned too small a value, and I changed it
> to 64M before posting my previous message, with no success: both 8M
> and this value cause a performance drop. When I tried to write a small
> amount of data, comparable to the writeback cache size (both on the
> raw device and on ext3 with the sync option), I got the following
> results:
>
> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct
> (almost the same without oflag here and in the following samples)
> 10+0 records in
> 10+0 records out
> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
> 20+0 records in
> 20+0 records out
> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
> 30+0 records in
> 30+0 records out
> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>
> and so on. The reference test with bs=1M and count=2000 has slightly
> worse results _with_ the writeback cache than without, as I've
> mentioned before. Here are the bench results; they're almost equal on
> both nodes:
>
> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

Okay, this is all a little odd to me. Can you send along your ceph.conf (along with any other pool config changes you've made) and the output from a rados bench (60 seconds or so)?
-Greg
Re: Mysteriously poor write performance
Nope, I'm using KVM for rbd guests. I did notice that Sage mentioned too small a value, and I changed it to 64M before posting my previous message, with no success: both 8M and this value cause a performance drop. When I tried to write a small amount of data, comparable to the writeback cache size (both on the raw device and on ext3 with the sync option), I got the following results:

dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct
(almost the same without oflag here and in the following samples)
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
20+0 records in
20+0 records out
209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
30+0 records in
30+0 records out
314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

and so on. The reference test with bs=1M and count=2000 has slightly worse results _with_ the writeback cache than without, as I've mentioned before. Here are the bench results; they're almost equal on both nodes:

bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

Also, because I have not mentioned it before: network performance is enough to sustain fair gigabit connectivity with MTU 1500. It does not seem to be an interrupt problem or something like it: even with ceph-osd, the ethernet card queues and the kvm instance pinned to different sets of cores, nothing changes.

On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum wrote:
> It sounds like maybe you're using Xen? The "rbd writeback window" option only
> works for userspace rbd implementations (eg, KVM).
> If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000
> (~8MB).
>
> What options are you running dd with? If you run a rados bench from both
> machines, what do the results look like?
> Also, can you do the ceph osd bench on each of your OSDs, please?
> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
> -Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
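The three dd runs in this message can be compared more directly by computing the rate for the extra data in each longer run. This is a sketch under a stated assumption: that each longer run starts out like the shorter one before it, so the difference between consecutive runs approximates the cost of the additional 100 MiB. The runs were separate invocations, so treat the result as an estimate only.

```python
# (bytes copied, seconds) for the three dd runs quoted in this message.
runs = [
    (104857600, 0.864404),
    (209715200, 6.67271),
    (314572800, 12.4806),
]


def mb_per_s(nbytes, seconds):
    """Throughput in MB/s, using 10**6 bytes per MB as dd does."""
    return nbytes / seconds / 1e6


for nbytes, seconds in runs:
    print(f"{nbytes // 2**20:4d} MiB: {mb_per_s(nbytes, seconds):6.1f} MB/s overall")

# Rate for the additional data between consecutive runs (see assumption above).
marginal = [
    mb_per_s(b2 - b1, t2 - t1)
    for (b1, t1), (b2, t2) in zip(runs, runs[1:])
]
for rate in marginal:
    print(f"extra 100 MiB: {rate:.1f} MB/s")
```

Read this way, the first 100 MiB completes at about 121 MB/s and every further 100 MiB at roughly 18 MB/s, which looks like a buffer being filled quickly and then draining at a much lower sustained rate.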
Re: Mysteriously poor write performance
It sounds like maybe you're using Xen? The "rbd writeback window" option only works for userspace rbd implementations (eg, KVM).
If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000 (~8MB).

What options are you running dd with? If you run a rados bench from both machines, what do the results look like?
Also, can you do the ceph osd bench on each of your OSDs, please?
(http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
-Greg

On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
> More strangely, write speed drops by fifteen percent when this
> option is set in the vm's config (unlike the result from
> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
> As I mentioned, I'm using 0.43, but due to crashed osds, ceph has been
> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
> under heavy load.
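Since the window values in this exchange are easy to misread, here is a small sketch that prints a few candidate values in friendlier units. It assumes the option is a plain byte count, which is what the "8192000 (~8MB)" figure suggests; the helper name is made up, and 64 MiB is included because that value is mentioned elsewhere in the thread.

```python
def describe(window_bytes):
    """Render a byte count in KiB and MiB for quick comparison."""
    kib = window_bytes / 1024
    mib = window_bytes / 1024**2
    return f"{window_bytes} bytes = {kib:.0f} KiB = {mib:.2f} MiB"


for value in (8192, 8192000, 81920000, 64 * 1024**2):
    print(describe(value))
```

On that reading, 8192 would be only 8 KiB, 8192000 is a little under 8 MiB, and a window of about 80 MB needs 81920000.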
Re: Mysteriously poor write performance
More strangely, write speed drops by fifteen percent when this option is set in the vm's config (unlike the result from http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html). As I mentioned, I'm using 0.43, but due to crashed osds, ceph has been recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes under heavy load.

On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil wrote:
> Can you try setting
>
>     rbd writeback window = 8192000
>
> or similar, and see what kind of effect that has? I suspect it'll speed
> up dd; I'm less sure about ext3.
>
> Thanks!
> sage
Re: Mysteriously poor write performance
On Sat, 17 Mar 2012, Andrey Korolyov wrote:
> Can ml please suggest anything to try to improve performance?

Can you try setting

    rbd writeback window = 8192000

or similar, and see what kind of effect that has? I suspect it'll speed up dd; I'm less sure about ext3.

Thanks!
sage
Mysteriously poor write performance
Hi,

I've done some performance tests on the following configuration:

mon0, osd0 and mon1, osd1 - two twelve-core r410s with 32G ram; mon2 - a dom0 with three dedicated cores and 1.5G, mostly idle. The first three disks on each r410 are arranged into raid0 and hold the osd data, while the fourth holds the os and the osd's journal partition; all ceph-related stuff is mounted on ext4 without barriers.

First, I've noticed a difference between benchmark performance and write speed through rbd from a small kvm instance running on one of the first two machines: when the bench gave me about 110 MB/s, writing zeros to the raw block device inside the vm with dd topped out at about 45 MB/s, and for the vm's fs (ext4 with default options) performance drops to ~23 MB/s. Things got worse when I started a second vm on the second host and tried to continue the same dd tests simultaneously: performance was fairly divided in half for each instance :). Enabling jumbo frames, playing with cpu affinity for ceph and vm instances, and trying different TCP congestion protocols had no effect at all; with DCTCP I have a slightly smoother network load graph and that's all.

Can the ml please suggest anything to try to improve performance?

ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2