Re: Mysteriously poor write performance

2012-03-24 Thread Andrey Korolyov
http://xdel.ru/downloads/ceph-logs-dbg/

On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just sam.j...@dreamhost.com wrote:
 (CCing the list)

 Actually, could you re-do the rados bench run with 'debug journal
 = 20' along with the other debugging?  That should give us better
 information.

 -Sam

 On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hi Sam,

  Can you please suggest where to start profiling the osd? If the
  bottleneck were related to something as simple as directio speed,
  I'm sure I would have caught it long ago, if only by cross-checking
  against the results of other benchmarks on the host system. I've just
  tried tmpfs under both journals; it gives a small boost, as expected
  given the near-zero I/O delay. Maybe the chunk distribution mechanism
  does not work well on such a small number of nodes, but right now I
  don't have enough hardware nodes to prove or disprove that.
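
  (A sketch of that tmpfs setup, for reference - the mount point and size
  here are illustrative assumptions, and the journal has to be recreated
  after switching paths:)

  mount -t tmpfs -o size=256m tmpfs /mnt/journal-tmpfs
  # in ceph.conf, per osd:
  #   osd journal = /mnt/journal-tmpfs/journal.$id
  #   osd journal size = 100
  # then recreate the journal and restart, e.g.: ceph-osd -i 0 --mkjournal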

 On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov and...@xdel.ru wrote:
 random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
 Starting 1 process
 Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
 random-rw: (groupid=0, jobs=1): err= 0: pid=9647
  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
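
 (For reproducibility: a fio job file along these lines should produce a run
 like the one above - reconstructed from the output header; the size follows
 from 40960 x 4k = 160 MB, and the filename is an assumption:)

 [random-rw]
 filename=/ceph/meta/journal-test
 rw=write
 bs=4k
 size=160m
 ioengine=sync
 iodepth=2
 direct=1

 (Swapping rw=write for rw=randwrite gives the random-write variant quoted
 further down.)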


 On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just sam.j...@dreamhost.com wrote:
  Our journal writes are actually sequential.  Could you send FIO
  results for sequential 4k writes to osd.0's journal and osd.1's journal?
 -Sam

 On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov and...@xdel.ru wrote:
  FIO output for the journal partition with directio enabled looks good
  (same results for ext4 on other single SATA disks).

 random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
 Starting 1 process
 Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
 random-rw: (groupid=0, jobs=1): err= 0: pid=21926
  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
     bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
     lat (msec): 500=0.04%



 On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just sam.j...@dreamhost.com 
 wrote:
 (CCing the list)

 So, the problem isn't the bandwidth.  Before we respond to the client,
 we write the operation to the journal.  In this case, that operation
 is taking 1s per operation on osd.1.  Both rbd and rados bench will
 only allow a limited number of ops in flight at a time, so this
 latency is killing your throughput.  For comparison, the latency for
 writing to the journal on osd.0 is < .3s.  Can you measure direct io
 latency for writes to your osd.1 journal file?
 -Sam

 On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov and...@xdel.ru wrote:
  Oh, you may be confused by the Zabbix metrics - the y-axis is in
  megabytes/s, not megabits.

 On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov and...@xdel.ru 
 wrote:
 [global]
       log dir = /ceph/out
       log_file = 
       logger dir = /ceph/log
       pid file = /ceph/out/$type$id.pid
 [mds]
       pid file = /ceph/out/$name.pid
       lockdep = 1
       mds log max segments = 2
 [osd]
       lockdep = 1
       filestore_xattr_use_omap = 1
       osd data = /ceph/dev/osd$id
       osd journal = /ceph/meta/journal
       osd journal size = 100
 [mon]
       lockdep = 1
       mon data = /ceph/dev/mon$id
 [mon.0]
       host = 172.20.1.32
       mon addr = 172.20.1.32:6789
 [mon.1]
       host = 172.20.1.33
       mon addr = 172.20.1.33:6789
 [mon.2]
       host = 172.20.1.35
       mon addr = 172.20.1.35:6789
 [osd.0]
       host = 172.20.1.32
 [osd.1]
       host = 172.20.1.33
 [mds.a]
       host = 172.20.1.32

 /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
 /dev/mapper/system-cephmeta on /ceph/meta type ext4 
 (rw,barrier=0,user_xattr)
 Simple performance tests on those fs 

Re: Mysteriously poor write performance

2012-03-23 Thread Samuel Just
(CCing the list)

Actually, could you re-do the rados bench run with 'debug journal
= 20' along with the other debugging?  That should give us better
information.

-Sam
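
(A minimal sketch of one way to set that up, assuming the [osd] section of
the ceph.conf quoted below - option names follow the 0.43-era format:)

[osd]
      debug osd = 20
      debug filestore = 20
      debug journal = 20

Then restart the osds and re-run, e.g.: rados bench 60 write -p data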

On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hi Sam,

 Can you please suggest where to start profiling the osd? If the
 bottleneck were related to something as simple as directio speed,
 I'm sure I would have caught it long ago, if only by cross-checking
 against the results of other benchmarks on the host system. I've just
 tried tmpfs under both journals; it gives a small boost, as expected
 given the near-zero I/O delay. Maybe the chunk distribution mechanism
 does not work well on such a small number of nodes, but right now I
 don't have enough hardware nodes to prove or disprove that.

 On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov and...@xdel.ru wrote:
 random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
 Starting 1 process
 Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
 random-rw: (groupid=0, jobs=1): err= 0: pid=9647
  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%


 On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just sam.j...@dreamhost.com wrote:
  Our journal writes are actually sequential.  Could you send FIO
  results for sequential 4k writes to osd.0's journal and osd.1's journal?
 -Sam

 On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov and...@xdel.ru wrote:
  FIO output for the journal partition with directio enabled looks good
  (same results for ext4 on other single SATA disks).

 random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
 Starting 1 process
 Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
 random-rw: (groupid=0, jobs=1): err= 0: pid=21926
  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
     lat (msec): 500=0.04%



 On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just sam.j...@dreamhost.com 
 wrote:
 (CCing the list)

 So, the problem isn't the bandwidth.  Before we respond to the client,
 we write the operation to the journal.  In this case, that operation
 is taking 1s per operation on osd.1.  Both rbd and rados bench will
 only allow a limited number of ops in flight at a time, so this
 latency is killing your throughput.  For comparison, the latency for
 writing to the journal on osd.0 is < .3s.  Can you measure direct io
 latency for writes to your osd.1 journal file?
 -Sam

 On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov and...@xdel.ru wrote:
  Oh, you may be confused by the Zabbix metrics - the y-axis is in
  megabytes/s, not megabits.

 On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov and...@xdel.ru wrote:
 [global]
       log dir = /ceph/out
       log_file = 
       logger dir = /ceph/log
       pid file = /ceph/out/$type$id.pid
 [mds]
       pid file = /ceph/out/$name.pid
       lockdep = 1
       mds log max segments = 2
 [osd]
       lockdep = 1
       filestore_xattr_use_omap = 1
       osd data = /ceph/dev/osd$id
       osd journal = /ceph/meta/journal
       osd journal size = 100
 [mon]
       lockdep = 1
       mon data = /ceph/dev/mon$id
 [mon.0]
       host = 172.20.1.32
       mon addr = 172.20.1.32:6789
 [mon.1]
       host = 172.20.1.33
       mon addr = 172.20.1.33:6789
 [mon.2]
       host = 172.20.1.35
       mon addr = 172.20.1.35:6789
 [osd.0]
       host = 172.20.1.32
 [osd.1]
       host = 172.20.1.33
 [mds.a]
       host = 172.20.1.32

 /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
 /dev/mapper/system-cephmeta on /ceph/meta type ext4 
 (rw,barrier=0,user_xattr)
  Simple performance tests on those fs show ~133MB/s for /ceph and
  metadata/. Also, neither machine hosts anything else which might
  impact the osds.

 Also please note 

Re: Mysteriously poor write performance

2012-03-22 Thread Samuel Just
Our journal writes are actually sequential.  Could you send FIO
results for sequential 4k writes to osd.0's journal and osd.1's journal?
-Sam

On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov and...@xdel.ru wrote:
 FIO output for the journal partition with directio enabled looks good
 (same results for ext4 on other single SATA disks).

 random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
 Starting 1 process
 Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
 random-rw: (groupid=0, jobs=1): err= 0: pid=21926
  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
     lat (msec): 500=0.04%



 On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just sam.j...@dreamhost.com wrote:
 (CCing the list)

 So, the problem isn't the bandwidth.  Before we respond to the client,
 we write the operation to the journal.  In this case, that operation
 is taking 1s per operation on osd.1.  Both rbd and rados bench will
 only allow a limited number of ops in flight at a time, so this
 latency is killing your throughput.  For comparison, the latency for
 writing to the journal on osd.0 is < .3s.  Can you measure direct io
 latency for writes to your osd.1 journal file?
 -Sam

 On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov and...@xdel.ru wrote:
  Oh, you may be confused by the Zabbix metrics - the y-axis is in
  megabytes/s, not megabits.

 On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov and...@xdel.ru wrote:
 [global]
       log dir = /ceph/out
       log_file = 
       logger dir = /ceph/log
       pid file = /ceph/out/$type$id.pid
 [mds]
       pid file = /ceph/out/$name.pid
       lockdep = 1
       mds log max segments = 2
 [osd]
       lockdep = 1
       filestore_xattr_use_omap = 1
       osd data = /ceph/dev/osd$id
       osd journal = /ceph/meta/journal
       osd journal size = 100
 [mon]
       lockdep = 1
       mon data = /ceph/dev/mon$id
 [mon.0]
       host = 172.20.1.32
       mon addr = 172.20.1.32:6789
 [mon.1]
       host = 172.20.1.33
       mon addr = 172.20.1.33:6789
 [mon.2]
       host = 172.20.1.35
       mon addr = 172.20.1.35:6789
 [osd.0]
       host = 172.20.1.32
 [osd.1]
       host = 172.20.1.33
 [mds.a]
       host = 172.20.1.32

 /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
 /dev/mapper/system-cephmeta on /ceph/meta type ext4 
 (rw,barrier=0,user_xattr)
  Simple performance tests on those fs show ~133MB/s for /ceph and
  metadata/. Also, neither machine hosts anything else which might
  impact the osds.

  Also, please note the following:

 http://i.imgur.com/ZgFdO.png

  The first two peaks correspond to running rados bench, then comes cluster
  recreation and an automated Debian install, and the final peaks are the dd
  test. I can certainly produce more precise graphs, but the current one is
  probably enough to show the situation - rbd is utilizing about a quarter
  of the available bandwidth (if we count rados bench as 100%).

 On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just sam.j...@dreamhost.com 
 wrote:
  Hmm, there seem to be writes taking as long as 1.5s to hit the journal on
  osd.1...  Could you post your ceph.conf?  Might there be a problem
  with the osd.1 journal disk?
 -Sam

 On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov and...@xdel.ru wrote:
  Oh, sorry - they probably inherited permissions from the log files; fixed.

 On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just sam.j...@dreamhost.com 
 wrote:
 I get 403 Forbidden when I try to download any of the files.
 -Sam

 On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov and...@xdel.ru 
 wrote:
 http://xdel.ru/downloads/ceph-logs/

  1/ contains logs from the bench initiated on the osd0 machine, and 2/
  from the one initiated on osd1.

 On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just sam.j...@dreamhost.com 
 wrote:
 Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
 post osd.1's logs?
 -Sam

 On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov and...@xdel.ru 
 wrote:
 Here, please: http://xdel.ru/downloads/ceph.log.gz

  Sometimes 'cur MB/s' shows zero during rados bench, even with all debug
  output disabled and log_file set to an empty value; I hope that's okay.

 On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just 
 sam.j...@dreamhost.com wrote:
 Can you set osd and filestore debugging to 20, restart the osds, run
 rados bench as before, and post the logs?
 -Sam Just

 On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov and...@xdel.ru 
 wrote:
 rados bench 60 write -p data
 skip
 Total 

Re: Mysteriously poor write performance

2012-03-21 Thread Samuel Just
(CCing the list)

So, the problem isn't the bandwidth.  Before we respond to the client,
we write the operation to the journal.  In this case, that operation
is taking 1s per operation on osd.1.  Both rbd and rados bench will
only allow a limited number of ops in flight at a time, so this
latency is killing your throughput.  For comparison, the latency for
writing to the journal on osd.0 is < .3s.  Can you measure direct io
latency for writes to your osd.1 journal file?
-Sam
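
(One way to measure that - a sketch: write a scratch file on the same
filesystem as the journal with O_DIRECT and look at the per-op latency; the
/ceph/meta path matches the ceph.conf below, the file name is an assumption:)

fio --name=jlat --filename=/ceph/meta/latency-test --rw=write --bs=4k \
    --size=40m --ioengine=sync --direct=1

(The clat line in the output gives the per-write latency distribution;
dd if=/dev/zero of=/ceph/meta/latency-test bs=4k count=10000 oflag=direct
works as a cruder check of the same thing.)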

On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov and...@xdel.ru wrote:
 Oh, you may be confused by the Zabbix metrics - the y-axis is in
 megabytes/s, not megabits.

 On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov and...@xdel.ru wrote:
 [global]
       log dir = /ceph/out
       log_file = 
       logger dir = /ceph/log
       pid file = /ceph/out/$type$id.pid
 [mds]
       pid file = /ceph/out/$name.pid
       lockdep = 1
       mds log max segments = 2
 [osd]
       lockdep = 1
       filestore_xattr_use_omap = 1
       osd data = /ceph/dev/osd$id
       osd journal = /ceph/meta/journal
       osd journal size = 100
 [mon]
       lockdep = 1
       mon data = /ceph/dev/mon$id
 [mon.0]
       host = 172.20.1.32
       mon addr = 172.20.1.32:6789
 [mon.1]
       host = 172.20.1.33
       mon addr = 172.20.1.33:6789
 [mon.2]
       host = 172.20.1.35
       mon addr = 172.20.1.35:6789
 [osd.0]
       host = 172.20.1.32
 [osd.1]
       host = 172.20.1.33
 [mds.a]
       host = 172.20.1.32

 /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
 /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
 Simple performance tests on those fs show ~133MB/s for /ceph and
 metadata/. Also, neither machine hosts anything else which might
 impact the osds.

 Also, please note the following:

 http://i.imgur.com/ZgFdO.png

 The first two peaks correspond to running rados bench, then comes cluster
 recreation and an automated Debian install, and the final peaks are the dd
 test. I can certainly produce more precise graphs, but the current one is
 probably enough to show the situation - rbd is utilizing about a quarter
 of the available bandwidth (if we count rados bench as 100%).

 On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just sam.j...@dreamhost.com wrote:
 Hmm, there seem to be writes taking as long as 1.5s to hit the journal on
 osd.1...  Could you post your ceph.conf?  Might there be a problem
 with the osd.1 journal disk?
 -Sam

 On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov and...@xdel.ru wrote:
 Oh, sorry - they probably inherited permissions from the log files; fixed.

 On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just sam.j...@dreamhost.com 
 wrote:
 I get 403 Forbidden when I try to download any of the files.
 -Sam

 On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov and...@xdel.ru wrote:
 http://xdel.ru/downloads/ceph-logs/

 1/ contains logs from the bench initiated on the osd0 machine, and 2/
 from the one initiated on osd1.

 On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just sam.j...@dreamhost.com 
 wrote:
 Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
 post osd.1's logs?
 -Sam

 On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov and...@xdel.ru wrote:
 Here, please: http://xdel.ru/downloads/ceph.log.gz

 Sometimes 'cur MB/s' shows zero during rados bench, even with all debug
 output disabled and log_file set to an empty value; I hope that's okay.

 On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just sam.j...@dreamhost.com 
 wrote:
 Can you set osd and filestore debugging to 20, restart the osds, run
 rados bench as before, and post the logs?
 -Sam Just

 On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov and...@xdel.ru 
 wrote:
 rados bench 60 write -p data
 skip
 Total time run:        61.217676
 Total writes made:     989
 Write size:            4194304
 Bandwidth (MB/sec):    64.622

 Average Latency:       0.989608
 Max latency:           2.21701
 Min latency:           0.255315

 Here's a snippet from the osd log; the write size seems okay.

 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
 (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
 active+clean]  removing repgather(0x31b5360 applying 10'83 
 rep_tid=597
 wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 
 [write
 1220608~4096] 0.17eb9fd8) v4)
 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
 (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
 active+clean]    q front is repgather(0x31b5360 applying 10'83
 rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
 rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)

 Sorry for my previous question about rbd chunks, it was really 
 stupid :)

 On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin 
 josh.dur...@dreamhost.com wrote:
 On 03/19/2012 11:13 AM, Andrey Korolyov wrote:

  Nope, I'm using KVM for the rbd guests. I did notice that Sage
  mentioned the value was too small, and I changed it to 64M before posting
  the previous message, with no success - both 8M and that value cause a
  performance drop. When I 

Re: Mysteriously poor write performance

2012-03-20 Thread Andrey Korolyov
rados bench 60 write -p data
skip
Total time run:        61.217676
Total writes made:     989
Write size:            4194304
Bandwidth (MB/sec):    64.622

Average Latency:       0.989608
Max latency:           2.21701
Min latency:           0.255315
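
(As a sanity check, these numbers are internally consistent: rados bench
defaults to 16 concurrent ops, so at ~0.99 s average latency it completes
roughly 16 / 0.99 = ~16 ops/s, and at 4 MB per write that is ~65 MB/s -
matching the 64.6 MB/s reported. Throughput here is latency-bound.)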

Here's a snippet from the osd log; the write size seems okay.

2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
(0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write
1220608~4096] 0.17eb9fd8) v4)
2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
(0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
active+clean]    q front is repgather(0x31b5360 applying 10'83
rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)

Sorry for my previous question about rbd chunks, it was really stupid :)

On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin josh.dur...@dreamhost.com wrote:
 On 03/19/2012 11:13 AM, Andrey Korolyov wrote:

 Nope, I'm using KVM for the rbd guests. I did notice that Sage
 mentioned the value was too small, and I changed it to 64M before posting
 the previous message, with no success - both 8M and that value cause a
 performance drop. When I tried to write an amount of data comparable to
 the writeback cache size (both on a raw device and on ext3 with the sync
 option), I got the following results:


 I just want to clarify that the writeback window isn't a full writeback
 cache - it doesn't affect reads, and does not help with request merging etc.
 It simply allows a bunch of writes to be in flight while acking the write to
 the guest immediately. We're working on a full-fledged writeback cache
 to replace the writeback window.


 dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
 same without oflag there and in the following samples)
 10+0 records in
 10+0 records out
 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
 dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
 20+0 records in
 20+0 records out
 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
 dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
 30+0 records in
 30+0 records out
 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

 And so on. A reference test with bs=1M and count=2000 gives slightly worse
 results _with_ the writeback cache than without, as I mentioned before.
 Here are the bench results; they're almost equal on both nodes:

 bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec


 One thing to check is the size of the writes that are actually being sent by
 rbd. The guest is probably splitting them into relatively small (128 or
 256k) writes. Ideally it would be sending 4k writes, and this should be a
 lot faster.

 You can see the writes being sent by adding debug_ms=1 to the client or osd.
 The format is osd_op(.*[write OFFSET~LENGTH]).


 Also, since I haven't mentioned it before: network performance is
 enough to sustain full gigabit connectivity with MTU 1500. It does not
 seem to be an interrupt problem or anything like that - even with
 ceph-osd, the ethernet card queues and the kvm instance pinned to
 different sets of cores, nothing changes.

 On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
 gregory.far...@dreamhost.com  wrote:

 It sounds like maybe you're using Xen? The rbd writeback window option
 only works for userspace rbd implementations (eg, KVM).
  If you are using KVM, you probably want 81920000 (~80MB) rather than
  8192000 (~8MB).

 What options are you running dd with? If you run a rados bench from both
 machines, what do the results look like?
 Also, can you do the ceph osd bench on each of your OSDs, please?
 (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
 -Greg


 On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:

  More strangely, write speed drops by fifteen percent when this
  option is set in the vm's config (instead of the result from
  http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
  As I mentioned, I'm using 0.43, but due to crashed osds, ceph has been
  recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
  1468d95101adfad44247016a1399aab6b86708d2 - both builds caused crashes
  under heavy load.

  On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil s...@newdream.net
  (mailto:s...@newdream.net)  wrote:

 On Sat, 17 Mar 2012, Andrey Korolyov wrote:

 Hi,

  I've done some performance tests on the following configuration:

  mon0/osd0 and mon1/osd1 are two twelve-core R410s with 32G RAM; mon2 is
  a dom0 with three dedicated cores and 1.5G, mostly idle. The first three
  disks on each R410 are arranged into raid0 and hold the osd data, while
  the fourth holds the OS and the osd journal partition; everything
  ceph-related is mounted on ext4 without barriers.

  Firstly, I've noticed a difference between benchmark performance and
  write speed through rbd from 

Re: Mysteriously poor write performance

2012-03-20 Thread Samuel Just
Can you set osd and filestore debugging to 20, restart the osds, run
rados bench as before, and post the logs?
-Sam Just

On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov and...@xdel.ru wrote:
 rados bench 60 write -p data
 skip
 Total time run:        61.217676
 Total writes made:     989
 Write size:            4194304
 Bandwidth (MB/sec):    64.622

 Average Latency:       0.989608
 Max latency:           2.21701
 Min latency:           0.255315

 Here's a snippet from the osd log; the write size seems okay.

 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
 (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
 active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
 wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write
 1220608~4096] 0.17eb9fd8) v4)
 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
 (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
 active+clean]    q front is repgather(0x31b5360 applying 10'83
 rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
 rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)

 Sorry for my previous question about rbd chunks, it was really stupid :)

 On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin josh.dur...@dreamhost.com 
 wrote:
 On 03/19/2012 11:13 AM, Andrey Korolyov wrote:

 Nope, I'm using KVM for the rbd guests. I did notice that Sage
 mentioned the value was too small, and I changed it to 64M before posting
 the previous message, with no success - both 8M and that value cause a
 performance drop. When I tried to write an amount of data comparable to
 the writeback cache size (both on a raw device and on ext3 with the sync
 option), I got the following results:


 I just want to clarify that the writeback window isn't a full writeback
 cache - it doesn't affect reads, and does not help with request merging etc.
 It simply allows a bunch of writes to be in flight while acking the write to
 the guest immediately. We're working on a full-fledged writeback cache
 to replace the writeback window.


 dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
 same without oflag there and in the following samples)
 10+0 records in
 10+0 records out
 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
 dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
 20+0 records in
 20+0 records out
 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
 dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
 30+0 records in
 30+0 records out
 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

 And so on. A reference test with bs=1M and count=2000 gives slightly worse
 results _with_ the writeback cache than without, as I mentioned before.
 Here are the bench results; they're almost equal on both nodes:

 bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec


 One thing to check is the size of the writes that are actually being sent by
 rbd. The guest is probably splitting them into relatively small (128 or
 256k) writes. Ideally it would be sending 4k writes, and this should be a
 lot faster.

 You can see the writes being sent by adding debug_ms=1 to the client or osd.
 The format is osd_op(.*[write OFFSET~LENGTH]).


 Also, since I haven't mentioned it before: network performance is
 enough to sustain full gigabit connectivity with MTU 1500. It does not
 seem to be an interrupt problem or anything like that - even with
 ceph-osd, the ethernet card queues and the kvm instance pinned to
 different sets of cores, nothing changes.

 On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
 gregory.far...@dreamhost.com  wrote:

 It sounds like maybe you're using Xen? The rbd writeback window option
 only works for userspace rbd implementations (eg, KVM).
 If you are using KVM, you probably want 81920000 (~80MB) rather than
 8192000 (~8MB).

 What options are you running dd with? If you run a rados bench from both
 machines, what do the results look like?
 Also, can you do the ceph osd bench on each of your OSDs, please?
 (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
 -Greg


 On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:

 More strangely, write speed drops by fifteen percent when this
 option is set in the vm's config (instead of the result from
 http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
 As I mentioned, I'm using 0.43, but due to crashed osds, ceph has been
 recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
 1468d95101adfad44247016a1399aab6b86708d2 - both builds caused crashes
 under heavy load.

 On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil s...@newdream.net
 (mailto:s...@newdream.net)  wrote:

 On Sat, 17 Mar 2012, Andrey Korolyov wrote:

 Hi,

 I've done some performance tests on the following configuration:

 mon0/osd0 and mon1/osd1 are two twelve-core R410s with 32G RAM; mon2 is
 a dom0 with three dedicated cores and 1.5G, mostly idle. The first three
 disks on each R410 are arranged into raid0 and hold the osd data 

Re: Mysteriously poor write performance

2012-03-19 Thread Greg Farnum
It sounds like maybe you're using Xen? The rbd writeback window option only 
works for userspace rbd implementations (eg, KVM). 
If you are using KVM, you probably want 81920000 (~80MB) rather than
8192000 (~8MB).

What options are you running dd with? If you run a rados bench from both 
machines, what do the results look like?
Also, can you do the ceph osd bench on each of your OSDs, please? 
(http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
-Greg
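
(For reference, per the wiki page above that would be something like the
following on this two-osd cluster; results show up in the cluster log /
ceph -w as 'bench: wrote 1024 MB in blocks of 4096 KB in ... sec at ...
MB/sec':)

ceph osd tell 0 bench
ceph osd tell 1 bench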


On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:

 More strangely, write speed drops by fifteen percent when this
 option is set in the vm's config (instead of the result from
 http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
 As I mentioned, I'm using 0.43, but due to crashed osds, ceph has been
 recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
 1468d95101adfad44247016a1399aab6b86708d2 - both builds caused crashes
 under heavy load.
 
 On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil s...@newdream.net 
 (mailto:s...@newdream.net) wrote:
  On Sat, 17 Mar 2012, Andrey Korolyov wrote:
   Hi,
   
   I've done some performance tests on the following configuration:
   
   mon0/osd0 and mon1/osd1 are two twelve-core R410s with 32G RAM; mon2 is
   a dom0 with three dedicated cores and 1.5G, mostly idle. The first three
   disks on each R410 are arranged into raid0 and hold the osd data, while
   the fourth holds the OS and the osd journal partition; everything
   ceph-related is mounted on ext4 without barriers.
   
   Firstly, I've noticed a difference between benchmark performance and
   write speed through rbd from a small kvm instance running on one of the
   first two machines - while bench gave me about 110MB/s, writing zeros
   to a raw block device inside the vm with dd topped out at about 45 MB/s,
   and for the vm's fs (ext4 with default options) performance drops to
   ~23MB/s. Things get worse when I start a second vm on the second host
   and run the same dd tests simultaneously - performance is fairly divided
   in half for each instance :). Enabling jumbo frames, playing with cpu
   affinity for the ceph and vm instances, and trying different TCP
   congestion protocols had no effect at all - with DCTCP I get a slightly
   smoother network load graph, and that's all.
   
   Can the list please suggest anything to try to improve performance?
  
  Can you try setting
  
  rbd writeback window = 8192000
  
  or similar, and see what kind of effect that has? I suspect it'll speed
  up dd; I'm less sure about ext3.
  
  Thanks!
  sage
  
  
   
   ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2





Re: Mysteriously poor write performance

2012-03-19 Thread Andrey Korolyov
Nope, I'm using KVM for the rbd guests. I did notice that Sage
mentioned the value was too small, and I changed it to 64M before posting
the previous message, with no success - both 8M and that value cause a
performance drop. When I tried to write an amount of data comparable to
the writeback cache size (both on a raw device and on ext3 with the sync
option), I got the following results:
dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
same without oflag there and in the following samples)
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
20+0 records in
20+0 records out
209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
30+0 records in
30+0 records out
314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

And so on. A reference test with bs=1M and count=2000 gives slightly worse
results _with_ the writeback cache than without, as I mentioned before.
Here are the bench results; they're almost equal on both nodes:

bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

Also, since I haven't mentioned it before: network performance is
enough to sustain full gigabit connectivity with MTU 1500. It does not
seem to be an interrupt problem or anything like that - even with
ceph-osd, the ethernet card queues and the kvm instance pinned to
different sets of cores, nothing changes.

On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
gregory.far...@dreamhost.com wrote:
 It sounds like maybe you're using Xen? The rbd writeback window option only 
 works for userspace rbd implementations (eg, KVM).
 If you are using KVM, you probably want 81920000 (~80MB) rather than
 8192000 (~8MB).

 What options are you running dd with? If you run a rados bench from both 
 machines, what do the results look like?
 Also, can you do the ceph osd bench on each of your OSDs, please? 
 (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
 -Greg


 On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:

 More strangely, write speed drops by fifteen percent when this
 option is set in the vm's config (instead of the result from
 http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
 As I mentioned, I'm using 0.43, but due to crashed osds, ceph has been
 recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
 1468d95101adfad44247016a1399aab6b86708d2 - both builds caused crashes
 under heavy load.

 On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil s...@newdream.net 
 (mailto:s...@newdream.net) wrote:
  On Sat, 17 Mar 2012, Andrey Korolyov wrote:
   Hi,
  
   I've done some performance tests on the following configuration:
  
   mon0/osd0 and mon1/osd1 are two twelve-core R410s with 32G RAM; mon2 is
   a dom0 with three dedicated cores and 1.5G, mostly idle. The first three
   disks on each R410 are arranged into raid0 and hold the osd data, while
   the fourth holds the OS and the osd journal partition; everything
   ceph-related is mounted on ext4 without barriers.
  
   Firstly, I've noticed a difference between benchmark performance and
   write speed through rbd from a small kvm instance running on one of the
   first two machines - while bench gave me about 110MB/s, writing zeros
   to a raw block device inside the vm with dd topped out at about 45 MB/s,
   and for the vm's fs (ext4 with default options) performance drops to
   ~23MB/s. Things get worse when I start a second vm on the second host
   and run the same dd tests simultaneously - performance is fairly divided
   in half for each instance :). Enabling jumbo frames, playing with cpu
   affinity for the ceph and vm instances, and trying different TCP
   congestion protocols had no effect at all - with DCTCP I get a slightly
   smoother network load graph, and that's all.
  
   Can the list please suggest anything to try to improve performance?
 
  Can you try setting
 
  rbd writeback window = 8192000
 
  or similar, and see what kind of effect that has? I suspect it'll speed
  up dd; I'm less sure about ext3.
 
  Thanks!
  sage
 
 
  
   ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
 







Re: Mysteriously poor write performance

2012-03-19 Thread Josh Durgin

On 03/19/2012 11:13 AM, Andrey Korolyov wrote:

Nope, I'm using KVM for the rbd guests. I did notice that Sage
mentioned the value was too small, and I changed it to 64M before posting
the previous message, with no success - both 8M and that value cause a
performance drop. When I tried to write an amount of data comparable to
the writeback cache size (both on a raw device and on ext3 with the sync
option), I got the following results:


I just want to clarify that the writeback window isn't a full writeback 
cache - it doesn't affect reads, and does not help with request merging 
etc. It simply allows a bunch of writes to be in flight while acking the 
write to the guest immediately. We're working on a full-fledged 
writeback cache to replace the writeback window.



dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
same without oflag there and in the following samples)
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
20+0 records in
20+0 records out
209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
30+0 records in
30+0 records out
314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

And so on. A reference test with bs=1M and count=2000 gives slightly worse
results _with_ the writeback cache than without, as I mentioned before.
Here are the bench results; they're almost equal on both nodes:

bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec


One thing to check is the size of the writes that are actually being 
sent by rbd. The guest is probably splitting them into relatively small 
(128 or 256k) writes. Ideally it would be sending 4k writes, and this 
should be a lot faster.


You can see the writes being sent by adding debug_ms=1 to the client or 
osd. The format is osd_op(.*[write OFFSET~LENGTH]).
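
(A sketch of how to pull the write sizes out of such a log - the log path
assumes the 'log dir = /ceph/out' layout from the ceph.conf earlier in the
thread; the pipeline just tallies the LENGTH part of each write:)

grep -oE 'write [0-9]+~[0-9]+' /ceph/out/osd.0.log | cut -d'~' -f2 | sort -n | uniq -c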



Also, since I haven't mentioned it before: network performance is
enough to sustain full gigabit connectivity with MTU 1500. It does not
seem to be an interrupt problem or anything like that - even with
ceph-osd, the ethernet card queues and the kvm instance pinned to
different sets of cores, nothing changes.

On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
gregory.far...@dreamhost.com  wrote:

It sounds like maybe you're using Xen? The rbd writeback window option only 
works for userspace rbd implementations (eg, KVM).
If you are using KVM, you probably want 81920000 (~80MB) rather than
8192000 (~8MB).

What options are you running dd with? If you run a rados bench from both 
machines, what do the results look like?
Also, can you do the ceph osd bench on each of your OSDs, please? 
(http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
-Greg


On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:


More strangely, write speed drops by fifteen percent when this
option is set in the vm's config (instead of the result from
http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
As I mentioned, I'm using 0.43, but due to crashed osds, ceph has been
recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
1468d95101adfad44247016a1399aab6b86708d2 - both builds caused crashes
under heavy load.

On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil s...@newdream.net
(mailto:s...@newdream.net)  wrote:

On Sat, 17 Mar 2012, Andrey Korolyov wrote:

Hi,

I've done some performance tests on the following configuration:

mon0/osd0 and mon1/osd1 are two twelve-core R410s with 32G RAM; mon2 is
a dom0 with three dedicated cores and 1.5G, mostly idle. The first three
disks on each R410 are arranged into raid0 and hold the osd data, while
the fourth holds the OS and the osd journal partition; everything
ceph-related is mounted on ext4 without barriers.

Firstly, I've noticed a difference between benchmark performance and
write speed through rbd from a small kvm instance running on one of the
first two machines - while bench gave me about 110MB/s, writing zeros
to a raw block device inside the vm with dd topped out at about 45 MB/s,
and for the vm's fs (ext4 with default options) performance drops to
~23MB/s. Things get worse when I start a second vm on the second host
and run the same dd tests simultaneously - performance is fairly divided
in half for each instance :). Enabling jumbo frames, playing with cpu
affinity for the ceph and vm instances, and trying different TCP
congestion protocols had no effect at all - with DCTCP I get a slightly
smoother network load graph, and that's all.

Can the list please suggest anything to try to improve performance?


Can you try setting

rbd writeback window = 8192000

or similar, and see what kind of effect that has? I suspect it'll speed
up dd; I'm less sure about ext3.

Thanks!
sage




ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2

Re: Mysteriously poor write performance

2012-03-18 Thread Sage Weil
On Sat, 17 Mar 2012, Andrey Korolyov wrote:
 Hi,
 
 I've done some performance tests on the following configuration:
 
 mon0/osd0 and mon1/osd1 are two twelve-core R410s with 32G RAM; mon2 is
 a dom0 with three dedicated cores and 1.5G, mostly idle. The first three
 disks on each R410 are arranged into raid0 and hold the osd data, while
 the fourth holds the OS and the osd journal partition; everything
 ceph-related is mounted on ext4 without barriers.
 
 Firstly, I've noticed a difference between benchmark performance and
 write speed through rbd from a small kvm instance running on one of the
 first two machines - while bench gave me about 110MB/s, writing zeros
 to a raw block device inside the vm with dd topped out at about 45 MB/s,
 and for the vm's fs (ext4 with default options) performance drops to
 ~23MB/s. Things get worse when I start a second vm on the second host
 and run the same dd tests simultaneously - performance is fairly divided
 in half for each instance :). Enabling jumbo frames, playing with cpu
 affinity for the ceph and vm instances, and trying different TCP
 congestion protocols had no effect at all - with DCTCP I get a slightly
 smoother network load graph, and that's all.
 
 Can the list please suggest anything to try to improve performance?

Can you try setting

rbd writeback window = 8192000

or similar, and see what kind of effect that has?  I suspect it'll speed 
up dd; I'm less sure about ext3.

Thanks!
sage
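
(For anyone following along: this is a client-side option, so a sketch of
where it would go - the [client] section placement is an assumption; qemu
picks it up from the ceph.conf its rbd driver reads:)

[client]
      rbd writeback window = 8192000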


 
 ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
 
 