Re: Mysteriously poor write performance

2012-03-27 Thread Samuel Just
Sorry for the delayed reply... I've been tracking some issues which
cause high latency on our test machines, and they may be responsible for
your problems as well.  Could you retry those runs with the same
debugging and 'journal dio' set to false?

Thanks for your patience,
-Sam

On Sat, Mar 24, 2012 at 12:09 PM, Andrey Korolyov  wrote:
> http://xdel.ru/downloads/ceph-logs-dbg/
>
> On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just  wrote:
>> (CCing the list)
>>
>> Actually, could you re-do the rados bench run with 'debug journal
>> = 20' along with the other debugging?  That should give us better
>> information.
>>
>> -Sam
>>
>> On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov  wrote:
>>> Hi Sam,
>>>
>>> Can you please suggest where to start profiling the osd? If the
>>> bottleneck were related to something as simple as directio speed,
>>> I'm sure I would have caught it long ago, even just by cross-checking
>>> the results of other benchmarks on the host system. I've just tried
>>> tmpfs under both journals; it gives a small boost, as expected
>>> because of the near-zero i/o delay. Maybe the chunk distribution
>>> mechanism does not work well on such a small number of nodes, but right
>>> now I don't have enough hardware nodes to prove or disprove that.
>>>
>>> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov  wrote:
 random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
 Starting 1 process
 Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
 random-rw: (groupid=0, jobs=1): err= 0: pid=9647
  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, 
 stdev=5770.28
  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%


 On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just  
 wrote:
> Our journal writes are actually sequential.  Could you send FIO
> results for sequential 4k writes osd.0's journal and osd.1's journal?
> -Sam
>
> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov  wrote:
>> FIO output for journal partition, directio enabled, seems good(same
>> results for ext4 on other single sata disks).
>>
>> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>> Starting 1 process
>> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
>> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, 
>> stdev=480.05
>>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>> >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     issued r/w: total=0/40960, short=0/0
>>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>>     lat (msec): 500=0.04%
>>
>>
>>
>> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just  
>> wrote:
>>> (CCing the list)
>>>
>>> So, the problem isn't the bandwidth.  Before we respond to the client,
>>> we write the operation to the journal.  In this case, that operation
>>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>>> only allow a limited number of ops in flight at a time, so this
>>> latency is killing your throughput.  For comparison, the latency for
>>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>>> latency for writes to your osd.1 journal file?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
 Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
 not Megabits.

 On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  
 wrote:
> [global]
>       log dir = /ceph/out
>       log_file = ""
>       logger dir = /ceph/log
>       pid file = /ceph/out/$type$id.pid
> [mds]
>       pid file = /ceph/out/$name.pid
>       lockdep = 1

Re: Mysteriously poor write performance

2012-03-24 Thread Andrey Korolyov
http://xdel.ru/downloads/ceph-logs-dbg/

On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just  wrote:
> (CCing the list)
>
> Actually, could you re-do the rados bench run with 'debug journal
> = 20' along with the other debugging?  That should give us better
> information.
>
> -Sam
>
> On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov  wrote:
>> Hi Sam,
>>
>> Can you please suggest where to start profiling the osd? If the
>> bottleneck were related to something as simple as directio speed,
>> I'm sure I would have caught it long ago, even just by cross-checking
>> the results of other benchmarks on the host system. I've just tried
>> tmpfs under both journals; it gives a small boost, as expected
>> because of the near-zero i/o delay. Maybe the chunk distribution
>> mechanism does not work well on such a small number of nodes, but right
>> now I don't have enough hardware nodes to prove or disprove that.
>>
>> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov  wrote:
>>> random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>>> Starting 1 process
>>> Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
>>> random-rw: (groupid=0, jobs=1): err= 0: pid=9647
>>>  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
>>>    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
>>>    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
>>>  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
>>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>>> >=64=0.0%
>>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.0%
>>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.0%
>>>     issued r/w: total=0/40960, short=0/0
>>>     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
>>>     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
>>>
>>>
>>> On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just  wrote:
 Our journal writes are actually sequential.  Could you send FIO
 results for sequential 4k writes osd.0's journal and osd.1's journal?
 -Sam

 On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov  wrote:
> FIO output for journal partition, directio enabled, seems good(same
> results for ext4 on other single sata disks).
>
> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, 
> stdev=480.05
>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
> >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>     issued r/w: total=0/40960, short=0/0
>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>     lat (msec): 500=0.04%
>
>
>
> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just  
> wrote:
>> (CCing the list)
>>
>> So, the problem isn't the bandwidth.  Before we respond to the client,
>> we write the operation to the journal.  In this case, that operation
>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>> only allow a limited number of ops in flight at a time, so this
>> latency is killing your throughput.  For comparison, the latency for
>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>> latency for writes to your osd.1 journal file?
>> -Sam
>>
>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
>>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>>> not Megabits.
>>>
>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  
>>> wrote:
 [global]
       log dir = /ceph/out
       log_file = ""
       logger dir = /ceph/log
       pid file = /ceph/out/$type$id.pid
 [mds]
       pid file = /ceph/out/$name.pid
       lockdep = 1
       mds log max segments = 2
 [osd]
       lockdep = 1
       filestore_xattr_use_omap = 1
       osd data = /ceph/dev/osd$id
       osd journal = /ceph/meta/journal
       osd journal size = 100
 [mon]
       lockdep = 1
       mon data = /ceph/dev/mon$id
 [mon.0]
       host = 172.20.1.32
       mon addr = 172.20.1.32:6789
 [mon.1]
 

Re: Mysteriously poor write performance

2012-03-23 Thread Samuel Just
(CCing the list)

Actually, could you re-do the rados bench run with 'debug journal
= 20' along with the other debugging?  That should give us better
information.

-Sam

On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov  wrote:
> Hi Sam,
>
> Can you please suggest where to start profiling the osd? If the
> bottleneck were related to something as simple as directio speed,
> I'm sure I would have caught it long ago, even just by cross-checking
> the results of other benchmarks on the host system. I've just tried
> tmpfs under both journals; it gives a small boost, as expected
> because of the near-zero i/o delay. Maybe the chunk distribution
> mechanism does not work well on such a small number of nodes, but right
> now I don't have enough hardware nodes to prove or disprove that.
>
> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov  wrote:
>> random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>> Starting 1 process
>> Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
>> random-rw: (groupid=0, jobs=1): err= 0: pid=9647
>>  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
>>    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
>>    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
>>  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     issued r/w: total=0/40960, short=0/0
>>     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
>>     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
>>
>>
>> On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just  wrote:
>>> Our journal writes are actually sequential.  Could you send FIO
>>> results for sequential 4k writes osd.0's journal and osd.1's journal?
>>> -Sam
>>>
>>> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov  wrote:
 FIO output for journal partition, directio enabled, seems good(same
 results for ext4 on other single sata disks).

 random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
 Starting 1 process
 Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
 random-rw: (groupid=0, jobs=1): err= 0: pid=21926
  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
     lat (msec): 500=0.04%



 On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just  
 wrote:
> (CCing the list)
>
> So, the problem isn't the bandwidth.  Before we respond to the client,
> we write the operation to the journal.  In this case, that operation
> is taking >1s per operation on osd.1.  Both rbd and rados bench will
> only allow a limited number of ops in flight at a time, so this
> latency is killing your throughput.  For comparison, the latency for
> writing to the journal on osd.0 is < .3s.  Can you measure direct io
> latency for writes to your osd.1 journal file?
> -Sam
>
> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>> not Megabits.
>>
>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  wrote:
>>> [global]
>>>       log dir = /ceph/out
>>>       log_file = ""
>>>       logger dir = /ceph/log
>>>       pid file = /ceph/out/$type$id.pid
>>> [mds]
>>>       pid file = /ceph/out/$name.pid
>>>       lockdep = 1
>>>       mds log max segments = 2
>>> [osd]
>>>       lockdep = 1
>>>       filestore_xattr_use_omap = 1
>>>       osd data = /ceph/dev/osd$id
>>>       osd journal = /ceph/meta/journal
>>>       osd journal size = 100
>>> [mon]
>>>       lockdep = 1
>>>       mon data = /ceph/dev/mon$id
>>> [mon.0]
>>>       host = 172.20.1.32
>>>       mon addr = 172.20.1.32:6789
>>> [mon.1]
>>>       host = 172.20.1.33
>>>       mon addr = 172.20.1.33:6789
>>> [mon.2]
>>>       host = 172.20.1.35
>>>       mon addr = 172.20.1.35:6789
>>> [osd.0]
>>>       host = 172.20.1.32
>>> [osd.1]
>>> 

Re: Mysteriously poor write performance

2012-03-22 Thread Andrey Korolyov
random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
random-rw: (groupid=0, jobs=1): err= 0: pid=9647
  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
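
As a rough cross-check of the two fio runs in this thread (an editorial
sketch, not part of the original message): with ioengine=sync only one
write is outstanding at a time, so IOPS should be close to 1 / average
completion latency. The clat averages are the ones fio reported; the
helper name iops_from_clat is hypothetical.

```python
# Rough cross-check: with a synchronous engine one I/O is in flight at a time,
# so IOPS is approximately 1 / mean completion latency.
def iops_from_clat(avg_clat_usec: float) -> float:
    """Estimate IOPS at queue depth 1 from mean completion latency (usec)."""
    return 1_000_000 / avg_clat_usec


BLOCK_KB = 4  # bs=4K in both runs

runs = {
    "sequential (rw=write)": 104.19,   # avg clat from the sequential run
    "random (rw=randwrite)": 1714.52,  # avg clat from the earlier random run
}
for name, clat in runs.items():
    iops = iops_from_clat(clat)
    print(f"{name}: ~{iops:.0f} iops, ~{iops * BLOCK_KB:.0f} KB/s")
```

Both estimates land within a few percent of what fio reported (9439 and
581 iops), which suggests the per-write latency alone explains the gap
between the sequential and random results.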


On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just  wrote:
> Our journal writes are actually sequential.  Could you send FIO
> results for sequential 4k writes osd.0's journal and osd.1's journal?
> -Sam
>
> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov  wrote:
>> FIO output for journal partition, directio enabled, seems good(same
>> results for ext4 on other single sata disks).
>>
>> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>> Starting 1 process
>> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
>> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     issued r/w: total=0/40960, short=0/0
>>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>>     lat (msec): 500=0.04%
>>
>>
>>
>> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just  wrote:
>>> (CCing the list)
>>>
>>> So, the problem isn't the bandwidth.  Before we respond to the client,
>>> we write the operation to the journal.  In this case, that operation
>>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>>> only allow a limited number of ops in flight at a time, so this
>>> latency is killing your throughput.  For comparison, the latency for
>>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>>> latency for writes to your osd.1 journal file?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
 Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
 not Megabits.

 On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  wrote:
> [global]
>       log dir = /ceph/out
>       log_file = ""
>       logger dir = /ceph/log
>       pid file = /ceph/out/$type$id.pid
> [mds]
>       pid file = /ceph/out/$name.pid
>       lockdep = 1
>       mds log max segments = 2
> [osd]
>       lockdep = 1
>       filestore_xattr_use_omap = 1
>       osd data = /ceph/dev/osd$id
>       osd journal = /ceph/meta/journal
>       osd journal size = 100
> [mon]
>       lockdep = 1
>       mon data = /ceph/dev/mon$id
> [mon.0]
>       host = 172.20.1.32
>       mon addr = 172.20.1.32:6789
> [mon.1]
>       host = 172.20.1.33
>       mon addr = 172.20.1.33:6789
> [mon.2]
>       host = 172.20.1.35
>       mon addr = 172.20.1.35:6789
> [osd.0]
>       host = 172.20.1.32
> [osd.1]
>       host = 172.20.1.33
> [mds.a]
>       host = 172.20.1.32
>
> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
> /dev/mapper/system-cephmeta on /ceph/meta type ext4 
> (rw,barrier=0,user_xattr)
> Simple performance tests on those fs shows ~133Mb/s for /ceph and
> metadata/. Also both machines do not hold anything else which may
> impact osd.
>
> Also please note of following:
>
> http://i.imgur.com/ZgFdO.png
>
> First two peaks are related to running rados bench, then goes cluster
> recreation, automated debian install and final peaks are dd test.
> Surely I can have more precise graphs, but current one probably enough
> to state a situation - rbd utilizing about a quarter of possible
> bandwidth(if we can count rados bench as 100%).
>
> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just  
> wrote:
>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>> with the osd.1 jo

Re: Mysteriously poor write performance

2012-03-22 Thread Samuel Just
Our journal writes are actually sequential.  Could you send FIO
results for sequential 4k writes osd.0's journal and osd.1's journal?
-Sam

On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov  wrote:
> FIO output for journal partition, directio enabled, seems good(same
> results for ext4 on other single sata disks).
>
> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     issued r/w: total=0/40960, short=0/0
>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>     lat (msec): 500=0.04%
>
>
>
> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just  wrote:
>> (CCing the list)
>>
>> So, the problem isn't the bandwidth.  Before we respond to the client,
>> we write the operation to the journal.  In this case, that operation
>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>> only allow a limited number of ops in flight at a time, so this
>> latency is killing your throughput.  For comparison, the latency for
>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>> latency for writes to your osd.1 journal file?
>> -Sam
>>
>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
>>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>>> not Megabits.
>>>
>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  wrote:
 [global]
       log dir = /ceph/out
       log_file = ""
       logger dir = /ceph/log
       pid file = /ceph/out/$type$id.pid
 [mds]
       pid file = /ceph/out/$name.pid
       lockdep = 1
       mds log max segments = 2
 [osd]
       lockdep = 1
       filestore_xattr_use_omap = 1
       osd data = /ceph/dev/osd$id
       osd journal = /ceph/meta/journal
       osd journal size = 100
 [mon]
       lockdep = 1
       mon data = /ceph/dev/mon$id
 [mon.0]
       host = 172.20.1.32
       mon addr = 172.20.1.32:6789
 [mon.1]
       host = 172.20.1.33
       mon addr = 172.20.1.33:6789
 [mon.2]
       host = 172.20.1.35
       mon addr = 172.20.1.35:6789
 [osd.0]
       host = 172.20.1.32
 [osd.1]
       host = 172.20.1.33
 [mds.a]
       host = 172.20.1.32

 /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
 /dev/mapper/system-cephmeta on /ceph/meta type ext4 
 (rw,barrier=0,user_xattr)
 Simple performance tests on those fs shows ~133Mb/s for /ceph and
 metadata/. Also both machines do not hold anything else which may
 impact osd.

 Also please note of following:

 http://i.imgur.com/ZgFdO.png

 First two peaks are related to running rados bench, then goes cluster
 recreation, automated debian install and final peaks are dd test.
 Surely I can have more precise graphs, but current one probably enough
 to state a situation - rbd utilizing about a quarter of possible
 bandwidth(if we can count rados bench as 100%).

 On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just  
 wrote:
> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
> osd.1...  Could you post your ceph.conf?  Might there be a problem
> with the osd.1 journal disk?
> -Sam
>
> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov  wrote:
>> Oh, sorry - they probably inherited rights from log files, fixed.
>>
>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just  
>> wrote:
>>> I get 403 Forbidden when I try to download any of the files.
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov  
>>> wrote:
 http://xdel.ru/downloads/ceph-logs/

 1/ contains logs related to bench initiated at the osd0 machine and 2/
 - at osd1.

 On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just  
 wrote:
> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
> post osd.1's logs?
> -Sam
>
> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov  
> wrote:
>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>
>> Sometimes 'cur MB/s ' shows zero during rados bench, even if any 
>> debug
>> output disabled and log_file set to 

Re: Mysteriously poor write performance

2012-03-21 Thread Samuel Just
(CCing the list)

So, the problem isn't the bandwidth.  Before we respond to the client,
we write the operation to the journal.  In this case, that operation
is taking >1s per operation on osd.1.  Both rbd and rados bench will
only allow a limited number of ops in flight at a time, so this
latency is killing your throughput.  For comparison, the latency for
writing to the journal on osd.0 is < .3s.  Can you measure direct io
latency for writes to your osd.1 journal file?
-Sam
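
The effect described above can be approximated with Little's law (an
editorial sketch, not part of the original message): throughput is about
ops in flight * op size / latency. The concurrency of 16 is an
assumption (the rados bench default for -t); the write size and average
latency are taken from the rados bench output quoted further down in
this message.

```python
# Little's law sketch: with a fixed number of ops in flight, throughput is
# bounded by concurrency * op_size / latency, regardless of disk bandwidth.
def max_throughput_mb_s(in_flight: int, op_size_bytes: int, latency_s: float) -> float:
    """Upper bound on throughput in MB/s (1 MB = 2**20 bytes)."""
    return in_flight * op_size_bytes / latency_s / 2**20


IN_FLIGHT = 16          # assumption: rados bench default concurrency (-t 16)
WRITE_SIZE = 4194304    # "Write size" from the rados bench output
AVG_LATENCY = 0.989608  # "Average Latency" from the same run, in seconds

print(f"{max_throughput_mb_s(IN_FLIGHT, WRITE_SIZE, AVG_LATENCY):.1f} MB/s")
```

Under that assumption the estimate comes out close to the 64.622 MB/s
that rados bench reported, which is consistent with latency rather than
bandwidth being the limit.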

On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
> not Megabits.
>
> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  wrote:
>> [global]
>>       log dir = /ceph/out
>>       log_file = ""
>>       logger dir = /ceph/log
>>       pid file = /ceph/out/$type$id.pid
>> [mds]
>>       pid file = /ceph/out/$name.pid
>>       lockdep = 1
>>       mds log max segments = 2
>> [osd]
>>       lockdep = 1
>>       filestore_xattr_use_omap = 1
>>       osd data = /ceph/dev/osd$id
>>       osd journal = /ceph/meta/journal
>>       osd journal size = 100
>> [mon]
>>       lockdep = 1
>>       mon data = /ceph/dev/mon$id
>> [mon.0]
>>       host = 172.20.1.32
>>       mon addr = 172.20.1.32:6789
>> [mon.1]
>>       host = 172.20.1.33
>>       mon addr = 172.20.1.33:6789
>> [mon.2]
>>       host = 172.20.1.35
>>       mon addr = 172.20.1.35:6789
>> [osd.0]
>>       host = 172.20.1.32
>> [osd.1]
>>       host = 172.20.1.33
>> [mds.a]
>>       host = 172.20.1.32
>>
>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
>> Simple performance tests on those fs shows ~133Mb/s for /ceph and
>> metadata/. Also both machines do not hold anything else which may
>> impact osd.
>>
>> Also please note of following:
>>
>> http://i.imgur.com/ZgFdO.png
>>
>> First two peaks are related to running rados bench, then goes cluster
>> recreation, automated debian install and final peaks are dd test.
>> Surely I can have more precise graphs, but current one probably enough
>> to state a situation - rbd utilizing about a quarter of possible
>> bandwidth(if we can count rados bench as 100%).
>>
>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just  wrote:
>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>>> with the osd.1 journal disk?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov  wrote:
 Oh, sorry - they probably inherited rights from log files, fixed.

 On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just  
 wrote:
> I get 403 Forbidden when I try to download any of the files.
> -Sam
>
> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov  wrote:
>> http://xdel.ru/downloads/ceph-logs/
>>
>> 1/ contains logs related to bench initiated at the osd0 machine and 2/
>> - at osd1.
>>
>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just  
>> wrote:
>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
>>> post osd.1's logs?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov  wrote:
 Here, please: http://xdel.ru/downloads/ceph.log.gz

 Sometimes 'cur MB/s ' shows zero during rados bench, even if any debug
 output disabled and log_file set to the empty value, hope it`s okay.

 On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just  
 wrote:
> Can you set osd and filestore debugging to 20, restart the osds, run
> rados bench as before, and post the logs?
> -Sam Just
>
> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov  
> wrote:
>> rados bench 60 write -p data
>> 
>> Total time run:        61.217676
>> Total writes made:     989
>> Write size:            4194304
>> Bandwidth (MB/sec):    64.622
>>
>> Average Latency:       0.989608
>> Max latency:           2.21701
>> Min latency:           0.255315
>>
>> Here is a snippet from the osd log; the write size seems okay.
>>
>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>> active+clean]  removing repgather(0x31b5360 applying 10'83 
>> rep_tid=597
>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 
>> [write
>> 1220608~4096] 0.17eb9fd8) v4)
>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>> active+clean]    q front is repgather(0x31b5360 applying 10'83
>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
>> rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)
>>
>> So

Re: Mysteriously poor write performance

2012-03-20 Thread Samuel Just
Can you set osd and filestore debugging to 20, restart the osds, run
rados bench as before, and post the logs?
-Sam Just

On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov  wrote:
> rados bench 60 write -p data
> 
> Total time run:        61.217676
> Total writes made:     989
> Write size:            4194304
> Bandwidth (MB/sec):    64.622
>
> Average Latency:       0.989608
> Max latency:           2.21701
> Min latency:           0.255315
>
> Here is a snippet from the osd log; the write size seems okay.
>
> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
> active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write
> 1220608~4096] 0.17eb9fd8) v4)
> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
> active+clean]    q front is repgather(0x31b5360 applying 10'83
> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
> rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)
>
> Sorry for my previous question about rbd chunks, it was really stupid :)
>
> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin  
> wrote:
>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>
>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>>> mentioned too small value and I`ve changed it to 64M before posting
>>> previous message with no success - both 8M and this value cause a
>>> performance drop. When I tried to wrote small amount of data that can
>>> be compared to writeback cache size(both on raw device and ext3 with
>>> sync option), following results were made:
>>
>>
>> I just want to clarify that the writeback window isn't a full writeback
>> cache - it doesn't affect reads, and does not help with request merging etc.
>> It simply allows a bunch of writes to be in flight while acking the write to
>> the guest immediately. We're working on a full-fledged writeback cache
>> to replace the writeback window.
>>
>>
>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>>> same without oflag there and in the following samples)
>>> 10+0 records in
>>> 10+0 records out
>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>> 20+0 records in
>>> 20+0 records out
>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>> 30+0 records in
>>> 30+0 records out
>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>
>>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>>> results _with_ writeback cache than without, as I`ve mentioned before.
>>>  Here the bench results, they`re almost equal on both nodes:
>>>
>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>>
>>
>> One thing to check is the size of the writes that are actually being sent by
>> rbd. The guest is probably splitting them into relatively small (128 or
>> 256k) writes. Ideally it would be sending 4k writes, and this should be a
>> lot faster.
>>
>> You can see the writes being sent by adding debug_ms=1 to the client or osd.
>> The format is osd_op(.*[write OFFSET~LENGTH]).
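
A minimal sketch of pulling those write sizes out of a log (editorial
addition, not from the original message). It assumes each log entry sits
on a single line, as in the osd log excerpt elsewhere in this thread;
the function name write_extents is hypothetical.

```python
import re

# Pattern following the format described above:
# osd_op(... [write OFFSET~LENGTH] ...)
WRITE_OP = re.compile(r"osd_op\(.*?\[write\s+(\d+)~(\d+)\]")


def write_extents(log_text: str) -> list[tuple[int, int]]:
    """Return (offset, length) for every write op found in the log text."""
    return [(int(off), int(length)) for off, length in WRITE_OP.findall(log_text)]


# Sample fragment taken from the osd log excerpt in this thread.
sample = "op=osd_op(client.4599.0:2533 rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)"
for offset, length in write_extents(sample):
    print(f"offset={offset} length={length}")
```

Collecting the lengths over a whole run would show whether the guest is
splitting its writes into small requests.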
>>
>>
>>> Also, because I`ve not mentioned it before, network performance is
>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>>> is not interrupt problem or something like it - even if ceph-osd,
>>> ethernet card queues and kvm instance pinned to different sets of
>>> cores, nothing changes.
>>>
>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>>   wrote:

>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>>> only works for userspace rbd implementations (eg, KVM).
>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>>> 8192000 (~8MB).
>>>>
>>>> What options are you running dd with? If you run a rados bench from both
>>>> machines, what do the results look like?
>>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>>> -Greg
>>>>
>>>>
>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>>
>>>>> More strangely, writing speed drops down by fifteen percent when this
>>>>> option was set in vm` config(instead of result from
>>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>>> under heavy load.
>>>>>
>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net) wrote:
>>>>>>
>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I`ve did some performance test

Re: Mysteriously poor write performance

2012-03-20 Thread Andrey Korolyov
rados bench 60 write -p data

Total time run:61.217676
Total writes made: 989
Write size:4194304
Bandwidth (MB/sec):64.622

Average Latency:   0.989608
Max latency:   2.21701
Min latency:   0.255315
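
As a sanity check on the summary above, the reported bandwidth can be recomputed from the write count, write size and run time. This is only a sketch: the variable names are illustrative, and the in-flight estimate assumes the bench ran with its default of 16 concurrent operations, which the thread does not state.

```python
# Figures copied from the rados bench summary above.
total_time_s = 61.217676
writes_made = 989
write_size_bytes = 4194304      # 4 MiB per write
avg_latency_s = 0.989608

# Bandwidth as rados bench reports it: MiB written divided by elapsed seconds.
total_mib = writes_made * write_size_bytes / 2**20
bandwidth = total_mib / total_time_s

# Little's law: average operations in flight = throughput (ops/s) * latency (s).
ops_per_s = writes_made / total_time_s
in_flight = ops_per_s * avg_latency_s

print(f"{total_mib:.0f} MiB in {total_time_s:.1f} s = {bandwidth:.3f} MB/s")
print(f"{ops_per_s:.2f} ops/s, about {in_flight:.1f} writes in flight")
```

The recomputed bandwidth agrees with the reported 64.622, and the in-flight estimate comes out very close to 16, so the average latency of about one second is what you would expect if 16 writes were kept queued at this throughput.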

Here is a snippet from the OSD log; the write size seems okay.

2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
(0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write
1220608~4096] 0.17eb9fd8) v4)
2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
(0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
active+clean]q front is repgather(0x31b5360 applying 10'83
rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)
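
For longer logs, the write sizes can be pulled out mechanically using the osd_op(.*[write OFFSET~LENGTH]) format quoted later in this message. A minimal sketch follows; the function name and the abbreviated sample (leading fields replaced by "...") are illustrative, and the whitespace normalisation is only there because the mail wrapped the log lines.

```python
import re
from collections import Counter

# Matches the write op inside an osd_op(...) message: [write OFFSET~LENGTH]
WRITE_RE = re.compile(r"osd_op\(.*?\[write (\d+)~(\d+)\]")

def write_sizes(log_text):
    """Return (offset, length) pairs for every write op found in log_text."""
    # Collapse line wraps so an op split across lines is matched as one record.
    flat = " ".join(log_text.split())
    return [(int(off), int(length)) for off, length in WRITE_RE.findall(flat)]

# The two records from the excerpt above, wrapped as they appear in the mail.
sample = """\
... removing repgather(0x31b5360 applying 10'83 rep_tid=597
wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write
1220608~4096] 0.17eb9fd8) v4)
... q front is repgather(0x31b5360 applying 10'83
rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)
"""

pairs = write_sizes(sample)
histogram = Counter(length for _, length in pairs)
print(pairs)
print(histogram)
```

Both records in the excerpt describe the same 4096-byte write at offset 1220608; a histogram over a longer log would show how the guest's requests are actually being split.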

Sorry for my previous question about rbd chunks, it was really stupid :)

On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin  wrote:
> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>
>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>> mentioned too small value and I`ve changed it to 64M before posting
>> previous message with no success - both 8M and this value cause a
>> performance drop. When I tried to wrote small amount of data that can
>> be compared to writeback cache size(both on raw device and ext3 with
>> sync option), following results were made:
>
>
> I just want to clarify that the writeback window isn't a full writeback
> cache - it doesn't affect reads, and does not help with request merging etc.
> It simply allows a bunch of writes to be in flight while acking the write to
> the guest immediately. We're working on a full-fledged writeback cache
> to replace the writeback window.
>
>
>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>> same without oflag there and in the following samples)
>> 10+0 records in
>> 10+0 records out
>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>> 20+0 records in
>> 20+0 records out
>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>> 30+0 records in
>> 30+0 records out
>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>
>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>> results _with_ writeback cache than without, as I`ve mentioned before.
>>  Here the bench results, they`re almost equal on both nodes:
>>
>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>
>
> One thing to check is the size of the writes that are actually being sent by
> rbd. The guest is probably splitting them into relatively small (128 or
> 256k) writes. Ideally it would be sending 4k writes, and this should be a
> lot faster.
>
> You can see the writes being sent by adding debug_ms=1 to the client or osd.
> The format is osd_op(.*[write OFFSET~LENGTH]).
>
>
>> Also, because I`ve not mentioned it before, network performance is
>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>> is not interrupt problem or something like it - even if ceph-osd,
>> ethernet card queues and kvm instance pinned to different sets of
>> cores, nothing changes.
>>
>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>   wrote:
>>>
>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>> only works for userspace rbd implementations (eg, KVM).
>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>> 8192000 (~8MB).
>>>
>>> What options are you running dd with? If you run a rados bench from both
>>> machines, what do the results look like?
>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>> -Greg
>>>
>>>
>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>
>>>> More strangely, writing speed drops down by fifteen percent when this
>>>> option was set in vm` config(instead of result from
>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>> under heavy load.
>>>>
>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net) wrote:
>>>>>
>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>
>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>> holds os and osd` journal partition, all

Re: Mysteriously poor write performance

2012-03-19 Thread Andrey Korolyov
Thanks to Greg, I have noticed a very strange thing: the data pool is
filled with a bunch of objects like rb.0.0.04db with a typical size of
4194304, while the original pool for the guest OS has a size of only 112
(created as 40g). It seems that something went wrong, because on 0.42 I
had more impressive performance on cheaper hardware. At first I blamed
the recent crash and recreated the cluster from scratch about an hour ago,
but those objects were created in a bare data/ pool with only one VM.
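
The 4194304-byte objects are consistent with how RBD stores an image, as far as I understand it: the image is split into fixed-size objects that are created only once they are written to. A rough sketch of the arithmetic, assuming that "40g" means 40 GiB and that 4194304 bytes is the image's object size (neither is confirmed in the thread):

```python
# Assumptions: "40g" is read as 40 GiB, and every RBD object is 4194304 bytes.
image_size_bytes = 40 * 2**30
object_size_bytes = 4194304

# Upper bound on the number of rb.* objects a fully written image would need.
max_objects = image_size_bytes // object_size_bytes

# Number of objects touched by writing 1 GiB sequentially.
objects_per_gib = 2**30 // object_size_bytes

print(max_objects, objects_per_gib)
```

Under these assumptions a fully written image would occupy 10240 objects, and each GiB written by dd inside the guest would create or touch 256 of them, so a pool full of 4 MiB objects after a few test runs is not by itself a sign of a problem.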




On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin  wrote:
> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>
>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>> mentioned too small value and I`ve changed it to 64M before posting
>> previous message with no success - both 8M and this value cause a
>> performance drop. When I tried to wrote small amount of data that can
>> be compared to writeback cache size(both on raw device and ext3 with
>> sync option), following results were made:
>
>
> I just want to clarify that the writeback window isn't a full writeback
> cache - it doesn't affect reads, and does not help with request merging etc.
> It simply allows a bunch of writes to be in flight while acking the write to
> the guest immediately. We're working on a full-fledged writeback cache
> to replace the writeback window.
>
>
>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>> same without oflag there and in the following samples)
>> 10+0 records in
>> 10+0 records out
>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>> 20+0 records in
>> 20+0 records out
>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>> 30+0 records in
>> 30+0 records out
>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>
>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>> results _with_ writeback cache than without, as I`ve mentioned before.
>>  Here the bench results, they`re almost equal on both nodes:
>>
>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>
>
> One thing to check is the size of the writes that are actually being sent by
> rbd. The guest is probably splitting them into relatively small (128 or
> 256k) writes. Ideally it would be sending 4k writes, and this should be a
> lot faster.
>
> You can see the writes being sent by adding debug_ms=1 to the client or osd.
> The format is osd_op(.*[write OFFSET~LENGTH]).
>
>
>> Also, because I`ve not mentioned it before, network performance is
>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>> is not interrupt problem or something like it - even if ceph-osd,
>> ethernet card queues and kvm instance pinned to different sets of
>> cores, nothing changes.
>>
>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>   wrote:
>>>
>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>> only works for userspace rbd implementations (eg, KVM).
>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>> 8192000 (~8MB).
>>>
>>> What options are you running dd with? If you run a rados bench from both
>>> machines, what do the results look like?
>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>> -Greg
>>>
>>>
>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>
>>>> More strangely, writing speed drops down by fifteen percent when this
>>>> option was set in vm` config(instead of result from
>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>> under heavy load.
>>>>
>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net) wrote:
>>>>>
>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>
>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>>> the ext4 without barriers.
>>>>>>
>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>> Things get worse, when I`ve started second vm at second hos

Re: Mysteriously poor write performance

2012-03-19 Thread Josh Durgin

On 03/19/2012 11:13 AM, Andrey Korolyov wrote:

> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
> mentioned too small value and I`ve changed it to 64M before posting
> previous message with no success - both 8M and this value cause a
> performance drop. When I tried to wrote small amount of data that can
> be compared to writeback cache size(both on raw device and ext3 with
> sync option), following results were made:


I just want to clarify that the writeback window isn't a full writeback
cache - it doesn't affect reads, and does not help with request merging
etc. It simply allows a bunch of writes to be in flight while acking the
write to the guest immediately. We're working on a full-fledged
writeback cache to replace the writeback window.



> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
> same without oflag there and in the following samples)
> 10+0 records in
> 10+0 records out
> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
> 20+0 records in
> 20+0 records out
> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
> 30+0 records in
> 30+0 records out
> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>
> and so on. Reference test with bs=1M and count=2000 has slightly worse
> results _with_ writeback cache than without, as I`ve mentioned before.
>   Here the bench results, they`re almost equal on both nodes:
>
> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec


One thing to check is the size of the writes that are actually being 
sent by rbd. The guest is probably splitting them into relatively small 
(128 or 256k) writes. Ideally it would be sending 4k writes, and this 
should be a lot faster.


You can see the writes being sent by adding debug_ms=1 to the client or 
osd. The format is osd_op(.*[write OFFSET~LENGTH]).



> Also, because I`ve not mentioned it before, network performance is
> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
> is not interrupt problem or something like it - even if ceph-osd,
> ethernet card queues and kvm instance pinned to different sets of
> cores, nothing changes.
>
> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>   wrote:
>>
>> It sounds like maybe you're using Xen? The "rbd writeback window" option only
>> works for userspace rbd implementations (eg, KVM).
>> If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000
>> (~8MB).
>>
>> What options are you running dd with? If you run a rados bench from both
>> machines, what do the results look like?
>> Also, can you do the ceph osd bench on each of your OSDs, please?
>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>> -Greg


>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>
>>> More strangely, writing speed drops down by fifteen percent when this
>>> option was set in vm` config(instead of result from
>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>> under heavy load.
>>>
>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net) wrote:
>>>>
>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I`ve did some performance tests at the following configuration:
>>>>>
>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>> the ext4 without barriers.
>>>>>
>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>> write speed through rbd from small kvm instance running on one of
>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>> network load graph and that`s all.
>>>>>
>>>>> Can ml please suggest anything to try to improve performance?
>>>>
>>>> Can you try setting
>>>>
>>>> rbd writeback window = 8192000
>>>>
>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>> up dd; I'm less sure about ext3.
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majord...@vger.kernel.org
>>>>> (mailto:majord...@vger.kernel.org)
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: Mysteriously poor write performance

2012-03-19 Thread Greg Farnum
On Monday, March 19, 2012 at 11:13 AM, Andrey Korolyov wrote:
> Nope, I`m using KVM for rbd guests.

Ah, okay — I'm not sure what your reference to dom0 and mon2 meant, then?
  
> Surely I`ve been noticed that Sage
> mentioned too small value and I`ve changed it to 64M before posting
> previous message with no success - both 8M and this value cause a
> performance drop. When I tried to wrote small amount of data that can
> be compared to writeback cache size(both on raw device and ext3 with
> sync option), following results were made:
> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
> same without oflag there and in the following samples)
> 10+0 records in
> 10+0 records out
> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
> 20+0 records in
> 20+0 records out
> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
> 30+0 records in
> 30+0 records out
> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>  
> and so on. Reference test with bs=1M and count=2000 has slightly worse
> results _with_ writeback cache than without, as I`ve mentioned before.
> Here the bench results, they`re almost equal on both nodes:
>  
> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
Okay, this is all a little odd to me. Can you send along your ceph.conf (along 
with any other pool config changes you've made) and the output from a rados 
bench (60 seconds or so)?
-Greg
  
>  
> Also, because I`ve not mentioned it before, network performance is
> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
> is not interrupt problem or something like it - even if ceph-osd,
> ethernet card queues and kvm instance pinned to different sets of
> cores, nothing changes.
>  
> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
> (mailto:gregory.far...@dreamhost.com) wrote:
> > It sounds like maybe you're using Xen? The "rbd writeback window" option 
> > only works for userspace rbd implementations (eg, KVM).
> > If you are using KVM, you probably want 81920000 (~80MB) rather than
> > 8192000 (~8MB).
> >  
> > What options are you running dd with? If you run a rados bench from both 
> > machines, what do the results look like?
> > Also, can you do the ceph osd bench on each of your OSDs, please? 
> > (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
> > -Greg
> >  
> >  
> > On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
> >  
> > > More strangely, writing speed drops down by fifteen percent when this
> > > option was set in vm` config(instead of result from
> > > http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
> > > As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
> > > recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
> > > 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
> > > under heavy load.
> > >  
> > > On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net) wrote:
> > > > On Sat, 17 Mar 2012, Andrey Korolyov wrote:
> > > > > Hi,
> > > > >  
> > > > > I`ve did some performance tests at the following configuration:
> > > > >  
> > > > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
> > > > > dom0 with three dedicated cores and 1.5G, mostly idle. First three
> > > > > disks on each r410 arranged into raid0 and holds osd data when fourth
> > > > > holds os and osd` journal partition, all ceph-related stuff mounted on
> > > > > the ext4 without barriers.
> > > > >  
> > > > > Firstly, I`ve noticed about a difference of benchmark performance and
> > > > > write speed through rbd from small kvm instance running on one of
> > > > > first two machines - when bench gave me about 110Mb/s, writing zeros
> > > > > to raw block device inside vm with dd was at top speed about 45 mb/s,
> > > > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
> > > > > Things get worse, when I`ve started second vm at second host and tried
> > > > > to continue same dd tests simultaneously - performance fairly divided
> > > > > by half for each instance :). Enabling jumbo frames, playing with cpu
> > > > > affinity for ceph and vm instances and trying different TCP congestion
> > > > > protocols gave no effect at all - with DCTCP I have slightly smoother
> > > > > network load graph and that`s all.
> > > > >  
> > > > > Can ml please suggest anything to try to improve performance?
> > > >  
> > > > Can you try setting
> > > >  
> > > > rbd writeback window = 8192000
> > > >  
> > > > or similar, and see what kind of effect that has? I suspect it'll speed
> > > > up dd; I'm less sure about ext3.
> > > >  
> > > > Thanks!
> > > > sage
> > > >  
> > > >  
> > > > >  
> > > > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> 

Re: Mysteriously poor write performance

2012-03-19 Thread Andrey Korolyov
Nope, I'm using KVM for rbd guests. I did notice that Sage mentioned too
small a value, and I changed it to 64M before posting my previous message,
with no success: both 8M and this value cause a performance drop. When I
tried to write a small amount of data, comparable to the writeback cache
size (both on the raw device and on ext3 with the sync option), I got the
following results:
dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
same without oflag there and in the following samples)
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
20+0 records in
20+0 records out
209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
30+0 records in
30+0 records out
314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

and so on. A reference test with bs=1M and count=2000 gives slightly worse
results _with_ the writeback cache than without, as I've mentioned before.
Here are the bench results; they're almost equal on both nodes:

bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

Also, because I haven't mentioned it before: network performance is
enough to hold fair gigabit connectivity with MTU 1500. It does not seem
to be an interrupt problem or anything like it - even with ceph-osd, the
ethernet card queues and the kvm instance pinned to different sets of
cores, nothing changes.
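
The three dd runs can be read a little more closely by recomputing the rates and looking at what each additional 100 MiB costs. This is a sketch under one loud assumption: the runs are separate invocations, so treating the differences between them as an incremental rate is an inference, not a measurement. The helper name is illustrative.

```python
# (bytes copied, seconds elapsed) from the three dd runs above.
runs = [
    (104857600, 0.864404),
    (209715200, 6.67271),
    (314572800, 12.4806),
]

def mb_per_s(nbytes, seconds):
    """Throughput in decimal MB/s, the unit dd reports."""
    return nbytes / 1e6 / seconds

# Overall rate of each run; these should match dd's own figures.
overall = [mb_per_s(n, s) for n, s in runs]

# Assumption: the extra time of a longer run is the cost of the extra data.
incremental = [
    mb_per_s(n2 - n1, s2 - s1)
    for (n1, s1), (n2, s2) in zip(runs, runs[1:])
]

print([f"{r:.1f}" for r in overall])
print([f"{r:.1f}" for r in incremental])
```

Read this way, the first 100 MiB completes at roughly the rate the bench reports, and every further 100 MiB takes about 5.8 seconds, or roughly 18 MB/s. That pattern would fit a buffer absorbing the first burst and a much slower sustained path behind it, though the thread does not confirm which layer that is.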

On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
 wrote:
> It sounds like maybe you're using Xen? The "rbd writeback window" option only 
> works for userspace rbd implementations (eg, KVM).
> If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000
> (~8MB).
>
> What options are you running dd with? If you run a rados bench from both 
> machines, what do the results look like?
> Also, can you do the ceph osd bench on each of your OSDs, please? 
> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
> -Greg
>
>
> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>
>> More strangely, writing speed drops down by fifteen percent when this
>> option was set in vm` config(instead of result from
>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>> under heavy load.
>>
>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net) wrote:
>> > On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>> > > Hi,
>> > >
>> > > I`ve did some performance tests at the following configuration:
>> > >
>> > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>> > > dom0 with three dedicated cores and 1.5G, mostly idle. First three
>> > > disks on each r410 arranged into raid0 and holds osd data when fourth
>> > > holds os and osd` journal partition, all ceph-related stuff mounted on
>> > > the ext4 without barriers.
>> > >
>> > > Firstly, I`ve noticed about a difference of benchmark performance and
>> > > write speed through rbd from small kvm instance running on one of
>> > > first two machines - when bench gave me about 110Mb/s, writing zeros
>> > > to raw block device inside vm with dd was at top speed about 45 mb/s,
>> > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>> > > Things get worse, when I`ve started second vm at second host and tried
>> > > to continue same dd tests simultaneously - performance fairly divided
>> > > by half for each instance :). Enabling jumbo frames, playing with cpu
>> > > affinity for ceph and vm instances and trying different TCP congestion
>> > > protocols gave no effect at all - with DCTCP I have slightly smoother
>> > > network load graph and that`s all.
>> > >
>> > > Can ml please suggest anything to try to improve performance?
>> >
>> > Can you try setting
>> >
>> > rbd writeback window = 8192000
>> >
>> > or similar, and see what kind of effect that has? I suspect it'll speed
>> > up dd; I'm less sure about ext3.
>> >
>> > Thanks!
>> > sage
>> >
>> >
>> > >
>> > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2


Re: Mysteriously poor write performance

2012-03-19 Thread Greg Farnum
It sounds like maybe you're using Xen? The "rbd writeback window" option only 
works for userspace rbd implementations (eg, KVM). 
If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000
(~8MB).
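
Since the option value is a plain byte count, the sizes being discussed are easy to misread by a factor of ten. A small sketch of the conversion, assuming the value is interpreted as bytes (which the ~8MB gloss for 8192000 implies); the helper name is illustrative, and 64 MiB is included only as another plausible setting:

```python
def describe(nbytes):
    """Render a byte count in decimal MB and binary MiB."""
    return f"{nbytes} bytes = {nbytes / 1e6:.1f} MB = {nbytes / 2**20:.1f} MiB"

for value in (8192000, 81920000, 64 * 2**20):
    print(describe(value))
```

So 8192000 is about 8 MB, and a window of about 80 MB needs one more zero: 81920000.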

What options are you running dd with? If you run a rados bench from both 
machines, what do the results look like?
Also, can you do the ceph osd bench on each of your OSDs, please? 
(http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
-Greg


On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:

> More strangely, writing speed drops down by fifteen percent when this
> option was set in vm` config(instead of result from
> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
> under heavy load.
> 
> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net) wrote:
> > On Sat, 17 Mar 2012, Andrey Korolyov wrote:
> > > Hi,
> > > 
> > > I`ve did some performance tests at the following configuration:
> > > 
> > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
> > > dom0 with three dedicated cores and 1.5G, mostly idle. First three
> > > disks on each r410 arranged into raid0 and holds osd data when fourth
> > > holds os and osd` journal partition, all ceph-related stuff mounted on
> > > the ext4 without barriers.
> > > 
> > > Firstly, I`ve noticed about a difference of benchmark performance and
> > > write speed through rbd from small kvm instance running on one of
> > > first two machines - when bench gave me about 110Mb/s, writing zeros
> > > to raw block device inside vm with dd was at top speed about 45 mb/s,
> > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
> > > Things get worse, when I`ve started second vm at second host and tried
> > > to continue same dd tests simultaneously - performance fairly divided
> > > by half for each instance :). Enabling jumbo frames, playing with cpu
> > > affinity for ceph and vm instances and trying different TCP congestion
> > > protocols gave no effect at all - with DCTCP I have slightly smoother
> > > network load graph and that`s all.
> > > 
> > > Can ml please suggest anything to try to improve performance?
> > 
> > Can you try setting
> > 
> > rbd writeback window = 8192000
> > 
> > or similar, and see what kind of effect that has? I suspect it'll speed
> > up dd; I'm less sure about ext3.
> > 
> > Thanks!
> > sage
> > 
> > 
> > > 
> > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2


Re: Mysteriously poor write performance

2012-03-19 Thread Andrey Korolyov
More strangely, write speed drops by fifteen percent when this option is
set in the VM's config (compared with the result from
http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
As I mentioned, I'm using 0.43, but due to crashed OSDs, ceph has been
recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
under heavy load.

On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil  wrote:
> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>> Hi,
>>
>> I`ve did some performance tests at the following configuration:
>>
>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>> disks on each r410 arranged into raid0 and holds osd data when fourth
>> holds os and osd` journal partition, all ceph-related stuff mounted on
>> the ext4 without barriers.
>>
>> Firstly, I`ve noticed about a difference of benchmark performance and
>> write speed through rbd from small kvm instance running on one of
>> first two machines - when bench gave me about 110Mb/s, writing zeros
>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>> Things get worse, when I`ve started second vm at second host and tried
>> to continue same dd tests simultaneously - performance fairly divided
>> by half for each instance :). Enabling jumbo frames, playing with cpu
>> affinity for ceph and vm instances and trying different TCP congestion
>> protocols gave no effect at all - with DCTCP I have slightly smoother
>> network load graph and that`s all.
>>
>> Can ml please suggest anything to try to improve performance?
>
> Can you try setting
>
>        rbd writeback window = 8192000
>
> or similar, and see what kind of effect that has?  I suspect it'll speed
> up dd; I'm less sure about ext3.
>
> Thanks!
> sage
>
>
>>
>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2


Re: Mysteriously poor write performance

2012-03-18 Thread Sage Weil
On Sat, 17 Mar 2012, Andrey Korolyov wrote:
> Hi,
> 
> I've run some performance tests on the following configuration:
>
> mon0, osd0 and mon1, osd1 - two twelve-core R410s with 32G RAM; mon2 -
> a dom0 with three dedicated cores and 1.5G, mostly idle. The first
> three disks on each R410 are arranged into a RAID0 and hold the OSD
> data, while the fourth holds the OS and the OSD journal partition. All
> Ceph-related stuff is mounted on ext4 without barriers.
>
> First, I noticed a difference between benchmark performance and
> write speed through rbd from a small KVM instance running on one of
> the first two machines: while the bench gave me about 110 MB/s,
> writing zeros to the raw block device inside the VM with dd topped out
> at about 45 MB/s, and on the VM's filesystem (ext4 with default
> options) performance drops to ~23 MB/s. Things got worse when I
> started a second VM on the second host and ran the same dd tests
> simultaneously: performance was divided evenly, half for each
> instance :). Enabling jumbo frames, playing with CPU affinity for the
> ceph and VM instances, and trying different TCP congestion control
> algorithms had no effect at all - with DCTCP I get a slightly
> smoother network load graph and that's all.
>
> Can the list please suggest anything to try to improve performance?

Can you try setting

        rbd writeback window = 8192000

or similar, and see what kind of effect that has?  I suspect it'll speed 
up dd; I'm less sure about ext3.

Thanks!
sage


> 
> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2


Mysteriously poor write performance

2012-03-17 Thread Andrey Korolyov
Hi,

I've run some performance tests on the following configuration:

mon0, osd0 and mon1, osd1 - two twelve-core R410s with 32G RAM; mon2 -
a dom0 with three dedicated cores and 1.5G, mostly idle. The first
three disks on each R410 are arranged into a RAID0 and hold the OSD
data, while the fourth holds the OS and the OSD journal partition. All
Ceph-related stuff is mounted on ext4 without barriers.

First, I noticed a difference between benchmark performance and
write speed through rbd from a small KVM instance running on one of
the first two machines: while the bench gave me about 110 MB/s,
writing zeros to the raw block device inside the VM with dd topped out
at about 45 MB/s, and on the VM's filesystem (ext4 with default
options) performance drops to ~23 MB/s. Things got worse when I
started a second VM on the second host and ran the same dd tests
simultaneously: performance was divided evenly, half for each
instance :). Enabling jumbo frames, playing with CPU affinity for the
ceph and VM instances, and trying different TCP congestion control
algorithms had no effect at all - with DCTCP I get a slightly
smoother network load graph and that's all.
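
For anyone who wants to repeat the dd-style measurement without dd, here
is a rough sketch of the same kind of test: write zeros sequentially,
fsync at the end, and report MB/s. It is only an illustration, not the
command used for the numbers above. The function name write_zeros and
its parameters are made up for this example, and it assumes Python 3 on
Linux. It goes through the page cache (no O_DIRECT), so the final fsync
is included in the timing.

```python
import os
import time


def write_zeros(path, total_bytes, block_size=1 << 20):
    """Write total_bytes of zeros to path; return throughput in MB/s.

    Hypothetical helper for illustration only. Works on a regular file
    or on an existing block device node (which it will overwrite).
    """
    block = bytes(block_size)  # one block of zero bytes
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    start = time.monotonic()
    try:
        while written < total_bytes:
            # The last chunk may be shorter than a full block.
            chunk = block[: min(block_size, total_bytes - written)]
            # os.write may write fewer bytes than asked; count what landed.
            written += os.write(fd, chunk)
        os.fsync(fd)  # count the flush to the device in the elapsed time
    finally:
        os.close(fd)
    elapsed = max(time.monotonic() - start, 1e-9)  # avoid division by zero
    return written / elapsed / 1e6
```

Inside the guest one would point it at a scratch device or a file on
the guest filesystem, e.g. write_zeros("/path/to/scratch", 1 << 30);
the path here is a placeholder, and anything at that path is
overwritten.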

Can the list please suggest anything to try to improve performance?

ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2