Re: how to debug slow rbd block device

2012-05-23 Thread Josh Durgin

On 05/23/2012 02:03 AM, Andrey Korolyov wrote:

Hi Josh,

Can you please answer this question on the list? It is important when
someone wants to build an HA KVM cluster on the rbd backend and needs the
writeback cache. Thanks!

On Wed, May 23, 2012 at 10:30 AM, Josh Durgin  wrote:

On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:


Hi,


So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
will test tomorrow. Thanks.


Can we pass this via the qemu -drive option?



Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

The normal qemu cache=writeback/writethrough/none option will work in qemu
1.2.

Josh


By the way, is it possible to flush the cache from outside? I may need that
for VM live migration, and such a hook would be helpful.


Qemu will do that for you in many cases, but it looks like we need to 
implement bdrv_invalidate_cache to make live migration work.


http://tracker.newdream.net/issues/2467

librbd itself flushes the cache when a snapshot is created or the image 
is closed, but there's no way to trigger it directly right now.


http://tracker.newdream.net/issues/2468


Re: how to debug slow rbd block device

2012-05-23 Thread Mark Nelson

On 5/23/12 2:22 AM, Andrey Korolyov wrote:

Hi,

For Stefan:

Increasing socket memory gained me a few percent on fio tests inside the
VM (I measured a 'max-iops-until-ceph-throws-message-about-delayed-write'
figure). More importantly, the osd process should, if possible, be pinned
to a dedicated core or two, and all other processes should be kept off
those cores (via cgroups or manually), because even a single unpinned
four-core VM process during the test cuts osd throughput almost in half;
the same goes for any other heavy process on the host.

Very interesting!  Thanks for sharing.


net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216



On Wed, May 23, 2012 at 10:30 AM, Josh Durgin  wrote:

On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:

Hi,


So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
will test tomorrow. Thanks.

Can we pass this via the qemu -drive option?


Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

The normal qemu cache=writeback/writethrough/none option will work in qemu
1.2.

Josh

By the way, is it possible to flush the cache from outside? I may need that
for VM live migration, and such a hook would be helpful.





Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
Am 23.05.2012 10:30, schrieb Stefan Priebe - Profihost AG:
> Am 22.05.2012 23:11, schrieb Greg Farnum:
>> On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
>>> Am 22.05.2012 22:49, schrieb Greg Farnum:
 Anyway, it looks like you're just paying a synchronous write penalty
>>>  
>>>  
>>> What exactly does that mean? Shouldn't a single-threaded write to four
>>> 260MB/s devices give at least 100MB/s?
>>
>> Well, with dd you've got a single thread issuing synchronous IO requests to 
>> the kernel. We could have it set up so that those synchronous requests get 
>> split up, but they aren't, and between the kernel and KVM it looks like when 
>> it needs to make a write out to disk it sends one request at a time to the 
>> Ceph backend. So you aren't writing to four 260MB/s devices; you are writing 
>> to one 260MB/s device without any pipelining — meaning you send off a 4MB 
>> write, then wait until it's done, then send off a second 4MB write, then 
>> wait until it's done, etc.
>> Frankly I'm surprised you aren't getting a bit more throughput than you're 
>> seeing (I remember other people getting much more out of less beefy boxes), 
>> but it doesn't much matter because what you really want to do is enable the 
>> client-side writeback cache in RBD, which will dispatch multiple requests at 
>> once and not force writes to be committed before reporting back to the 
>> kernel. Then you should indeed be writing to four 260MB/s devices at once. :)
> 
> OK, I understand that, but the question remains: where is the bottleneck
> in this case? I see no more than 40% network load, no more than 10% CPU
> load, and only 40MB/s to the SSD. I would still expect a network load of
> 70-90%.

*grr* I found a broken SATA cable ;-(

This is now with the SATA cable replaced and rbd cache turned on:

systembootimage:/mnt# dd if=/dev/zero of=test bs=4M count=1000
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB) copied, 57,9194 s, 72,4 MB/s

systembootimage:/mnt# dd if=test of=/dev/null bs=4M count=1000
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB) copied, 46,3499 s, 90,5 MB/s

rados write bench 8 threads:
Total time run:     60.222947
Total writes made:  1519
Write size:         4194304
Bandwidth (MB/sec): 100.892

Average Latency:    0.317098
Max latency:        1.88908
Min latency:        0.089681

Stefan


Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
Am 22.05.2012 23:11, schrieb Greg Farnum:
> On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
>> Am 22.05.2012 22:49, schrieb Greg Farnum:
>>> Anyway, it looks like you're just paying a synchronous write penalty
>>  
>>  
>> What exactly does that mean? Shouldn't a single-threaded write to four
>> 260MB/s devices give at least 100MB/s?
> 
> Well, with dd you've got a single thread issuing synchronous IO requests to 
> the kernel. We could have it set up so that those synchronous requests get 
> split up, but they aren't, and between the kernel and KVM it looks like when 
> it needs to make a write out to disk it sends one request at a time to the 
> Ceph backend. So you aren't writing to four 260MB/s devices; you are writing 
> to one 260MB/s device without any pipelining — meaning you send off a 4MB 
> write, then wait until it's done, then send off a second 4MB write, then wait 
> until it's done, etc.
> Frankly I'm surprised you aren't getting a bit more throughput than you're 
> seeing (I remember other people getting much more out of less beefy boxes), 
> but it doesn't much matter because what you really want to do is enable the 
> client-side writeback cache in RBD, which will dispatch multiple requests at 
> once and not force writes to be committed before reporting back to the 
> kernel. Then you should indeed be writing to four 260MB/s devices at once. :)

OK, I understand that, but the question remains: where is the bottleneck
in this case? I see no more than 40% network load, no more than 10% CPU
load, and only 40MB/s to the SSD. I would still expect a network load of
70-90%.

Greets and thanks,
Stefan


Re: how to debug slow rbd block device

2012-05-23 Thread Josh Durgin

On 05/23/2012 01:20 AM, Stefan Priebe - Profihost AG wrote:

Am 23.05.2012 09:19, schrieb Josh Durgin:

On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:

Am 23.05.2012 08:30, schrieb Josh Durgin:

On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:

Hi,


So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
will test tomorrow. Thanks.

Can we pass this via the qemu -drive option?


Yup, see
http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

I'm sorry, I still don't get where to pass rbd_cache_max_dirty etc.
Can I just add them to ceph.conf? Even with qemu 1.0?


You can use any of the rbd-specific options (like rbd_cache_max_dirty)
with qemu >= 0.15.

You can set them in a global ceph.conf file, or specify them on the qemu
command line like:

qemu -m 512 -drive
file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio


So this is enough for testing on the KVM host?

/etc/ceph/ceph.conf

[global]
 auth supported = cephx
 keyring = /etc/ceph/$name.keyring
 rbd_cache = true
 rbd_cache_size = 32M


This should be a number in bytes - M/G/k/etc aren't parsed. Assuming you 
have the monitors listed below, this is fine. If you're not using the 
admin user, you'll need to add :id=name to the -drive string - it can't 
be set in the config file.
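For example, 32 MB spelled out in bytes (an illustrative snippet only, not
something I've tested against your cluster; the dirty limit is just an
example value, keep it below rbd_cache_size):

[global]
 rbd_cache = true
 rbd_cache_size = 33554432       # 32 * 1024 * 1024
 rbd_cache_max_dirty = 25165824  # example value
 rbd_cache_max_age = 2.0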



 rbd_cache_max_age = 2.0

...

Stefan




Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
Am 23.05.2012 09:19, schrieb Josh Durgin:
> On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:
>> Am 23.05.2012 08:30, schrieb Josh Durgin:
>>> On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
 Hi,

>> So try enabling RBD writeback caching — see
>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>> will test tomorrow. Thanks.
Can we pass this via the qemu -drive option?
>>>
>>> Yup, see
>>> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
>> I'm sorry, I still don't get where to pass rbd_cache_max_dirty etc.
>> Can I just add them to ceph.conf? Even with qemu 1.0?
> 
> You can use any of the rbd-specific options (like rbd_cache_max_dirty)
> with qemu >= 0.15.
> 
> You can set them in a global ceph.conf file, or specify them on the qemu
> command line like:
> 
> qemu -m 512 -drive
> file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio

So this is enough for testing on the KVM host?

/etc/ceph/ceph.conf

[global]
auth supported = cephx
keyring = /etc/ceph/$name.keyring
rbd_cache = true
rbd_cache_size = 32M
rbd_cache_max_age = 2.0

...

Stefan


Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
Am 23.05.2012 09:22, schrieb Andrey Korolyov:
> Hi,
> 
> For Stefan:
> 
> Increasing socket memory gained me a few percent on fio tests inside the
> VM (I measured a 'max-iops-until-ceph-throws-message-about-delayed-write'
> figure). More importantly, the osd process should, if possible, be pinned
> to a dedicated core or two, and all other processes should be kept off
> those cores (via cgroups or manually), because even a single unpinned
> four-core VM process during the test cuts osd throughput almost in half;
> the same goes for any other heavy process on the host.
I tried that using taskset but didn't get any noticeable boost. The kernel
already avoids bouncing the process from core to core whenever possible,
and since these machines are dedicated to the osds there is no other load.

> net.core.rmem_max = 16777216
> net.core.wmem_max = 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.ipv4.tcp_wmem = 4096 65536 16777216
This gained me around 3-4 MB/s.

Stefan


Re: how to debug slow rbd block device

2012-05-23 Thread Josh Durgin

On 05/23/2012 12:22 AM, Stefan Priebe - Profihost AG wrote:

Am 23.05.2012 09:19, schrieb Josh Durgin:

On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:
You can use any of the rbd-specific options (like rbd_cache_max_dirty)
with qemu >= 0.15.

You can set them in a global ceph.conf file, or specify them on the qemu
command line like:

qemu -m 512 -drive
file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio

Ah, thanks, and sorry. Is there a way to verify which options are
active/working on a specific rbd block device?


There's no way to ask which options it's using while it's running. That
would probably be a good thing to add (maybe as an admin socket
command).

Until then, if you want to know the exact settings of all your rbd
disks, you can specify all the necessary options on the qemu command
line, and not have a ceph.conf file.
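For example, something along these lines (an illustrative sketch; the pool,
image, id and cache sizes are placeholders, adjust them to your setup):

qemu -m 512 -drive
file=rbd:pool/image:id=admin:rbd_cache=true:rbd_cache_size=33554432:rbd_cache_max_dirty=25165824:rbd_cache_max_age=2.0,format=raw,if=virtio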

Josh


Re: how to debug slow rbd block device

2012-05-23 Thread Andrey Korolyov
Hi,

For Stefan:

Increasing socket memory gained me a few percent on fio tests inside the
VM (I measured a 'max-iops-until-ceph-throws-message-about-delayed-write'
figure). More importantly, the osd process should, if possible, be pinned
to a dedicated core or two, and all other processes should be kept off
those cores (via cgroups or manually), because even a single unpinned
four-core VM process during the test cuts osd throughput almost in half;
the same goes for any other heavy process on the host.
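As a rough illustration only (the core numbers and PID are made up, and this
assumes the v1 cpuset controller is mounted under /sys/fs/cgroup/cpuset):

# pin a running ceph-osd (pid 1234) to cores 2-3
taskset -pc 2,3 1234

# or the same with a cpuset cgroup
mkdir /sys/fs/cgroup/cpuset/osd
echo 2-3  > /sys/fs/cgroup/cpuset/osd/cpuset.cpus
echo 0    > /sys/fs/cgroup/cpuset/osd/cpuset.mems
echo 1234 > /sys/fs/cgroup/cpuset/osd/tasks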

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
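To apply them, something like this should do (same values as above; put the
four lines in /etc/sysctl.conf and run sysctl -p to make them persistent):

sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"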



On Wed, May 23, 2012 at 10:30 AM, Josh Durgin  wrote:
> On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
>>
>> Hi,
>>
So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
will test tomorrow. Thanks.
>>
>> Can we pass this via the qemu -drive option?
>
>
> Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
>
> The normal qemu cache=writeback/writethrough/none option will work in qemu
> 1.2.
>
> Josh

By the way, is it possible to flush the cache from outside? I may need that
for VM live migration, and such a hook would be helpful.




Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
Am 23.05.2012 09:19, schrieb Josh Durgin:
> On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:
> You can use any of the rbd-specific options (like rbd_cache_max_dirty)
> with qemu >= 0.15.
> 
> You can set them in a global ceph.conf file, or specify them on the qemu
> command line like:
> 
> qemu -m 512 -drive
> file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio
Ah, thanks, and sorry. Is there a way to verify which options are
active/working on a specific rbd block device?

Stefan


Re: how to debug slow rbd block device

2012-05-23 Thread Josh Durgin

On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:

Am 23.05.2012 08:30, schrieb Josh Durgin:

On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:

Hi,


So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
will test tomorrow. Thanks.

Can we pass this via the qemu -drive option?


Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

I'm sorry, I still don't get where to pass rbd_cache_max_dirty etc.
Can I just add them to ceph.conf? Even with qemu 1.0?


You can use any of the rbd-specific options (like rbd_cache_max_dirty) 
with qemu >= 0.15.


You can set them in a global ceph.conf file, or specify them on the qemu 
command line like:


qemu -m 512 -drive 
file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio


Josh


Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
Am 23.05.2012 08:30, schrieb Josh Durgin:
> On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
>> Hi,
>>
So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
will test tomorrow. Thanks.
>> Can we pass this via the qemu -drive option?
> 
> Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
I'm sorry, I still don't get where to pass rbd_cache_max_dirty etc.
Can I just add them to ceph.conf? Even with qemu 1.0?

Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Josh Durgin

On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:

Hi,


So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
will test tomorrow. Thanks.

Can we pass this via the qemu -drive option?


Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

The normal qemu cache=writeback/writethrough/none option will work in 
qemu 1.2.
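i.e. once that's in, something like this should be all that's needed (a
sketch; pool/image is a placeholder):

qemu -m 512 -drive file=rbd:pool/image,format=raw,if=virtio,cache=writeback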


Josh


Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe - Profihost AG
Hi,

>> So try enabling RBD writeback caching — see
>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>> will test tomorrow. Thanks.
Can we pass this via the qemu -drive option?

Stefan


Am 22.05.2012 23:11, schrieb Greg Farnum:
> On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
>> Am 22.05.2012 22:49, schrieb Greg Farnum:
>>> Anyway, it looks like you're just paying a synchronous write penalty
>>  
>>  
>> What exactly does that mean? Shouldn't a single-threaded write to four
>> 260MB/s devices give at least 100MB/s?
> 
> Well, with dd you've got a single thread issuing synchronous IO requests to 
> the kernel. We could have it set up so that those synchronous requests get 
> split up, but they aren't, and between the kernel and KVM it looks like when 
> it needs to make a write out to disk it sends one request at a time to the 
> Ceph backend. So you aren't writing to four 260MB/s devices; you are writing 
> to one 260MB/s device without any pipelining — meaning you send off a 4MB 
> write, then wait until it's done, then send off a second 4MB write, then wait 
> until it's done, etc.
> Frankly I'm surprised you aren't getting a bit more throughput than you're 
> seeing (I remember other people getting much more out of less beefy boxes), 
> but it doesn't much matter because what you really want to do is enable the 
> client-side writeback cache in RBD, which will dispatch multiple requests at 
> once and not force writes to be committed before reporting back to the 
> kernel. Then you should indeed be writing to four 260MB/s devices at once. :)
> 
>>  
>>> since with 1 write at a time you're getting 30-40MB/s out of rados bench, 
>>> but with 16 you're getting >100MB/s.
>>> (If you bump up past 16 or increase the size of each with -b you may
>>> find yourself getting even more.)
>> yep noticed that.
>>  
>>> So try enabling RBD writeback caching — see 
>>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>> will test tomorrow. Thanks.
>>  
>> Stefan  
> 
> 


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
> Am 22.05.2012 22:49, schrieb Greg Farnum:
> > Anyway, it looks like you're just paying a synchronous write penalty
>  
>  
> What exactly does that mean? Shouldn't a single-threaded write to four
> 260MB/s devices give at least 100MB/s?

Well, with dd you've got a single thread issuing synchronous IO requests to the 
kernel. We could have it set up so that those synchronous requests get split 
up, but they aren't, and between the kernel and KVM it looks like when it needs 
to make a write out to disk it sends one request at a time to the Ceph backend. 
So you aren't writing to four 260MB/s devices; you are writing to one 260MB/s 
device without any pipelining — meaning you send off a 4MB write, then wait 
until it's done, then send off a second 4MB write, then wait until it's done, 
etc.
Frankly I'm surprised you aren't getting a bit more throughput than you're 
seeing (I remember other people getting much more out of less beefy boxes), but 
it doesn't much matter because what you really want to do is enable the 
client-side writeback cache in RBD, which will dispatch multiple requests at 
once and not force writes to be committed before reporting back to the kernel. 
Then you should indeed be writing to four 260MB/s devices at once. :)
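You can see the same effect from inside the guest with fio, e.g. something
like this (a rough, untested sketch; /dev/vdb is a placeholder for the
rbd-backed disk, and writing to it is destructive):

# one 4MB write in flight vs. sixteen
fio --name=qd1  --filename=/dev/vdb --rw=write --bs=4M --direct=1 \
    --ioengine=libaio --iodepth=1  --runtime=60 --time_based
fio --name=qd16 --filename=/dev/vdb --rw=write --bs=4M --direct=1 \
    --ioengine=libaio --iodepth=16 --runtime=60 --time_based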

>  
> > since with 1 write at a time you're getting 30-40MB/s out of rados bench, 
> > but with 16 you're getting >100MB/s.
> > (If you bump up past 16 or increase the size of each with -b you may
> > find yourself getting even more.)
> yep noticed that.
>  
> > So try enabling RBD writeback caching — see 
> > http://marc.info/?l=ceph-devel&m=133758599712768&w=2
> will test tomorrow. Thanks.
>  
> Stefan  




Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

Am 22.05.2012 22:49, schrieb Greg Farnum:

Anyway, it looks like you're just paying a synchronous write penalty
What exactly does that mean? Shouldn't a single-threaded write to four
260MB/s devices give at least 100MB/s?



since with 1 write at a time you're getting 30-40MB/s out of rados bench, but with 
16 you're getting >100MB/s.
(If you bump up past 16 or increase the size of each with -b you may
find yourself getting even more.)

yep noticed that.


So try enabling RBD writeback caching — see 
http://marc.info/?l=ceph-devel&m=133758599712768&w=2

will test tomorrow. Thanks.

Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

Am 22.05.2012 22:48, schrieb Mark Nelson:

Can you use something like iostat or collectl to check and see if the
write throughput to each SSD is roughly equal during your tests?
It is, but only around 20-40MB/s each, even though they can do 260MB/s
with sequential writes.


> Also, what FS are you using and how did you format/mount it?
just:
mkfs.xfs /dev/sdb1
mount options: noatime,nodiratime,nobarrier,discard
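i.e. roughly this (mount point /srv as in my setup, from memory):

mount -o noatime,nodiratime,nobarrier,discard /dev/sdb1 /srv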

Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 1:30 PM, Stefan Priebe wrote:
> Am 22.05.2012 21:52, schrieb Greg Farnum:
> > On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
> > Huh. That's less than I would expect. Especially since it ought to be going 
> > through the page cache.
> > What version of RBD is KVM using here?
>  
>  
> v0.47.1
>  
> > Can you (from the KVM host) run
> > "rados -p data bench seq 60 -t 1"
> > "rados -p data bench seq 60 -t 16"
> > and paste the final output from both?
>  
>  
> OK here it is first with write then with seq read.
>  
> # rados -p data bench 60 write -t 1
> # rados -p data bench 60 write -t 16
> # rados -p data bench 60 seq -t 1
> # rados -p data bench 60 seq -t 16
>  
> Output is here:
> http://pastebin.com/iFy8GS7i

Heh, yep, sorry about the commands — haven't run them personally in a while. :)

Anyway, it looks like you're just paying a synchronous write penalty, since 
with 1 write at a time you're getting 30-40MB/s out of rados bench, but with 16 
you're getting >100MB/s. (If you bump up past 16 or increase the size of each 
with -b you may find yourself getting even more.)
So try enabling RBD writeback caching — see 
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
-Greg



Re: how to debug slow rbd block device

2012-05-22 Thread Mark Nelson

On 05/22/2012 03:30 PM, Stefan Priebe wrote:

Am 22.05.2012 21:52, schrieb Greg Farnum:

On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
Huh. That's less than I would expect. Especially since it ought to be 
going through the page cache.

What version of RBD is KVM using here?

v0.47.1


Can you (from the KVM host) run
"rados -p data bench seq 60 -t 1"
"rados -p data bench seq 60 -t 16"
and paste the final output from both?

OK here it is first with write then with seq read.

# rados -p data bench 60 write -t 1
# rados -p data bench 60 write -t 16
# rados -p data bench 60 seq -t 1
# rados -p data bench 60 seq -t 16

Output is here:
http://pastebin.com/iFy8GS7i

Thanks!

Stefan


Hi Stefan,

Can you use something like iostat or collectl to check and see if the 
write throughput to each SSD is roughly equal during your tests?  Also, 
what FS are you using and how did you format/mount it?  I've been doing 
some tests internally using 2 nodes with 5 OSDs each backed by SSDs for 
both data and journal and am seeing about 600MB/s from the client (over 
10GE) on a fresh ceph fs.
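Something like this on each OSD node would show it (device name is just an
example):

iostat -x -m 1 /dev/sdb     # extended stats in MB/s, 1-second interval
collectl -sD                # or per-disk detail with collectl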


Mark




Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

Am 22.05.2012 21:52, schrieb Greg Farnum:

On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
Huh. That's less than I would expect. Especially since it ought to be going 
through the page cache.
What version of RBD is KVM using here?

v0.47.1


Can you (from the KVM host) run
"rados -p data bench seq 60 -t 1"
"rados -p data bench seq 60 -t 16"
and paste the final output from both?

OK here it is first with write then with seq read.

# rados -p data bench 60 write -t 1
# rados -p data bench 60 write -t 16
# rados -p data bench 60 seq -t 1
# rados -p data bench 60 seq -t 16

Output is here:
http://pastebin.com/iFy8GS7i

Thanks!

Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

Am 22.05.2012 21:52, schrieb Greg Farnum:

On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
Huh. That's less than I would expect. Especially since it ought to be going 
through the page cache.
What version of RBD is KVM using here?

v0.47.1


Can you (from the KVM host) run
"rados -p data bench seq 60 -t 1"
"rados -p data bench seq 60 -t 16"
and paste the final output from both?

I think you meant:
 rados -p data bench 60 seq -t 1 ?

but even that results in:
~# rados -p data bench 60 seq -t 1
Must write data before running a read benchmark!
error during benchmark: -2
error 2: (2) No such file or directory

Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
> Am 22.05.2012 21:35, schrieb Greg Farnum:
> > What does your test look like? With multiple large IOs in flight we can 
> > regularly fill up a 1GbE link on our test clusters. With smaller or fewer 
> > IOs in flight performance degrades accordingly.
> 
> 
> 
> iperf shows 950Mbit/s so this is OK (from KVM host to OSDs)
> 
> sorry:
> dd if=/dev/zero of=test bs=4M count=1000; dd if=test of=/dev/null bs=4M 
> count=1000;
> 1000+0 records in
> 1000+0 records out
> 4194304000 bytes (4,2 GB) copied, 99,7352 s, 42,1 MB/s
> 
> 1000+0 records in
> 1000+0 records out
> 4194304000 bytes (4,2 GB) copied, 47,4493 s, 88,4 MB/s

Huh. That's less than I would expect. Especially since it ought to be going 
through the page cache.
What version of RBD is KVM using here?

Can you (from the KVM host) run
"rados -p data bench seq 60 -t 1"
"rados -p data bench seq 60 -t 16"
and paste the final output from both?



Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

Am 22.05.2012 21:35, schrieb Greg Farnum:

What does your test look like? With multiple large IOs in flight we can 
regularly fill up a 1GbE link on our test clusters. With smaller or fewer IOs 
in flight performance degrades accordingly.


iperf shows 950Mbit/s so this is OK (from KVM host to OSDs)

sorry:
dd if=/dev/zero of=test bs=4M count=1000; dd if=test of=/dev/null bs=4M 
count=1000;

1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB) copied, 99,7352 s, 42,1 MB/s

1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB) copied, 47,4493 s, 88,4 MB/s
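For what it's worth, a variant that takes the guest page cache out of the
picture would look roughly like this (untested here):

dd if=/dev/zero of=test bs=4M count=1000 oflag=direct conv=fdatasync
dd if=test of=/dev/null bs=4M count=1000 iflag=direct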

Greets
Stefan


On Tuesday, May 22, 2012 at 5:45 AM, Stefan Priebe - Profihost AG wrote:


Hi list,

my ceph block testcluster is now running fine.

Setup:
4x ceph servers
- 3x mon with /mon on local os SATA disk
- 4x OSD with /journal on tmpfs and /srv on intel ssd

all of them use 2x 1Gbit/s lacp trunk.

1x KVM Host system (2x 1Gbit/s lacp trunk)

With one KVM i do not get more than 40MB/s and my network link is just
at 40% of 1Gbit/s.

Is this expected? If not where can i start searching / debugging?

Thanks,
Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
What does your test look like? With multiple large IOs in flight we can 
regularly fill up a 1GbE link on our test clusters. With smaller or fewer IOs 
in flight performance degrades accordingly. 


On Tuesday, May 22, 2012 at 5:45 AM, Stefan Priebe - Profihost AG wrote:

> Hi list,
> 
> my ceph block testcluster is now running fine.
> 
> Setup:
> 4x ceph servers
> - 3x mon with /mon on local os SATA disk
> - 4x OSD with /journal on tmpfs and /srv on intel ssd
> 
> all of them use 2x 1Gbit/s lacp trunk.
> 
> 1x KVM Host system (2x 1Gbit/s lacp trunk)
> 
> With one KVM i do not get more than 40MB/s and my network link is just
> at 40% of 1Gbit/s.
> 
> Is this expected? If not where can i start searching / debugging?
> 
> Thanks,
> Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

Am 22.05.2012 16:52, schrieb Andrey Korolyov:

Hi,

I ran into almost the same problem about two months ago, and there are a
couple of corner cases: near-default TCP parameters, a small journal size,
disks that are not backed by a controller with an NVRAM cache, and high
load on the osd CPU caused by other processes. In the end I was able to
achieve 115MB/s for large linear writes on a raw rbd block device inside a
VM, with the journal on tmpfs and the osds on a RAID0 built on top of
three SATA disks.


Which TCP parameters would you recommend? The journal size is 1GB on
tmpfs right now. Instead of 3 SATA disks I'm using one Intel SSD. The CPU
load is at most 10% on each osd. A "ceph osd tell X bench" shows me
260MB/s write on each OSD (Intel SSD).


Greets
Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Andrey Korolyov
Hi,

I ran into almost the same problem about two months ago, and there are a
couple of corner cases: near-default TCP parameters, a small journal size,
disks that are not backed by a controller with an NVRAM cache, and high
load on the osd CPU caused by other processes. In the end I was able to
achieve 115MB/s for large linear writes on a raw rbd block device inside a
VM, with the journal on tmpfs and the osds on a RAID0 built on top of
three SATA disks.
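For reference, the journal is configured in ceph.conf; a sketch only, with
an example path and size for a tmpfs-backed journal like Stefan's:

[osd]
 osd journal = /journal/osd.$id
 osd journal size = 1024   # in MB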

On Tue, May 22, 2012 at 4:45 PM, Stefan Priebe - Profihost AG
 wrote:
> Hi list,
>
> my ceph block testcluster is now running fine.
>
> Setup:
> 4x ceph servers
>  - 3x mon with /mon on local os SATA disk
>  - 4x OSD with /journal on tmpfs and /srv on intel ssd
>
> all of them use 2x 1Gbit/s lacp trunk.
>
> 1x KVM Host system (2x 1Gbit/s lacp trunk)
>
> With one KVM i do not get more than 40MB/s and my network link is just
> at 40% of 1Gbit/s.
>
> Is this expected? If not where can i start searching / debugging?
>
> Thanks,
> Stefan


how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe - Profihost AG
Hi list,

my ceph block testcluster is now running fine.

Setup:
4x ceph servers
  - 3x mon with /mon on local os SATA disk
  - 4x OSD with /journal on tmpfs and /srv on intel ssd

all of them use 2x 1Gbit/s lacp trunk.

1x KVM Host system (2x 1Gbit/s lacp trunk)

With one KVM guest I do not get more than 40MB/s, and my network link is
only at 40% of 1Gbit/s.

Is this expected? If not where can i start searching / debugging?

Thanks,
Stefan