Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
Hi,

 So try enabling RBD writeback caching — see
 http://marc.info/?l=ceph-devel&m=133758599712768&w=2
 will test tomorrow. Thanks.
Can we pass this to the qemu -drive option?

Stefan


On 22.05.2012 23:11, Greg Farnum wrote:
 On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
 On 22.05.2012 22:49, Greg Farnum wrote:
 Anyway, it looks like you're just paying a synchronous write penalty
  
  
 What does that mean exactly? Shouldn't a single threaded write to four
 260MB/s devices give at least 100MB/s?
 
 Well, with dd you've got a single thread issuing synchronous IO requests to 
 the kernel. We could have it set up so that those synchronous requests get 
 split up, but they aren't, and between the kernel and KVM it looks like when 
 it needs to make a write out to disk it sends one request at a time to the 
 Ceph backend. So you aren't writing to four 260MB/s devices; you are writing 
 to one 260MB/s device without any pipelining — meaning you send off a 4MB 
 write, then wait until it's done, then send off a second 4MB write, then wait 
 until it's done, etc.
 Frankly I'm surprised you aren't getting a bit more throughput than you're 
 seeing (I remember other people getting much more out of less beefy boxes), 
 but it doesn't much matter because what you really want to do is enable the 
 client-side writeback cache in RBD, which will dispatch multiple requests at 
 once and not force writes to be committed before reporting back to the 
 kernel. Then you should indeed be writing to four 260MB/s devices at once. :)
 
  
 since with 1 write at a time you're getting 30-40MB/s out of rados bench, 
 but with 16 you're getting 100MB/s.
 (If you bump up past 16 or increase the size of each with -b you may
 find yourself getting even more.)
 yep noticed that.
  
 So try enabling RBD writeback caching — see
 http://marc.info/?l=ceph-devel&m=133758599712768&w=2
 will test tomorrow. Thanks.
  
 Stefan  
 
 


Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
On 23.05.2012 08:30, Josh Durgin wrote:
 On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
 Hi,

 So try enabling RBD writeback caching — see
 http://marc.info/?l=ceph-devel&m=133758599712768&w=2
 will test tomorrow. Thanks.
 Can we pass this to the qemu -drive option?
 
 Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
I'm sorry, I still don't get where to pass rbd_cache_max_dirty, ...
Can I add it just to ceph.conf? Even with qemu 1.0?

Stefan


Re: how to debug slow rbd block device

2012-05-23 Thread Josh Durgin

On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:

On 23.05.2012 08:30, Josh Durgin wrote:

On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:

Hi,


So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
will test tomorrow. Thanks.

Can we pass this to the qemu -drive option?


Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

I'm sorry, I still don't get where to pass rbd_cache_max_dirty, ...
Can I add it just to ceph.conf? Even with qemu 1.0?


You can use any of the rbd-specific options (like rbd_cache_max_dirty) 
with qemu >= 0.15.


You can set them in a global ceph.conf file, or specify them on the qemu 
command line like:


qemu -m 512 -drive 
file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio
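For reference, a minimal sketch of the equivalent ceph.conf entries (assuming
librbd on the KVM host reads /etc/ceph/ceph.conf; option names may be written
with spaces or underscores):

[global]
    rbd_cache = true
    rbd_cache_max_dirty = 0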


Josh


Re: how to debug slow rbd block device

2012-05-23 Thread Andrey Korolyov
Hi,

For Stefan:

Increasing socket memory gave me a few percent improvement on fio tests
inside the VM (I measured the
'max-iops-until-ceph-throws-message-about-delayed-write' parameter).
More importantly, the osd process should, if possible, be pinned to a
dedicated core or two, and all other processes should be kept off those
cores (you can do this via cgroups or manually), because even a single
non-pinned four-core VM process during the test cuts the osd's throughput
almost in half; the same goes for any other heavy process on the host.
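
A minimal pinning sketch (assuming a single ceph-osd process per host and
that cores 2-3 are reserved for it; the core numbers and the qemu pid are
illustrative):

# pin the running osd to cores 2 and 3
taskset -cp 2,3 $(pidof ceph-osd)

# keep the KVM guest off those cores (replace <qemu-pid> with the real pid)
taskset -cp 0,1 <qemu-pid>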

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
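
These can be applied at runtime with sysctl and persisted in /etc/sysctl.conf
(a sketch, assuming root on the osd and KVM hosts):

sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# or append the four settings to /etc/sysctl.conf and run: sysctl -p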



On Wed, May 23, 2012 at 10:30 AM, Josh Durgin josh.dur...@inktank.com wrote:
 On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:

 Hi,

 So try enabling RBD writeback caching — see
 http://marc.info/?l=ceph-devel&m=133758599712768&w=2
 will test tomorrow. Thanks.

 Can we pass this to the qemu -drive option?


 Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

 The normal qemu cache=writeback/writethrough/none option will work in qemu
 1.2.

 Josh

By the way, is it possible to flush the cache from outside? I may need that
for VM live migration, and such a hook would be helpful.





Re: how to debug slow rbd block device

2012-05-23 Thread Josh Durgin

On 05/23/2012 12:22 AM, Stefan Priebe - Profihost AG wrote:

On 23.05.2012 09:19, Josh Durgin wrote:

On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:
You can use any of the rbd-specific options (like rbd_cache_max_dirty)
with qemu >= 0.15.

You can set them in a global ceph.conf file, or specify them on the qemu
command line like:

qemu -m 512 -drive
file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio

Ah, thanks and sorry. Is there a way to verify which options are active /
working on a specific rbd block device?


There's no way to ask which options it's using while it's running. That
would probably be a good thing to add (maybe as an admin socket
command).

Until then, if you want to know the exact settings of all your rbd
disks, you can specify all the necessary options on the qemu command
line, and not have a ceph.conf file.
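
For example, a sketch of a fully self-contained -drive string (the pool/image
name and the cache values are illustrative; add :id=<user> to the rbd string
if you are not using the admin user):

qemu -m 512 -drive file=rbd:rbd/vm1:rbd_cache=true:rbd_cache_size=33554432:rbd_cache_max_dirty=16777216:rbd_cache_max_age=2.0,format=raw,if=virtio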

Josh


Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
On 23.05.2012 09:22, Andrey Korolyov wrote:
 Hi,
 
 For Stefan:
 
 Increasing socket memory gave me a few percent improvement on fio tests
 inside the VM (I measured the
 'max-iops-until-ceph-throws-message-about-delayed-write' parameter).
 More importantly, the osd process should, if possible, be pinned to a
 dedicated core or two, and all other processes should be kept off those
 cores (you can do this via cgroups or manually), because even a single
 non-pinned four-core VM process during the test cuts the osd's throughput
 almost in half; the same goes for any other heavy process on the host.
I tried that using taskset but didn't get any noticeable boost. Also, the
kernel already prevents jumping from core to core whenever possible. As
these machines are dedicated to the osds, there is no other load.

 net.core.rmem_max = 16777216
 net.core.wmem_max = 16777216
 net.ipv4.tcp_rmem = 4096 87380 16777216
 net.ipv4.tcp_wmem = 4096 65536 16777216
This gave me around an extra 3-4 MB/s.

Stefan


Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
On 23.05.2012 09:19, Josh Durgin wrote:
 On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:
 On 23.05.2012 08:30, Josh Durgin wrote:
 On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
 Hi,

 So try enabling RBD writeback caching — see
 http://marc.info/?l=ceph-devel&m=133758599712768&w=2
 will test tomorrow. Thanks.
 Can we pass this to the qemu -drive option?

 Yup, see
 http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
 I'm sorry, I still don't get where to pass rbd_cache_max_dirty, ...
 Can I add it just to ceph.conf? Even with qemu 1.0?
 
 You can use any of the rbd-specific options (like rbd_cache_max_dirty)
 with qemu >= 0.15.
 
 You can set them in a global ceph.conf file, or specify them on the qemu
 command line like:
 
 qemu -m 512 -drive
 file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio

So this is enough for testing on the KVM host?

/etc/ceph/ceph.conf

[global]
auth supported = cephx
keyring = /etc/ceph/$name.keyring
rbd_cache = true
rbd_cache_size = 32M
rbd_cache_max_age = 2.0

...

Stefan


Re: how to debug slow rbd block device

2012-05-23 Thread Josh Durgin

On 05/23/2012 01:20 AM, Stefan Priebe - Profihost AG wrote:

On 23.05.2012 09:19, Josh Durgin wrote:

On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:

On 23.05.2012 08:30, Josh Durgin wrote:

On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:

Hi,


So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
will test tomorrow. Thanks.

Can we pass this to the qemu -drive option?


Yup, see
http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

I'm sorry, I still don't get where to pass rbd_cache_max_dirty, ...
Can I add it just to ceph.conf? Even with qemu 1.0?


You can use any of the rbd-specific options (like rbd_cache_max_dirty)
with qemu >= 0.15.

You can set them in a global ceph.conf file, or specify them on the qemu
command line like:

qemu -m 512 -drive
file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio


So this is enough for testing on the KVM host?

/etc/ceph/ceph.conf

[global]
 auth supported = cephx
 keyring = /etc/ceph/$name.keyring
 rbd_cache = true
 rbd_cache_size = 32M


This should be a number in bytes - M/G/k/etc aren't parsed. Assuming you 
have the monitors listed below, this is fine. If you're not using the 
admin user, you'll need to add :id=name to the -drive string - it can't 
be set in the config file.



 rbd_cache_max_age = 2.0

...

Stefan
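
Putting that together, a sketch of the adjusted config (32 MB written out in
bytes; the other values are as quoted above):

[global]
    auth supported = cephx
    keyring = /etc/ceph/$name.keyring
    rbd_cache = true
    rbd_cache_size = 33554432
    rbd_cache_max_age = 2.0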




Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
On 22.05.2012 23:11, Greg Farnum wrote:
 On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
 On 22.05.2012 22:49, Greg Farnum wrote:
 Anyway, it looks like you're just paying a synchronous write penalty
  
  
 What does that mean exactly? Shouldn't a single threaded write to four
 260MB/s devices give at least 100MB/s?
 
 Well, with dd you've got a single thread issuing synchronous IO requests to 
 the kernel. We could have it set up so that those synchronous requests get 
 split up, but they aren't, and between the kernel and KVM it looks like when 
 it needs to make a write out to disk it sends one request at a time to the 
 Ceph backend. So you aren't writing to four 260MB/s devices; you are writing 
 to one 260MB/s device without any pipelining — meaning you send off a 4MB 
 write, then wait until it's done, then send off a second 4MB write, then wait 
 until it's done, etc.
 Frankly I'm surprised you aren't getting a bit more throughput than you're 
 seeing (I remember other people getting much more out of less beefy boxes), 
 but it doesn't much matter because what you really want to do is enable the 
 client-side writeback cache in RBD, which will dispatch multiple requests at 
 once and not force writes to be committed before reporting back to the 
 kernel. Then you should indeed be writing to four 260MB/s devices at once. :)

OK, I understand that, but the question remains: where is the bottleneck in
this case? I see no more than 40% network load, no more than 10% cpu load,
and only 40MB/s to the SSD. I would still expect a network load of 70-90%.

Greets and thanks,
Stefan


Re: how to debug slow rbd block device

2012-05-23 Thread Stefan Priebe - Profihost AG
On 23.05.2012 10:30, Stefan Priebe - Profihost AG wrote:
 On 22.05.2012 23:11, Greg Farnum wrote:
 On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
 On 22.05.2012 22:49, Greg Farnum wrote:
 Anyway, it looks like you're just paying a synchronous write penalty
  
  
 What does that mean exactly? Shouldn't a single threaded write to four
 260MB/s devices give at least 100MB/s?

 Well, with dd you've got a single thread issuing synchronous IO requests to 
 the kernel. We could have it set up so that those synchronous requests get 
 split up, but they aren't, and between the kernel and KVM it looks like when 
 it needs to make a write out to disk it sends one request at a time to the 
 Ceph backend. So you aren't writing to four 260MB/s devices; you are writing 
 to one 260MB/s device without any pipelining — meaning you send off a 4MB 
 write, then wait until it's done, then send off a second 4MB write, then 
 wait until it's done, etc.
 Frankly I'm surprised you aren't getting a bit more throughput than you're 
 seeing (I remember other people getting much more out of less beefy boxes), 
 but it doesn't much matter because what you really want to do is enable the 
 client-side writeback cache in RBD, which will dispatch multiple requests at 
 once and not force writes to be committed before reporting back to the 
 kernel. Then you should indeed be writing to four 260MB/s devices at once. :)
 
 OK, I understand that, but the question remains: where is the bottleneck in
 this case? I see no more than 40% network load, no more than 10% cpu load,
 and only 40MB/s to the SSD. I would still expect a network load of 70-90%.

*gr* I found a broken SATA cable ;-(

This is now with the replaced SATA cable and with the rbd cache turned on:

systembootimage:/mnt# dd if=/dev/zero of=test bs=4M count=1000
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB) copied, 57,9194 s, 72,4 MB/s

systembootimage:/mnt# dd if=test of=/dev/null bs=4M count=1000
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB) copied, 46,3499 s, 90,5 MB/s

rados write bench, 8 threads:
Total time run:        60.222947
Total writes made:     1519
Write size:            4194304
Bandwidth (MB/sec):    100.892

Average Latency:       0.317098
Max latency:           1.88908
Min latency:           0.089681

Stefan


Re: how to debug slow rbd block device

2012-05-23 Thread Mark Nelson

On 5/23/12 2:22 AM, Andrey Korolyov wrote:

Hi,

For Stefan:

Increasing socket memory gave me a few percent improvement on fio tests
inside the VM (I measured the
'max-iops-until-ceph-throws-message-about-delayed-write' parameter).
More importantly, the osd process should, if possible, be pinned to a
dedicated core or two, and all other processes should be kept off those
cores (you can do this via cgroups or manually), because even a single
non-pinned four-core VM process during the test cuts the osd's throughput
almost in half; the same goes for any other heavy process on the host.

Very interesting!  Thanks for sharing.


net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216



On Wed, May 23, 2012 at 10:30 AM, Josh Durgin josh.dur...@inktank.com wrote:

On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:

Hi,


So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
will test tomorrow. Thanks.

Can we pass this to the qemu -drive option?


Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

The normal qemu cache=writeback/writethrough/none option will work in qemu
1.2.

Josh

By the way, is it possible to flush the cache from outside? I may need that
for VM live migration, and such a hook would be helpful.





Re: how to debug slow rbd block device

2012-05-22 Thread Andrey Korolyov
Hi,

I ran into almost the same problem about two months ago, and there are a
couple of corner cases: near-default tcp parameters, small journal size,
disks that are not backed by a controller with NVRAM cache, and high load
on the osd's cpu caused by other processes. Finally, I was able to achieve
115MB/s for large linear writes on a raw rbd block device inside a VM, with
the journal on tmpfs and the osds on RAID0 built on top of three sata disks.
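
For reference, a sketch of the journal settings being described (the tmpfs
path is illustrative; osd journal size is in MB):

[osd]
    osd journal = /dev/shm/osd.$id.journal
    osd journal size = 1024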

On Tue, May 22, 2012 at 4:45 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Hi list,

 my ceph block testcluster is now running fine.

 Setup:
 4x ceph servers
  - 3x mon with /mon on local os SATA disk
  - 4x OSD with /journal on tmpfs and /srv on intel ssd

 all of them use 2x 1Gbit/s lacp trunk.

 1x KVM Host system (2x 1Gbit/s lacp trunk)

 With one KVM guest I do not get more than 40MB/s, and my network link is
 just at 40% of 1Gbit/s.

 Is this expected? If not, where can I start searching / debugging?

 Thanks,
 Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

On 22.05.2012 16:52, Andrey Korolyov wrote:

Hi,

I ran into almost the same problem about two months ago, and there are a
couple of corner cases: near-default tcp parameters, small journal size,
disks that are not backed by a controller with NVRAM cache, and high load
on the osd's cpu caused by other processes. Finally, I was able to achieve
115MB/s for large linear writes on a raw rbd block device inside a VM, with
the journal on tmpfs and the osds on RAID0 built on top of three sata disks.


Which tcp parameters would you recommend? The journal size is 1GB on
tmpfs right now. Instead of 3 sata disks I'm using one intel ssd. The
CPU load is at most 10% on each osd. A 'ceph osd tell X bench' shows me
260MB/s write on each OSD (intel ssd).


Greets
Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
What does your test look like? With multiple large IOs in flight we can 
regularly fill up a 1GbE link on our test clusters. With smaller or fewer IOs 
in flight performance degrades accordingly. 


On Tuesday, May 22, 2012 at 5:45 AM, Stefan Priebe - Profihost AG wrote:

 Hi list,
 
 my ceph block testcluster is now running fine.
 
 Setup:
 4x ceph servers
 - 3x mon with /mon on local os SATA disk
 - 4x OSD with /journal on tmpfs and /srv on intel ssd
 
 all of them use 2x 1Gbit/s lacp trunk.
 
 1x KVM Host system (2x 1Gbit/s lacp trunk)
 
 With one KVM guest I do not get more than 40MB/s, and my network link is
 just at 40% of 1Gbit/s.

 Is this expected? If not, where can I start searching / debugging?
 
 Thanks,
 Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
 Am 22.05.2012 21:35, schrieb Greg Farnum:
  What does your test look like? With multiple large IOs in flight we can 
  regularly fill up a 1GbE link on our test clusters. With smaller or fewer 
  IOs in flight performance degrades accordingly.
 
 
 
 iperf shows 950Mbit/s so this is OK (from KVM host to OSDs)
 
 sorry:
 dd if=/dev/zero of=test bs=4M count=1000; dd if=test of=/dev/null bs=4M 
 count=1000;
 1000+0 records in
 1000+0 records out
 4194304000 bytes (4,2 GB) copied, 99,7352 s, 42,1 MB/s
 
 1000+0 records in
 1000+0 records out
 4194304000 bytes (4,2 GB) copied, 47,4493 s, 88,4 MB/s

Huh. That's less than I would expect. Especially since it ought to be going 
through the page cache.
What version of RBD is KVM using here?

Can you (from the KVM host) run
rados -p data bench seq 60 -t 1
rados -p data bench seq 60 -t 16
and paste the final output from both?



Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

On 22.05.2012 21:52, Greg Farnum wrote:

On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
Huh. That's less than I would expect. Especially since it ought to be going 
through the page cache.
What version of RBD is KVM using here?

v0.47.1


Can you (from the KVM host) run
rados -p data bench seq 60 -t 1
rados -p data bench seq 60 -t 16
and paste the final output from both?

I think you meant:
 rados -p data bench 60 seq -t 1 ?

but even that results in:
~# rados -p data bench 60 seq -t 1
Must write data before running a read benchmark!
error during benchmark: -2
error 2: (2) No such file or directory

Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

On 22.05.2012 21:52, Greg Farnum wrote:

On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
Huh. That's less than I would expect. Especially since it ought to be going 
through the page cache.
What version of RBD is KVM using here?

v0.47.1


Can you (from the KVM host) run
rados -p data bench seq 60 -t 1
rados -p data bench seq 60 -t 16
and paste the final output from both?

OK here it is first with write then with seq read.

# rados -p data bench 60 write -t 1
# rados -p data bench 60 write -t 16
# rados -p data bench 60 seq -t 1
# rados -p data bench 60 seq -t 16

Output is here:
http://pastebin.com/iFy8GS7i

Thanks!

Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 1:30 PM, Stefan Priebe wrote:
 On 22.05.2012 21:52, Greg Farnum wrote:
  On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
  Huh. That's less than I would expect. Especially since it ought to be going 
  through the page cache.
  What version of RBD is KVM using here?
  
  
 v0.47.1
  
  Can you (from the KVM host) run
  rados -p data bench seq 60 -t 1
  rados -p data bench seq 60 -t 16
  and paste the final output from both?
  
  
 OK here it is first with write then with seq read.
  
 # rados -p data bench 60 write -t 1
 # rados -p data bench 60 write -t 16
 # rados -p data bench 60 seq -t 1
 # rados -p data bench 60 seq -t 16
  
 Output is here:
 http://pastebin.com/iFy8GS7i

Heh, yep, sorry about the commands — haven't run them personally in a while. :)

Anyway, it looks like you're just paying a synchronous write penalty, since 
with 1 write at a time you're getting 30-40MB/s out of rados bench, but with 16 
you're getting 100MB/s. (If you bump up past 16 or increase the size of each 
with -b you may find yourself getting even more.)
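
For example (a sketch; 32 concurrent ops and an 8 MB object size are just
illustrative values, and -b is given in bytes):

rados -p data bench 60 write -t 32 -b 8388608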
So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
-Greg



Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

On 22.05.2012 22:48, Mark Nelson wrote:

Can you use something like iostat or collectl to check and see if the
write throughput to each SSD is roughly equal during your tests?
It is, but just around 20-40MB/s. Yet they can write 260MB/s with
sequential writes.
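
For reference, a way to watch per-device throughput during such a test (a
sketch, assuming the sysstat package and that the osd data disk is /dev/sdb):

iostat -xm 1 /dev/sdb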


 Also, what FS are you using and how did you format/mount it?
just:
mkfs.xfs /dev/sdb1
mount options: noatime,nodiratime,nobarrier,discard

Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Stefan Priebe

On 22.05.2012 22:49, Greg Farnum wrote:

Anyway, it looks like you're just paying a synchronous write penalty
What does that mean exactly? Shouldn't a single threaded write to four
260MB/s devices give at least 100MB/s?



since with 1 write at a time you're getting 30-40MB/s out of rados bench, but with 
16 you're getting 100MB/s.
(If you bump up past 16 or increase the size of each with -b you may 
find yourself getting even more.)

yep noticed that.


So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2

will test tomorrow. Thanks.

Stefan


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
 On 22.05.2012 22:49, Greg Farnum wrote:
  Anyway, it looks like you're just paying a synchronous write penalty
  
  
 What does that mean exactly? Shouldn't a single threaded write to four
 260MB/s devices give at least 100MB/s?

Well, with dd you've got a single thread issuing synchronous IO requests to the 
kernel. We could have it set up so that those synchronous requests get split 
up, but they aren't, and between the kernel and KVM it looks like when it needs 
to make a write out to disk it sends one request at a time to the Ceph backend. 
So you aren't writing to four 260MB/s devices; you are writing to one 260MB/s 
device without any pipelining — meaning you send off a 4MB write, then wait 
until it's done, then send off a second 4MB write, then wait until it's done, 
etc.
Frankly I'm surprised you aren't getting a bit more throughput than you're 
seeing (I remember other people getting much more out of less beefy boxes), but 
it doesn't much matter because what you really want to do is enable the 
client-side writeback cache in RBD, which will dispatch multiple requests at 
once and not force writes to be committed before reporting back to the kernel. 
Then you should indeed be writing to four 260MB/s devices at once. :)
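
To see the difference from inside the guest, something like fio with a deeper
queue will keep multiple writes in flight, unlike dd's one-at-a-time pattern
(a sketch; the filename, size and iodepth are illustrative):

fio --name=seqwrite --rw=write --bs=4M --size=4G \
    --ioengine=libaio --direct=1 --iodepth=16 --filename=/mnt/test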

  
  since with 1 write at a time you're getting 30-40MB/s out of rados bench, 
  but with 16 you're getting 100MB/s.
  (If you bump up past 16 or increase the size of each with -b you may  
  
 find yourself getting even more.)
 yep noticed that.
  
  So try enabling RBD writeback caching — see
  http://marc.info/?l=ceph-devel&m=133758599712768&w=2
 will test tomorrow. Thanks.
  
 Stefan  

