Re: poor write performance

2013-04-22 Thread Sylvain Munaut
Hi,

> Unless Sylvain implemented this in his tool
> explicitly, it won't happen there either.

The small bench tool submits requests using the asynchronous API as
fast as possible, using a 1M chunk.
Then it just waits for all the completions to be done.

Sylvain
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-22 Thread Sage Weil
> You may want to try increasing your read_ahead_kb on the OSD data disks and
> see if that helps read speeds.

Jumping into this thread late, so I'm not sure if this was covered, but:

Remember that readahead on the OSDs will only help up to the size of the 
object (4MB).  To get good read performance in general what is really 
needed is for the librbd user to do readahead so that the next object(s)
are being fetched before they are needed.  I don't think this happens with 
'dd' (opening a block device as a file does not trigger the kernel VM 
readahead code, IIRC).  Unless Sylvain implemented this in his tool 
explicitly, it won't happen there either.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-22 Thread Mark Nelson

On 04/22/2013 07:01 AM, Mark Nelson wrote:

On 04/22/2013 06:48 AM, James Harper wrote:

My read speed is consistently around 40MB/second, and my write speed is
consistently around 22MB/second. I had expected better of read...


You may want to try increasing your read_ahead_kb on the OSD data disks
and see if that helps read speeds.



Default appears to be 128 and I was getting 40MB/second
Increasing to 256 takes me up to 48MB/second
Increasing to 512 takes me up to 53MB/second

Any further increases don't do anything that I can measure

Is increasing read_ahead_kb good for general performance, or just for
impressing people with benchmarks? If the kernel spent time reading
ahead would it hurt random read/write performance?


Potentially yes, but it depends on a lot of factors.  I suspect that
increasing it may be acceptable on modern drives, but you'll need to do
some testing to see how it goes in practice.

If anyone on the list knows how many sectors per track is typical for
modern 1-3TB drives I'm dying to know. That would help us guess at how
much data can be written/read on average without imposing any head 
movement. :)



Aha, sorry to reply to my own mail.  I found some specifications for 
Hitachi drives at least:


http://www.hgst.com/tech/techlib.nsf/products/Ultrastar_7K4000

look at section 4.2 of the "Ultrastar 7K4000 OEM Specification" document.

It specifies 310ktpi, or 310,000 tracks/inch.

Via google I found that this drive is using five 800GB platters, meaning 
there are 10 heads in this drive.  Using hitachi's specifications:


(7,814,037,168 sectors / (310,000 tracks/inch * 3.5 inches)) / 10 heads
* 512 bytes/sector = ~360 KB per track per head
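
A quick shell sanity check of the same arithmetic (same figures as above,
result in KB):

# echo "7814037168 / (310000 * 3.5) / 10 * 512 / 1024" | bc -l
360.09...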


So assuming my math is right, it looks like we can read up to around 
360KB of data before hitting a head switch.  Now unfortunately (or maybe 
fortunately!) this is just the average case.  Outer tracks will store 
more data than inner tracks, so depending on what portion of the disk 
you are doing the read from, you might introduce head switches more or 
less often.  It looks like even with a 256k or 512k read_ahead you 
probably won't introduce a next-cylinder seek that often, and from 
what I can find such a seek isn't all that much more expensive than a 
head switch anyway (2-3ms vs 1-2ms).


Mark



Thanks

James





--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-22 Thread Mark Nelson

On 04/22/2013 06:48 AM, James Harper wrote:

My read speed is consistently around 40MB/second, and my write speed is
consistently around 22MB/second. I had expected better of read...


You may want to try increasing your read_ahead_kb on the OSD data disks
and see if that helps read speeds.



Default appears to be 128 and I was getting 40MB/second
Increasing to 256 takes me up to 48MB/second
Increasing to 512 takes me up to 53MB/second

Any further increases don't do anything that I can measure

Is increasing read_ahead_kb good for general performance, or just for 
impressing people with benchmarks? If the kernel spent time reading ahead would 
it hurt random read/write performance?


Potentially yes, but it depends on a lot of factors.  I suspect that 
increasing it may be acceptable on modern drives, but you'll need to do 
some testing to see how it goes in practice.


If anyone on the list knows how many sectors per track is typical for 
modern 1-3TB drives I'm dying to know. That would help us guess at how 
much data can be written/read on average without imposing any head 
movement. :)




Thanks

James



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-22 Thread James Harper
> > My read speed is consistently around 40MB/second, and my write speed is
> > consistently around 22MB/second. I had expected better of read...
> 
> You may want to try increasing your read_ahead_kb on the OSD data disks
> and see if that helps read speeds.
> 

Default appears to be 128 and I was getting 40MB/second
Increasing to 256 takes me up to 48MB/second
Increasing to 512 takes me up to 53MB/second

Any further increases don't do anything that I can measure

Is increasing read_ahead_kb good for general performance, or just for 
impressing people with benchmarks? If the kernel spent time reading ahead would 
it hurt random read/write performance?

Thanks
 
James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-22 Thread James Harper
> > I upgraded to 0.60 and that seems to have made a big difference. If I kill 
> > off
> > one of my OSD's I get around 20MB/second throughput in live testing (test
> > restore of Xen Windows VM from USB backup), which is pretty much the
> > limit of the USB disk. If I reactivate the second OSD throughput drops back 
> > to
> > ~10MB/second which isn't as good but is much better than I was getting.
> >
> 
> Ah, are these disks both connected through USB(2?)?
> 

I guess I was a bit brief :)

Both my OSD disks are SATA attached. Inside a VM I have attached another disk 
which is attached to the host via USB. This disk contains a backup of a server 
(using Windows Server Backup) and I am doing a test restore of it, with ceph 
holding the C: drive of the virtual server (i.e. the write target). What I was 
saying is that I would never expect more than about 20-30MB/s write speed in 
this test because that is approximately the limit of the USB interface that the 
data is coming from. This is more a production test than a benchmark, and I was 
just using iostat to monitor the throughput of the /dev/rbdX interfaces while 
doing the restore.
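
(For reference, that sort of monitoring is just something like "iostat -x 5"
on the client and watching the rbdX rows while the restore runs.)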

James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-22 Thread Mark Nelson

On 04/22/2013 06:34 AM, James Harper wrote:

Hi,


Correct, but that's the theoretical maximum I was referring to. If I calculate

that I should be able to get 50MB/second then 30MB/second is acceptable
but 500KB/second is not :)

I have written a small benchmark for RBD :

https://gist.github.com/smunaut/5433222

It uses the librbd API directly (no kernel client) and queues
requests long in advance, so this should give an "upper" bound on what
you can get at best.
It reads and writes the whole image, so I usually just create a 1 or 2
G image for testing.

Using two OSDs on two distinct recent 7200rpm drives (with journal on
the same disk as data), I get :

Read: 89.52 Mb/s (2147483648 bytes in 22877 ms)
Write: 10.62 Mb/s (2147483648 bytes in 192874 ms)



I like your benchmark tool!

How many replicas? With two OSD's with xfs on ~3yo 1TB disks with two replicas 
I get:

# ./a.out admin xen test
Read: 111.99 Mb/s (1073741824 bytes in 9144 ms)
Write: 29.68 Mb/s (1073741824 bytes in 34507 ms)

Which means I forgot to drop caches on the OSD's so I'm seeing the limit on my 
public network (single gigabit interface). After dropping caches I consistently 
get:

# ./a.out admin xen test
Read: 39.98 Mb/s (1073741824 bytes in 25614 ms)
Write: 23.11 Mb/s (1073741824 bytes in 44316 ms)

Journal is on the same disk. Network is... confusing :) but is basically public 
on a single gigabit and cluster on a bonded pair of gigabit links. The whole 
network thing is shared with my existing drbd cluster so performance may vary 
over time.

My read speed is consistently around 40MB/second, and my write speed is 
consistently around 22MB/second. I had expected better of read...


You may want to try increasing your read_ahead_kb on the OSD data disks 
and see if that helps read speeds.
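
(That's just a sysfs knob; assuming the OSD data disk is /dev/sdb, something
like:

# cat /sys/block/sdb/queue/read_ahead_kb
128
# echo 512 > /sys/block/sdb/queue/read_ahead_kb

though it won't survive a reboot without a udev rule or an rc.local entry.)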




While running, iostat on each osd reports a read rate of around 20MB/second 
(1/2 total on each) during read test and a rate of 40-60MB/second (~2x total on 
each) during write test, which is pretty much exactly right.

iperf on the cluster network (pair of gigabits bonded) gives me about 
1.97Gbits/second. iperf between osd and client is around 0.94Gbits/second.

Changing the scheduler on the hard disk doesn't seem to make any difference, 
even when I set it to cfq, which normally really sucks.

What ceph version are you using and what filesystem?

Thanks

James



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-22 Thread James Harper
> Hi,
> 
> > Correct, but that's the theoretical maximum I was referring to. If I 
> > calculate
> that I should be able to get 50MB/second then 30MB/second is acceptable
> but 500KB/second is not :)
> 
> I have written a small benchmark for RBD :
> 
> https://gist.github.com/smunaut/5433222
> 
> It uses the librbd API directly (no kernel client) and queues
> requests long in advance, so this should give an "upper" bound on what
> you can get at best.
> It reads and writes the whole image, so I usually just create a 1 or 2
> G image for testing.
> 
> Using two OSDs on two distinct recent 7200rpm drives (with journal on
> the same disk as data), I get :
> 
> Read: 89.52 Mb/s (2147483648 bytes in 22877 ms)
> Write: 10.62 Mb/s (2147483648 bytes in 192874 ms)
> 

I like your benchmark tool!

How many replicas? With two OSD's with xfs on ~3yo 1TB disks with two replicas 
I get:

# ./a.out admin xen test
Read: 111.99 Mb/s (1073741824 bytes in 9144 ms)
Write: 29.68 Mb/s (1073741824 bytes in 34507 ms)

Which means I forgot to drop caches on the OSD's so I'm seeing the limit on my 
public network (single gigabit interface). After dropping caches I consistently 
get:

# ./a.out admin xen test
Read: 39.98 Mb/s (1073741824 bytes in 25614 ms)
Write: 23.11 Mb/s (1073741824 bytes in 44316 ms)

Journal is on the same disk. Network is... confusing :) but is basically public 
on a single gigabit and cluster on a bonded pair of gigabit links. The whole 
network thing is shared with my existing drbd cluster so performance may vary 
over time.

My read speed is consistently around 40MB/second, and my write speed is 
consistently around 22MB/second. I had expected better of read...

While running, iostat on each osd reports a read rate of around 20MB/second 
(1/2 total on each) during read test and a rate of 40-60MB/second (~2x total on 
each) during write test, which is pretty much exactly right.

iperf on the cluster network (pair of gigabits bonded) gives me about 
1.97Gbits/second. iperf between osd and client is around 0.94Gbits/second.

Changing the scheduler on the hard disk doesn't seem to make any difference, 
even when I set it to cfq, which normally really sucks.

What ceph version are you using and what filesystem?

Thanks

James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-22 Thread Mark Nelson

On 04/22/2013 12:32 AM, James Harper wrote:


On 04/19/2013 08:30 PM, James Harper wrote:

rados -p  -b 4096 bench 300 seq -t 64


sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   0   0 0 0 0 0 - 0
read got -2
error during benchmark: -5
error 5: (5) Input/output error

not sure what that's about...



Oops... I typo'd --no-cleanup. Now I get:

 sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   0   0 0 0 0 0 - 0
   Total time run:0.243709
Total reads made: 1292
Read size:4096
Bandwidth (MB/sec):20.709

Average Latency:   0.0118838
Max latency:   0.031942
Min latency:   0.001445

So it finishes instantly without seeming to do much actual testing...


My bad.  I forgot to tell you to do a sync/flush on the OSDs after the
write test.  All of those reads are probably coming from pagecache.  The
good news is that this is demonstrating that reading 4k objects from
pagecache isn't insanely bad on your setup (for larger sustained loads I
see 4k object reads from pagecache hit up to around 100MB/s with
multiple clients on my test nodes).

On your OSD nodes try:

sync
echo 3 > /proc/sys/vm/drop_caches

right before you run the read test.



I tell it to test for 300 seconds and it tests for 0 seconds so I must be doing 
something else wrong.



It will try to read for up to 300 seconds, but if it runs out of data it 
stops.  Since you only wrote out something like 1300 4k objects, and you 
were reading at 20+MB/s, the test ran for under a second.



Whatever issue you are facing is probably down at the filestore level or
possibly lower down yet.

How do your drives benchmark with something like fio doing random 4k
writes?  Are your drives dedicated for ceph?  What filesystem?  Also
what is the journal device you are using?



Drives are dedicated for ceph. I originally put my journals on /, but that was 
ext3 and my throughput went down even further so the journal shares the osd 
disk for now.

I upgraded to 0.60 and that seems to have made a big difference. If I kill off 
one of my OSD's I get around 20MB/second throughput in live testing (test 
restore of Xen Windows VM from USB backup), which is pretty much the limit of 
the USB disk. If I reactivate the second OSD throughput drops back to 
~10MB/second which isn't as good but is much better than I was getting.



Ah, are these disks both connected through USB(2?)?


Thanks

James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-22 Thread Sylvain Munaut
Hi,


> Correct, but that's the theoretical maximum I was referring to. If I 
> calculate that I should be able to get 50MB/second then 30MB/second is 
> acceptable but 500KB/second is not :)

I have written a small benchmark for RBD :

https://gist.github.com/smunaut/5433222

It uses the librbd API directly (no kernel client) and queues
requests long in advance, so this should give an "upper" bound on what
you can get at best.
It reads and writes the whole image, so I usually just create a 1 or 2
G image for testing.
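
Building and running it is roughly as follows (file name and compile flags are
assumptions; it links against librbd and librados, and takes the client id,
pool and image name as arguments):

# gcc -O2 -o rbd_bench rbd_bench.c -lrbd -lrados
# ./rbd_bench admin <pool> <image>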

Using two OSDs on two distinct recent 7200rpm drives (with journal on
the same disk as data), I get :

Read: 89.52 Mb/s (2147483648 bytes in 22877 ms)
Write: 10.62 Mb/s (2147483648 bytes in 192874 ms)


The raw disk does about 45 MB/s when written in 1M chunks. But when
written in 4k chunks, this falls to ~500 kB/s ...

# dd if=/dev/zero of=/dev/xen-disks/test bs=1M oflag=direct
2049+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 49.3943 s, 43.5 MB/s

# dd if=/dev/zero of=/dev/xen-disks/test bs=4k oflag=direct
^C61667+0 records in
61667+0 records out
252588032 bytes (253 MB) copied, 539.123 s, 469 kB/s


Cheers,

Sylvain
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-21 Thread James Harper
> 
> On 04/19/2013 08:30 PM, James Harper wrote:
> >>> rados -p  -b 4096 bench 300 seq -t 64
> >>
> >> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >>   0   0 0 0 0 0 - 0
> >> read got -2
> >> error during benchmark: -5
> >> error 5: (5) Input/output error
> >>
> >> not sure what that's about...
> >>
> >
> > Oops... I typo'd --no-cleanup. Now I get:
> >
> > sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >   0   0 0 0 0 0 - 0
> >   Total time run:0.243709
> > Total reads made: 1292
> > Read size:4096
> > Bandwidth (MB/sec):20.709
> >
> > Average Latency:   0.0118838
> > Max latency:   0.031942
> > Min latency:   0.001445
> >
> > So it finishes instantly without seeming to do much actual testing...
> 
> My bad.  I forgot to tell you to do a sync/flush on the OSDs after the
> write test.  All of those reads are probably coming from pagecache.  The
> good news is that this is demonstrating that reading 4k objects from
> pagecache isn't insanely bad on your setup (for larger sustained loads I
> see 4k object reads from pagecache hit up to around 100MB/s with
> multiple clients on my test nodes).
> 
> On your OSD nodes try:
> 
> sync
> echo 3 > /proc/sys/vm/drop_caches
> 
> right before you run the read test.
> 

I tell it to test for 300 seconds and it tests for 0 seconds so I must be doing 
something else wrong.

> Whatever issue you are facing is probably down at the filestore level or
> possibly lower down yet.
> 
> How do your drives benchmark with something like fio doing random 4k
> writes?  Are your drives dedicated for ceph?  What filesystem?  Also
> what is the journal device you are using?
> 

Drives are dedicated for ceph. I originally put my journals on /, but that was 
ext3 and my throughput went down even further so the journal shares the osd 
disk for now.

I upgraded to 0.60 and that seems to have made a big difference. If I kill off 
one of my OSD's I get around 20MB/second throughput in live testing (test 
restore of Xen Windows VM from USB backup), which is pretty much the limit of 
the USB disk. If I reactivate the second OSD throughput drops back to 
~10MB/second which isn't as good but is much better than I was getting.

Thanks

James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-21 Thread James Harper
> Hi,
> 
> > My goal is 4 OSD's, each on separate machines, with 1 drive in each for a
> start, but I want to see performance of at least the same order of magnitude
> as the theoretical maximum on my hardware before I think about replacing
> my existing setup.
> 
> My current understanding is that it's not even possible; you always
> have a minimum 2-3x slowdown in the best case.
> 
> If you do a sustained sequential write benchmark and have a single
> drive, then that drive ends up writing the data twice (journal + final
> storage area), which with the seeks will more than halve the peak
> perf of the drive. And since it's sequential, it will only write to 1
> PG at a time (so the load is not divided among several OSDs).
> 
> Also AFAIU the OSD receiving the data will also have to send the data
> to the other OSDs in the PG and wait for them to say everything is
> written before confirming the write, which slows it even more.
> 

Correct, but that's the theoretical maximum I was referring to. If I calculate 
that I should be able to get 50MB/second then 30MB/second is acceptable but 
500KB/second is not :)

James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-21 Thread Sylvain Munaut
Hi,


> My goal is 4 OSD's, each on separate machines, with 1 drive in each for a 
> start, but I want to see performance of at least the same order of magnitude 
> as the theoretical maximum on my hardware before I think about replacing my 
> existing setup.

My current understanding is that it's not even possible; you always
have a minimum 2-3x slowdown in the best case.

If you do a sustained sequential write benchmark and have a single
drive, then that drive ends up writing the data twice (journal + final
storage area), which with the seeks will more than halve the peak
perf of the drive. And since it's sequential, it will only write to 1
PG at a time (so the load is not divided among several OSDs).

Also AFAIU the OSD receiving the data will also have to send the data
to the other OSDs in the PG and wait for them to say everything is
written before confirming the write, which slows it even more.


Cheers,

Sylvain
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-21 Thread Mark Nelson

On 04/19/2013 08:30 PM, James Harper wrote:

rados -p  -b 4096 bench 300 seq -t 64


sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  0   0 0 0 0 0 - 0
read got -2
error during benchmark: -5
error 5: (5) Input/output error

not sure what that's about...



Oops... I typo'd --no-cleanup. Now I get:

sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  0   0 0 0 0 0 - 0
  Total time run:0.243709
Total reads made: 1292
Read size:4096
Bandwidth (MB/sec):20.709

Average Latency:   0.0118838
Max latency:   0.031942
Min latency:   0.001445

So it finishes instantly without seeming to do much actual testing...


My bad.  I forgot to tell you to do a sync/flush on the OSDs after the 
write test.  All of those reads are probably coming from pagecache.  The 
good news is that this is demonstrating that reading 4k objects from 
pagecache isn't insanely bad on your setup (for larger sustained loads I 
see 4k object reads from pagecache hit up to around 100MB/s with 
multiple clients on my test nodes).


On your OSD nodes try:

sync
echo 3 > /proc/sys/vm/drop_caches

right before you run the read test.

Whatever issue you are facing is probably down at the filestore level or 
possibly lower down yet.


How do your drives benchmark with something like fio doing random 4k 
writes?  Are your drives dedicated for ceph?  What filesystem?  Also 
what is the journal device you are using?
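
A minimal fio job for that would be something along these lines (path, size
and queue depth are just placeholders):

# fio --name=randwrite --rw=randwrite --bs=4k --size=1g --direct=1 \
      --ioengine=libaio --iodepth=16 --filename=/path/on/osd/disk/testfile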


Mark



James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-20 Thread Jeff Mitchell

James Harper wrote:

Hi James,

do you have VLAN interfaces configured on your bonding interfaces? Because
I saw a similar situation in my setup.



No VLAN's on my bonding interface, although extensively used elsewhere.


What the OP described is *exactly* like a problem I've been struggling 
with. I thought the blame lay elsewhere but maybe not.


My setup:

4 Ceph nodes, with 6 OSDs each and dual (bonded) 10GbE, with VLANs, 
running Precise. OSDs are using XFS. Replica count of 3. 3 of these are 
mons.
4 compute nodes, with dual (bonded) 10GbE, with VLANs, running a base of 
Precise along with a 3.6.3 Ceph-provided kernel, running KVM-based VMs. 
2 of these are also mons. VMs are Precise and accessing RBD through the 
kernel client.


(Eventually there will be 12 Ceph nodes. 5 mons seemed an appropriate 
number and when I've run into issues in the past I've actually gotten to 
cases where > 3 mons were knocked out, so 5 is a comfortable number 
unless it's problematic.)


In the VMs, I/O with ext4 is fine -- 10-15MB/s sustained. However, using 
ZFS (via ZFSonLinux, not FUSE), I see write speeds of about 150kb/sec, 
just like the OP.


I had figured that the problem lay with ZFS inside the VM (I've used 
ZFSonLinux on many bare metal machines without a problem for a couple of 
years now). The VMs were using virtio, and I'd heard that it was found 
that pre-1.4 Qemu versions could have some serious problems with virtio 
(which I didn't know at the time); also, I know that the kernel client 
is not the preferred client, and the version I'm using is a rather older 
version of the Ceph-provided builds. As a result, my plan was to try the 
updated Qemu version along with native Qemu librados RBD support once 
Raring was out, as I figured that the problem was either something in 
ZFSonLinux (though I reported the issue and nobody had ever heard of any 
such problem, or had any idea why it would be happening) or something 
specifically about ZFS running inside Qemu, as ext4 in the VMs is fine.


But, this thread has made me wonder if what's actually happening is in 
fact something else -- either something, as someone else saw, to do with 
using VLANs on the bonded interface (although I don't see such a write 
problem with any other traffic going through these VLANs); or, something 
about how ZFS inside the VM is writing to the RBD disk causing some kind 
of giant slowdown in Ceph. The numbers that the OP cited were exactly in 
line with what I was seeing.


I don't know offhand what the block sizes are that the kernel client was 
using, or that the different filesystems inside the VMs might be using 
when trying to write to their virtual disks (I'm guessing that if you 
are using virtio, as I am, it potentially could be anything). But 
perhaps ZFS writes extremely small blocks and ext4 doesn't.


Unfortunately, I don't have access to this testbed for the next few 
weeks, so for the moment I can only recount my experience and not 
actually test out any suggestions (unless I can corral someone with 
access to it to run tests).


Thanks,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-20 Thread James Harper
> 
> Hi James,
> 
> do you have VLAN interfaces configured on your bonding interfaces? Because
> I saw a similar situation in my setup.
> 

No VLAN's on my bonding interface, although extensively used elsewhere.

Thanks

James


Re: poor write performance

2013-04-20 Thread Harald Rößler
Hi James,

do you have VLAN interfaces configured on your bonding interfaces? Because
I saw a similar situation in my setup.

Kind Regards
Harald Roessler


On Fri, 2013-04-19 at 01:11 +0200, James Harper wrote:
> > 
> > Hi James,
> > 
> > This is just pure speculation, but can you assure that the bonding works
> > correctly? Maybe you have issues there. I have seen a lot of incorrectly
> > configured bonding throughout my life as unix admin.
> > 
> 
> The bonding gives me iperf performance consistent with 2 x 1GB links so I 
> think it's okay.
> 
> James
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Mit freundlichen Grüßen,
Harald Rößler
 
. . . . . . . . . . . . . . . .
 
BTD System GmbH
Tel.: +49 (89) - 20 05 - 44 30
Tel.: +49 (89) - 660 291 - 251
Mob.: +49 (151) - 11 70 17 59
Fax:  +49 (89) 89 - 20 05 - 44 11
harald.roess...@btd.de
www.btd.de
Projektbüro Allianz-Arena • Ebene 4
Werner-Heisenberg-Allee 25 • D-80939 München 
Goethestraße 34 • D-80336 München
 
HRB München 154370
Geschäftsführer: Stefan Leibhard, Kersten Kröhl, Harald Rößler
 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 CONFIDENTIALITY NOTICE
 
This communication contains information which is confidential and may
also be privileged. It is for the exclusive use of the intended
recipient(s). If you are not the intended recipient(s), please note that
any distribution, copying or use of this communication or the
information in it is strictly  prohibited. If you have received this
communication in error, please notify us  immediately by telephone on
+49 (0) 89 - 20 05 - 44 00 and then destroy the  email and any copies of
it. This communication is from BTD System GmbH whose  office is at
Werner-Heisenberg-Allee 25, D-80939 München.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
.


RE: poor write performance

2013-04-19 Thread James Harper
> > rados -p  -b 4096 bench 300 seq -t 64
> 
> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>  0   0 0 0 0 0 - 0
> read got -2
> error during benchmark: -5
> error 5: (5) Input/output error
> 
> not sure what that's about...
> 

Oops... I typo'd --no-cleanup. Now I get:

   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
 Total time run:0.243709
Total reads made: 1292
Read size:4096
Bandwidth (MB/sec):20.709

Average Latency:   0.0118838
Max latency:   0.031942
Min latency:   0.001445

So it finishes instantly without seeming to do much actual testing...

James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-19 Thread James Harper
> 
> On 04/19/2013 06:09 AM, James Harper wrote:
> > I just tried a 3.8 series kernel and can now get 25mbytes/second using dd
> with a 4mb block size, instead of the 700kbytes/second I was getting with the
> debian 3.2 kernel.
> 
> That's unexpected.  Was this the kernel on the client, the OSDs, or
> both?

Kernel on the client. I can't easily change the kernel on the OSD's although if 
you think it will make a big difference I can arrange it.

> >
> > I'm still getting 120kbytes/second with a dd 4kb block size though... is 
> > that
> expected?
> 
> that's still quite a bit lower than I'd expect as well.  What were your
> fs mount options on the OSDs?

I didn't explicitly set any, so I guess these are the defaults:

xfs (rw,noatime,attr2,delaylog,inode64,noquota)

> Can you try some rados bench read/write
> tests on your pool?  Something like:
> 
> rados -p  -b 4096 bench 300 write --no-cleanup -t 64

Ah. It's the --no-cleanup that explains why my previous seq tests didn't work!

Total time run: 300.430516
Total writes made:  26726
Write size: 4096
Bandwidth (MB/sec): 0.347

Stddev Bandwidth:   0.322983
Max bandwidth (MB/sec): 1.34375
Min bandwidth (MB/sec): 0
Average Latency:0.719337
Stddev Latency: 0.985265
Max latency:7.2241
Min latency:0.018218

But then it just hung and I had to hit ctrl-c

What is the unit of measure for latency and for write size?

> rados -p  -b 4096 bench 300 seq -t 64

sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
read got -2
error during benchmark: -5
error 5: (5) Input/output error

not sure what that's about...

> 
> with 2 drives and 2x replication I wouldn't expect much without RBD
> cache, but 120kb/s is rather excessively bad. :)
> 

What is rbd cache? I've seen it mentioned but haven't found documentation for 
it anywhere...

My goal is 4 OSD's, each on separate machines, with 1 drive in each for a 
start, but I want to see performance of at least the same order of magnitude as 
the theoretical maximum on my hardware before I think about replacing my 
existing setup.

Thanks

James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-19 Thread Mark Nelson

On 04/19/2013 06:09 AM, James Harper wrote:

I just tried a 3.8 series kernel and can now get 25mbytes/second using dd with 
a 4mb block size, instead of the 700kbytes/second I was getting with the debian 
3.2 kernel.


That's unexpected.  Was this the kernel on the client, the OSDs, or 
both?




I'm still getting 120kbytes/second with a dd 4kb block size though... is that 
expected?


that's still quite a bit lower than I'd expect as well.  What were your 
fs mount options on the OSDs?  Can you try some rados bench read/write 
tests on your pool?  Something like:


rados -p  -b 4096 bench 300 write --no-cleanup -t 64
rados -p  -b 4096 bench 300 seq -t 64

with 2 drives and 2x replication I wouldn't expect much without RBD 
cache, but 120kb/s is rather excessively bad. :)




James



Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-19 Thread James Harper
I just tried a 3.8 series kernel and can now get 25mbytes/second using dd with 
a 4mb block size, instead of the 700kbytes/second I was getting with the debian 
3.2 kernel.

I'm still getting 120kbytes/second with a dd 4kb block size though... is that 
expected?

James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-19 Thread James Harper
> 
> I did an strace -c to gather some performance info, if that helps:
> 

Oops. Forgot to say that that's an strace -c of the osd process!
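
(Collected roughly with something like "strace -c -f -p <ceph-osd pid>" for
the duration of the test window, then interrupted to print the summary below.)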

> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  78.13   39.589549        2750     14398       967 futex
>  12.45    6.308784        4200      1502           poll
>   7.99    4.048253      224903        18         9 restart_syscall
>   0.65    0.331042         635       521           writev
>   0.34    0.172011       57337         3           SYS_344
>   0.22    0.110395         117       944           close
>   0.08    0.040002         310       129           truncate64
>   0.07    0.036003       12001         3           fsync
>   0.02    0.010611           1     10263           gettimeofday
>   0.02    0.008000        1333         6           pwrite64
>   0.01    0.004941           9       521           fsetxattr
>   0.01    0.004256          33       129           sync_file_range
>   0.01    0.002779           1      3660       814 stat64
>   0.00    0.001775           4       442           sendmsg
>   0.00    0.001266           1      1507           recv
>   0.00    0.001103           1       948         4 open
>   0.00    0.000640           1       979           time
>   0.00    0.000493           1       409           clock_gettime
>   0.00    0.000375           1       522           _llseek
>   0.00    0.000111          11        10           read
>   0.00    0.000000           0         1           setxattr
>   0.00    0.000000           0         1           getxattr
>   0.00    0.000000           0        32         8 fgetxattr
>   0.00    0.000000           0         5           statfs64
>   0.00    0.000000           0         5         5 fallocate
> ------ ----------- ----------- --------- --------- ----------------
> 100.00   50.672389                 36958      1807 total
> 
> Does that look about what you'd expect?
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-19 Thread James Harper
> 
> > > Where should I start looking for performance problems? I've tried
> running
> > > some of the benchmark stuff in the documentation but I haven't gotten
> very
> > > far...
> >
> > Hi James!  Sorry to hear about the performance trouble!  Is it just
> > sequential 4KB direct IO writes that are giving you troubles?  If you
> > are using the kernel version of RBD, we don't have any kind of cache
> > implemented there and since you are bypassing the pagecache on the
> > client, those writes are being sent to the different OSDs in 4KB chunks
> > over the network.  RBD stores data in blocks that are represented by 4MB
> > objects on one of the OSDs, so without cache a lot of sequential 4KB
> > writes will be hitting 1 OSD repeatedly and then moving on to the next
> > one.  Hopefully those writes would get aggregated at the OSD level, but
> > clearly that's not really happening here given your performance.
> 
> Using dd I tried various block sizes. With 4kb I was getting around
> 500kbytes/second rate. With 1MB I was getting a few mbytes/second. Read
> performance seems great though.
> 

I did an strace -c to gather some performance info, if that helps:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.13   39.589549        2750     14398       967 futex
 12.45    6.308784        4200      1502           poll
  7.99    4.048253      224903        18         9 restart_syscall
  0.65    0.331042         635       521           writev
  0.34    0.172011       57337         3           SYS_344
  0.22    0.110395         117       944           close
  0.08    0.040002         310       129           truncate64
  0.07    0.036003       12001         3           fsync
  0.02    0.010611           1     10263           gettimeofday
  0.02    0.008000        1333         6           pwrite64
  0.01    0.004941           9       521           fsetxattr
  0.01    0.004256          33       129           sync_file_range
  0.01    0.002779           1      3660       814 stat64
  0.00    0.001775           4       442           sendmsg
  0.00    0.001266           1      1507           recv
  0.00    0.001103           1       948         4 open
  0.00    0.000640           1       979           time
  0.00    0.000493           1       409           clock_gettime
  0.00    0.000375           1       522           _llseek
  0.00    0.000111          11        10           read
  0.00    0.000000           0         1           setxattr
  0.00    0.000000           0         1           getxattr
  0.00    0.000000           0        32         8 fgetxattr
  0.00    0.000000           0         5           statfs64
  0.00    0.000000           0         5         5 fallocate
------ ----------- ----------- --------- --------- ----------------
100.00   50.672389                 36958      1807 total

Does that look about what you'd expect?

James
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-18 Thread James Harper
> > Where should I start looking for performance problems? I've tried running
> > some of the benchmark stuff in the documentation but I haven't gotten very
> > far...
> 
> Hi James!  Sorry to hear about the performance trouble!  Is it just
> sequential 4KB direct IO writes that are giving you troubles?  If you
> are using the kernel version of RBD, we don't have any kind of cache
> implemented there and since you are bypassing the pagecache on the
> client, those writes are being sent to the different OSDs in 4KB chunks
> over the network.  RBD stores data in blocks that are represented by 4MB
> objects on one of the OSDs, so without cache a lot of sequential 4KB
> writes will be hitting 1 OSD repeatedly and then moving on to the next
> one.  Hopefully those writes would get aggregated at the OSD level, but
> clearly that's not really happening here given your performance.

Using dd I tried various block sizes. With 4kb I was getting around 
500kbytes/second rate. With 1MB I was getting a few mbytes/second. Read 
performance seems great though.

> Here's a couple of thoughts:
> 
> 1) If you are working with VMs, using the QEMU/KVM interface with virtio
> drivers and RBD cache enabled will give you a huge jump in small
> sequential write performance relative to what you are seeing now.

I'm using Xen so that won't work for me right now, although I did notice 
someone posted some blktap code to support ceph.

I'm trying a windows restore of a physical machine into a VM under Xen and 
performance matches what I am seeing with dd - very very slow.

> 2) You may want to try upgrading to 0.60.  We made a change to how the
> pg_log works that causes fewer disk seeks during small IO, especially
> with XFS.

Do packages for this exist for Debian? At the moment my sources.list contains 
"ceph.com/debian-bobtail wheezy main".

> 3) If you are still having trouble, testing your network, disk speeds,
> and using rados bench to test the object store all may be helpful.
> 

I tried that and while the write worked the seq test always said I had to do a 
write test first.

While running my Xen restore, /var/log/ceph/ceph.log looks like:

pgmap v18316: 832 pgs: 832 active+clean; 61443 MB data, 119 GB used, 1742 GB / 
1862 GB avail; 824KB/s wr, 12op/s
pgmap v18317: 832 pgs: 832 active+clean; 61446 MB data, 119 GB used, 1742 GB / 
1862 GB avail; 649KB/s wr, 10op/s
pgmap v18318: 832 pgs: 832 active+clean; 61449 MB data, 119 GB used, 1742 GB / 
1862 GB avail; 652KB/s wr, 10op/s
pgmap v18319: 832 pgs: 832 active+clean; 61452 MB data, 119 GB used, 1742 GB / 
1862 GB avail; 614KB/s wr, 9op/s
pgmap v18320: 832 pgs: 832 active+clean; 61454 MB data, 119 GB used, 1742 GB / 
1862 GB avail; 537KB/s wr, 8op/s
pgmap v18321: 832 pgs: 832 active+clean; 61457 MB data, 119 GB used, 1742 GB / 
1862 GB avail; 511KB/s wr, 7op/s

James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: poor write performance

2013-04-18 Thread James Harper
> 
> Hi James,
> 
> This is just pure speculation, but can you assure that the bonding works
> correctly? Maybe you have issues there. I have seen a lot of incorrectly
> configured bonding throughout my life as unix admin.
> 

The bonding gives me iperf performance consistent with 2 x 1GB links so I think 
it's okay.
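
For reference, an iperf run that exercises both links of a bond is something
like (host name assumed):

# iperf -s                          (on one node)
# iperf -c <other-node> -P 2 -t 30  (on the other)

with -P 2 giving the traffic a chance to spread across both slaves.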

James
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-18 Thread Mark Nelson

On 04/18/2013 11:46 AM, Andrey Korolyov wrote:

On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson  wrote:

On 04/18/2013 06:46 AM, James Harper wrote:


I'm doing some basic testing so I'm not really fussed about poor
performance, but my write performance appears to be so bad I think I'm doing
something wrong.

Using dd to test gives me kbytes/second for write performance for 4kb
block sizes, while read performance is acceptable (for testing at least).
For dd I'm using iflag=direct for read and oflag=direct for write testing.

My setup, approximately, is:

Two OSD's
. 1 x 7200RPM SATA disk each
. 2 x gigabit cluster network interfaces each in a bonded configuration
directly attached (osd to osd, no switch)
. 1 x gigabit public network
. journal on another spindle

Three MON's
. 1 each on the OSD's
. 1 on another server, which is also the one used for testing performance

I'm using debian packages from ceph which are version 0.56.4

For comparison, my existing production storage is 2 servers running DRBD
with iSCSI to the initiators which run Xen on top of (C)LVM volumes on top
of the iSCSI. Performance not spectacular but acceptable. The servers in
question are the same specs as the servers I'm testing on.

Where should I start looking for performance problems? I've tried running
some of the benchmark stuff in the documentation but I haven't gotten very
far...



Hi James!  Sorry to hear about the performance trouble!  Is it just
sequential 4KB direct IO writes that are giving you troubles?  If you are
using the kernel version of RBD, we don't have any kind of cache implemented
there and since you are bypassing the pagecache on the client, those writes
are being sent to the different OSDs in 4KB chunks over the network.  RBD
stores data in blocks that are represented by 4MB objects on one of the
OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD
repeatedly and then moving on to the next one.  Hopefully those writes would
get aggregated at the OSD level, but clearly that's not really happening
here given your performance.

Here's a couple of thoughts:

1) If you are working with VMs, using the QEMU/KVM interface with virtio
drivers and RBD cache enabled will give you a huge jump in small sequential
write performance relative to what you are seeing now.

2) You may want to try upgrading to 0.60.  We made a change to how the
pg_log works that causes fewer disk seeks during small IO, especially with
XFS.


Can you point into related commits, if possible?


here you go:

http://tracker.ceph.com/projects/ceph/repository/revisions/188f3ea6867eeb6e950f6efed18d53ff17522bbc






3) If you are still having trouble, testing your network, disk speeds, and
using rados bench to test the object store all may be helpful.



Thanks

James



Good luck!




--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-18 Thread Andrey Korolyov
On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson  wrote:
> On 04/18/2013 06:46 AM, James Harper wrote:
>>
>> I'm doing some basic testing so I'm not really fussed about poor
>> performance, but my write performance appears to be so bad I think I'm doing
>> something wrong.
>>
>> Using dd to test gives me kbytes/second for write performance for 4kb
>> block sizes, while read performance is acceptable (for testing at least).
>> For dd I'm using iflag=direct for read and oflag=direct for write testing.
>>
>> My setup, approximately, is:
>>
>> Two OSD's
>> . 1 x 7200RPM SATA disk each
>> . 2 x gigabit cluster network interfaces each in a bonded configuration
>> directly attached (osd to osd, no switch)
>> . 1 x gigabit public network
>> . journal on another spindle
>>
>> Three MON's
>> . 1 each on the OSD's
>> . 1 on another server, which is also the one used for testing performance
>>
>> I'm using debian packages from ceph which are version 0.56.4
>>
>> For comparison, my existing production storage is 2 servers running DRBD
>> with iSCSI to the initiators which run Xen on top of (C)LVM volumes on top
>> of the iSCSI. Performance not spectacular but acceptable. The servers in
>> question are the same specs as the servers I'm testing on.
>>
>> Where should I start looking for performance problems? I've tried running
>> some of the benchmark stuff in the documentation but I haven't gotten very
>> far...
>
>
> Hi James!  Sorry to hear about the performance trouble!  Is it just
> sequential 4KB direct IO writes that are giving you troubles?  If you are
> using the kernel version of RBD, we don't have any kind of cache implemented
> there and since you are bypassing the pagecache on the client, those writes
> are being sent to the different OSDs in 4KB chunks over the network.  RBD
> stores data in blocks that are represented by 4MB objects on one of the
> OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD
> repeatedly and then moving on to the next one.  Hopefully those writes would
> get aggregated at the OSD level, but clearly that's not really happening
> here given your performance.
>
> Here's a couple of thoughts:
>
> 1) If you are working with VMs, using the QEMU/KVM interface with virtio
> drivers and RBD cache enabled will give you a huge jump in small sequential
> write performance relative to what you are seeing now.
>
> 2) You may want to try upgrading to 0.60.  We made a change to how the
> pg_log works that causes fewer disk seeks during small IO, especially with
> XFS.

Can you point into related commits, if possible?

>
> 3) If you are still having trouble, testing your network, disk speeds, and
> using rados bench to test the object store all may be helpful.
>
>>
>> Thanks
>>
>> James
>
>
> Good luck!
>
>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-18 Thread Mark Nelson

On 04/18/2013 06:46 AM, James Harper wrote:

I'm doing some basic testing so I'm not really fussed about poor performance, 
but my write performance appears to be so bad I think I'm doing something wrong.

Using dd to test gives me kbytes/second for write performance for 4kb block 
sizes, while read performance is acceptable (for testing at least). For dd I'm 
using iflag=direct for read and oflag=direct for write testing.

My setup, approximately, is:

Two OSD's
. 1 x 7200RPM SATA disk each
. 2 x gigabit cluster network interfaces each in a bonded configuration 
directly attached (osd to osd, no switch)
. 1 x gigabit public network
. journal on another spindle

Three MON's
. 1 each on the OSD's
. 1 on another server, which is also the one used for testing performance

I'm using debian packages from ceph which are version 0.56.4

For comparison, my existing production storage is 2 servers running DRBD with 
iSCSI to the initiators which run Xen on top of (C)LVM volumes on top of the 
iSCSI. Performance not spectacular but acceptable. The servers in question are 
the same specs as the servers I'm testing on.

Where should I start looking for performance problems? I've tried running some 
of the benchmark stuff in the documentation but I haven't gotten very far...


Hi James!  Sorry to hear about the performance trouble!  Is it just 
sequential 4KB direct IO writes that are giving you troubles?  If you 
are using the kernel version of RBD, we don't have any kind of cache 
implemented there and since you are bypassing the pagecache on the 
client, those writes are being sent to the different OSDs in 4KB chunks 
over the network.  RBD stores data in blocks that are represented by 4MB 
objects on one of the OSDs, so without cache a lot of sequential 4KB 
writes will be hitting 1 OSD repeatedly and then moving on to the next 
one.  Hopefully those writes would get aggregated at the OSD level, but 
clearly that's not really happening here given your performance.


Here's a couple of thoughts:

1) If you are working with VMs, using the QEMU/KVM interface with virtio 
drivers and RBD cache enabled will give you a huge jump in small 
sequential write performance relative to what you are seeing now.
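
As a rough sketch (pool and image names here are made up), that means enabling
the cache on the client side in ceph.conf:

[client]
    rbd cache = true

and attaching the image through librbd instead of the kernel /dev/rbdX
device, e.g.:

qemu-system-x86_64 ... -drive file=rbd:rbd/vm1,if=virtio,cache=writeback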


2) You may want to try upgrading to 0.60.  We made a change to how the 
pg_log works that causes fewer disk seeks during small IO, especially 
with XFS.


3) If you are still having trouble, testing your network, disk speeds, 
and using rados bench to test the object store all may be helpful.




Thanks

James


Good luck!



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor write performance

2013-04-18 Thread Wolfgang Hennerbichler
Hi James,

This is just pure speculation, but can you assure that the bonding works
correctly? Maybe you have issues there. I have seen a lot of incorrectly
configured bonding throughout my life as a unix admin.

Maybe this could help you a little:
http://www.wogri.at/Port-Channeling-802-3ad.338.0.html

On 04/18/2013 01:46 PM, James Harper wrote:
> I'm doing some basic testing so I'm not really fussed about poor performance, 
> but my write performance appears to be so bad I think I'm doing something 
> wrong.
> 
> Using dd to test gives me kbytes/second for write performance for 4kb block 
> sizes, while read performance is acceptable (for testing at least). For dd 
> I'm using iflag=direct for read and oflag=direct for write testing.
> 
> My setup, approximately, is:
> 
> Two OSD's
> . 1 x 7200RPM SATA disk each
> . 2 x gigabit cluster network interfaces each in a bonded configuration 
> directly attached (osd to osd, no switch)
> . 1 x gigabit public network
> . journal on another spindle
> 
> Three MON's
> . 1 each on the OSD's
> . 1 on another server, which is also the one used for testing performance
> 
> I'm using debian packages from ceph which are version 0.56.4
> 
> For comparison, my existing production storage is 2 servers running DRBD with 
> iSCSI to the initiators which run Xen on top of (C)LVM volumes on top of 
> the iSCSI. Performance not spectacular but acceptable. The servers in 
> question are the same specs as the servers I'm testing on.
> 
> Where should I start looking for performance problems? I've tried running 
> some of the benchmark stuff in the documentation but I haven't gotten very 
> far...
> 
> Thanks
> 
> James
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbich...@risc-software.at
http://www.risc-software.at
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


poor write performance

2013-04-18 Thread James Harper
I'm doing some basic testing so I'm not really fussed about poor performance, 
but my write performance appears to be so bad I think I'm doing something wrong.

Using dd to test gives me kbytes/second for write performance for 4kb block 
sizes, while read performance is acceptable (for testing at least). For dd I'm 
using iflag=direct for read and oflag=direct for write testing.
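
Something along these lines (device name and count are placeholders):

# dd if=/dev/rbd0 of=/dev/null bs=4k count=25600 iflag=direct
# dd if=/dev/zero of=/dev/rbd0 bs=4k count=25600 oflag=direct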

My setup, approximately, is:

Two OSD's
. 1 x 7200RPM SATA disk each
. 2 x gigabit cluster network interfaces each in a bonded configuration 
directly attached (osd to osd, no switch)
. 1 x gigabit public network
. journal on another spindle

Three MON's
. 1 each on the OSD's
. 1 on another server, which is also the one used for testing performance

I'm using debian packages from ceph which are version 0.56.4

For comparison, my existing production storage is 2 servers running DRBD with 
iSCSI to the initiators which run Xen on top of (C)LVM volumes on top of the 
iSCSI. Performance not spectacular but acceptable. The servers in question are 
the same specs as the servers I'm testing on.

Where should I start looking for performance problems? I've tried running some 
of the benchmark stuff in the documentation but I haven't gotten very far...

Thanks

James

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mysteriously poor write performance

2012-03-27 Thread Samuel Just
Sorry for the delayed reply... I've been tracking some issues which
cause high latency on our test machines, and they may be responsible for
your problems as well.  Could you retry those runs with the same
debugging and 'journal dio' set to false?
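
(That's just a ceph.conf change on the OSD hosts, roughly:

[osd]
    journal dio = false

followed by restarting the OSDs.)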

Thanks for your patience,
-Sam

On Sat, Mar 24, 2012 at 12:09 PM, Andrey Korolyov  wrote:
> http://xdel.ru/downloads/ceph-logs-dbg/
>
> On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just  wrote:
>> (CCing the list)
>>
>> Actually, can you could re-do the rados bench run with 'debug journal
>> = 20' along with the other debugging?  That should give us better
>> information.
>>
>> -Sam
>>
>> On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov  wrote:
>>> Hi Sam,
>>>
>>> Can you please suggest on where to start profiling osd? If the
>>> bottleneck has related to such non-complex things as directio speed,
>>> I`m sure that I was able to catch it long ago, even crossing around by
>>> results of other types of benchmarks at host system. I`ve just tried
>>> tmpfs under both journals, it has a small boost effect, as expected
>>> because of near-zero i/o delay. May be chunk distribution mechanism
>>> does not work well on such small amount of nodes but right now I don`t
>>> have necessary amount of hardware nodes to prove or disprove that.
>>>
>>> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov  wrote:
 random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
 Starting 1 process
 Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
 random-rw: (groupid=0, jobs=1): err= 0: pid=9647
  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, 
 stdev=5770.28
  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%


 On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just  
 wrote:
> Our journal writes are actually sequential.  Could you send FIO
> results for sequential 4k writes osd.0's journal and osd.1's journal?
> -Sam
>
> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov  wrote:
>> FIO output for journal partition, directio enabled, seems good(same
>> results for ext4 on other single sata disks).
>>
>> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>> Starting 1 process
>> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
>> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, 
>> stdev=480.05
>>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>> >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     issued r/w: total=0/40960, short=0/0
>>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>>     lat (msec): 500=0.04%
>>
>>
>>
>> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just  
>> wrote:
>>> (CCing the list)
>>>
>>> So, the problem isn't the bandwidth.  Before we respond to the client,
>>> we write the operation to the journal.  In this case, that operation
>>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>>> only allow a limited number of ops in flight at a time, so this
>>> latency is killing your throughput.  For comparison, the latency for
>>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>>> latency for writes to your osd.1 journal file?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
 Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
 not Megabits.

 On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  
 wrote:
> [global]
>       log dir = /ceph/out
>       log_file = ""
>       logger dir = /ceph/log
>       pid file = /ceph/out/$type$id.pid
> [mds]
>       pid file = /ceph/out/$name.pid
>       lockdep = 1

Re: Mysteriously poor write performance

2012-03-24 Thread Andrey Korolyov
http://xdel.ru/downloads/ceph-logs-dbg/

On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just  wrote:
> (CCing the list)
>
> Actually, can you could re-do the rados bench run with 'debug journal
> = 20' along with the other debugging?  That should give us better
> information.
>
> -Sam
>
> On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov  wrote:
>> Hi Sam,
>>
>> Can you please suggest on where to start profiling osd? If the
>> bottleneck has related to such non-complex things as directio speed,
>> I`m sure that I was able to catch it long ago, even crossing around by
>> results of other types of benchmarks at host system. I`ve just tried
>> tmpfs under both journals, it has a small boost effect, as expected
>> because of near-zero i/o delay. May be chunk distribution mechanism
>> does not work well on such small amount of nodes but right now I don`t
>> have necessary amount of hardware nodes to prove or disprove that.
>>
>> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov  wrote:
>>> random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>>> Starting 1 process
>>> Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
>>> random-rw: (groupid=0, jobs=1): err= 0: pid=9647
>>>  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
>>>    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
>>>    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
>>>  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
>>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>>> >=64=0.0%
>>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.0%
>>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.0%
>>>     issued r/w: total=0/40960, short=0/0
>>>     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
>>>     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
>>>
>>>
>>> On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just  wrote:
 Our journal writes are actually sequential.  Could you send FIO
 results for sequential 4k writes osd.0's journal and osd.1's journal?
 -Sam

 On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov  wrote:
> FIO output for journal partition, directio enabled, seems good(same
> results for ext4 on other single sata disks).
>
> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, 
> stdev=480.05
>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
> >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>     issued r/w: total=0/40960, short=0/0
>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>     lat (msec): 500=0.04%
>
>
>
> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just  
> wrote:
>> (CCing the list)
>>
>> So, the problem isn't the bandwidth.  Before we respond to the client,
>> we write the operation to the journal.  In this case, that operation
>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>> only allow a limited number of ops in flight at a time, so this
>> latency is killing your throughput.  For comparison, the latency for
>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>> latency for writes to your osd.1 journal file?
>> -Sam
>>
>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
>>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>>> not Megabits.
>>>
>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  
>>> wrote:
 [global]
       log dir = /ceph/out
       log_file = ""
       logger dir = /ceph/log
       pid file = /ceph/out/$type$id.pid
 [mds]
       pid file = /ceph/out/$name.pid
       lockdep = 1
       mds log max segments = 2
 [osd]
       lockdep = 1
       filestore_xattr_use_omap = 1
       osd data = /ceph/dev/osd$id
       osd journal = /ceph/meta/journal
       osd journal size = 100
 [mon]
       lockdep = 1
       mon data = /ceph/dev/mon$id
 [mon.0]
       host = 172.20.1.32
       mon addr = 172.20.1.32:6789
 [mon.1]
 

Re: Mysteriously poor write performance

2012-03-23 Thread Samuel Just
(CCing the list)

Actually, could you re-do the rados bench run with 'debug journal
= 20' along with the other debugging?  That should give us better
information.
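That is, roughly (assuming the [osd] section of ceph.conf is where the other
debug settings live):

[osd]
        debug journal = 20

followed by an osd restart before repeating the bench.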

-Sam

On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov  wrote:
> Hi Sam,
>
> Can you please suggest on where to start profiling osd? If the
> bottleneck has related to such non-complex things as directio speed,
> I`m sure that I was able to catch it long ago, even crossing around by
> results of other types of benchmarks at host system. I`ve just tried
> tmpfs under both journals, it has a small boost effect, as expected
> because of near-zero i/o delay. May be chunk distribution mechanism
> does not work well on such small amount of nodes but right now I don`t
> have necessary amount of hardware nodes to prove or disprove that.
>
> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov  wrote:
>> random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>> Starting 1 process
>> Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
>> random-rw: (groupid=0, jobs=1): err= 0: pid=9647
>>  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
>>    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
>>    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
>>  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     issued r/w: total=0/40960, short=0/0
>>     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
>>     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
>>
>>
>> On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just  wrote:
>>> Our journal writes are actually sequential.  Could you send FIO
>>> results for sequential 4k writes osd.0's journal and osd.1's journal?
>>> -Sam
>>>
>>> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov  wrote:
 FIO output for journal partition, directio enabled, seems good(same
 results for ext4 on other single sata disks).

 random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
 Starting 1 process
 Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
 random-rw: (groupid=0, jobs=1): err= 0: pid=21926
  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
     lat (msec): 500=0.04%



 On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just  
 wrote:
> (CCing the list)
>
> So, the problem isn't the bandwidth.  Before we respond to the client,
> we write the operation to the journal.  In this case, that operation
> is taking >1s per operation on osd.1.  Both rbd and rados bench will
> only allow a limited number of ops in flight at a time, so this
> latency is killing your throughput.  For comparison, the latency for
> writing to the journal on osd.0 is < .3s.  Can you measure direct io
> latency for writes to your osd.1 journal file?
> -Sam
>
> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>> not Megabits.
>>
>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  wrote:
>>> [global]
>>>       log dir = /ceph/out
>>>       log_file = ""
>>>       logger dir = /ceph/log
>>>       pid file = /ceph/out/$type$id.pid
>>> [mds]
>>>       pid file = /ceph/out/$name.pid
>>>       lockdep = 1
>>>       mds log max segments = 2
>>> [osd]
>>>       lockdep = 1
>>>       filestore_xattr_use_omap = 1
>>>       osd data = /ceph/dev/osd$id
>>>       osd journal = /ceph/meta/journal
>>>       osd journal size = 100
>>> [mon]
>>>       lockdep = 1
>>>       mon data = /ceph/dev/mon$id
>>> [mon.0]
>>>       host = 172.20.1.32
>>>       mon addr = 172.20.1.32:6789
>>> [mon.1]
>>>       host = 172.20.1.33
>>>       mon addr = 172.20.1.33:6789
>>> [mon.2]
>>>       host = 172.20.1.35
>>>       mon addr = 172.20.1.35:6789
>>> [osd.0]
>>>       host = 172.20.1.32
>>> [osd.1]
>>> 

Re: Mysteriously poor write performance

2012-03-22 Thread Andrey Korolyov
random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
random-rw: (groupid=0, jobs=1): err= 0: pid=9647
  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%


On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just  wrote:
> Our journal writes are actually sequential.  Could you send FIO
> results for sequential 4k writes osd.0's journal and osd.1's journal?
> -Sam
>
> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov  wrote:
>> FIO output for journal partition, directio enabled, seems good(same
>> results for ext4 on other single sata disks).
>>
>> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>> Starting 1 process
>> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
>> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>> >=64=0.0%
>>     issued r/w: total=0/40960, short=0/0
>>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>>     lat (msec): 500=0.04%
>>
>>
>>
>> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just  wrote:
>>> (CCing the list)
>>>
>>> So, the problem isn't the bandwidth.  Before we respond to the client,
>>> we write the operation to the journal.  In this case, that operation
>>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>>> only allow a limited number of ops in flight at a time, so this
>>> latency is killing your throughput.  For comparison, the latency for
>>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>>> latency for writes to your osd.1 journal file?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
 Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
 not Megabits.

 On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  wrote:
> [global]
>       log dir = /ceph/out
>       log_file = ""
>       logger dir = /ceph/log
>       pid file = /ceph/out/$type$id.pid
> [mds]
>       pid file = /ceph/out/$name.pid
>       lockdep = 1
>       mds log max segments = 2
> [osd]
>       lockdep = 1
>       filestore_xattr_use_omap = 1
>       osd data = /ceph/dev/osd$id
>       osd journal = /ceph/meta/journal
>       osd journal size = 100
> [mon]
>       lockdep = 1
>       mon data = /ceph/dev/mon$id
> [mon.0]
>       host = 172.20.1.32
>       mon addr = 172.20.1.32:6789
> [mon.1]
>       host = 172.20.1.33
>       mon addr = 172.20.1.33:6789
> [mon.2]
>       host = 172.20.1.35
>       mon addr = 172.20.1.35:6789
> [osd.0]
>       host = 172.20.1.32
> [osd.1]
>       host = 172.20.1.33
> [mds.a]
>       host = 172.20.1.32
>
> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
> /dev/mapper/system-cephmeta on /ceph/meta type ext4 
> (rw,barrier=0,user_xattr)
> Simple performance tests on those fs shows ~133Mb/s for /ceph and
> metadata/. Also both machines do not hold anything else which may
> impact osd.
>
> Also please note of following:
>
> http://i.imgur.com/ZgFdO.png
>
> First two peaks are related to running rados bench, then goes cluster
> recreation, automated debian install and final peaks are dd test.
> Surely I can have more precise graphs, but current one probably enough
> to state a situation - rbd utilizing about a quarter of possible
> bandwidth(if we can count rados bench as 100%).
>
> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just  
> wrote:
>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>> with the osd.1 jo

Re: Mysteriously poor write performance

2012-03-22 Thread Samuel Just
Our journal writes are actually sequential.  Could you send FIO
results for sequential 4k writes osd.0's journal and osd.1's journal?
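A minimal sketch of such a run, assuming fio is available on both hosts and
writing to a scratch file on the same filesystem as the journal (run it on each
osd host in turn):

fio --name=journal-seq --filename=/ceph/meta/fio-seq-test --rw=write --bs=4k --size=160m --direct=1 --ioengine=sync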
-Sam

On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov  wrote:
> FIO output for journal partition, directio enabled, seems good(same
> results for ext4 on other single sata disks).
>
> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     issued r/w: total=0/40960, short=0/0
>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>     lat (msec): 500=0.04%
>
>
>
> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just  wrote:
>> (CCing the list)
>>
>> So, the problem isn't the bandwidth.  Before we respond to the client,
>> we write the operation to the journal.  In this case, that operation
>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>> only allow a limited number of ops in flight at a time, so this
>> latency is killing your throughput.  For comparison, the latency for
>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>> latency for writes to your osd.1 journal file?
>> -Sam
>>
>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
>>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>>> not Megabits.
>>>
>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  wrote:
 [global]
       log dir = /ceph/out
       log_file = ""
       logger dir = /ceph/log
       pid file = /ceph/out/$type$id.pid
 [mds]
       pid file = /ceph/out/$name.pid
       lockdep = 1
       mds log max segments = 2
 [osd]
       lockdep = 1
       filestore_xattr_use_omap = 1
       osd data = /ceph/dev/osd$id
       osd journal = /ceph/meta/journal
       osd journal size = 100
 [mon]
       lockdep = 1
       mon data = /ceph/dev/mon$id
 [mon.0]
       host = 172.20.1.32
       mon addr = 172.20.1.32:6789
 [mon.1]
       host = 172.20.1.33
       mon addr = 172.20.1.33:6789
 [mon.2]
       host = 172.20.1.35
       mon addr = 172.20.1.35:6789
 [osd.0]
       host = 172.20.1.32
 [osd.1]
       host = 172.20.1.33
 [mds.a]
       host = 172.20.1.32

 /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
 /dev/mapper/system-cephmeta on /ceph/meta type ext4 
 (rw,barrier=0,user_xattr)
 Simple performance tests on those fs shows ~133Mb/s for /ceph and
 metadata/. Also both machines do not hold anything else which may
 impact osd.

 Also please note of following:

 http://i.imgur.com/ZgFdO.png

 First two peaks are related to running rados bench, then goes cluster
 recreation, automated debian install and final peaks are dd test.
 Surely I can have more precise graphs, but current one probably enough
 to state a situation - rbd utilizing about a quarter of possible
 bandwidth(if we can count rados bench as 100%).

 On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just  
 wrote:
> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
> osd.1...  Could you post your ceph.conf?  Might there be a problem
> with the osd.1 journal disk?
> -Sam
>
> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov  wrote:
>> Oh, sorry - they probably inherited rights from log files, fixed.
>>
>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just  
>> wrote:
>>> I get 403 Forbidden when I try to download any of the files.
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov  
>>> wrote:
 http://xdel.ru/downloads/ceph-logs/

 1/ contains logs related to bench initiated at the osd0 machine and 2/
 - at osd1.

 On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just  
 wrote:
> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
> post osd.1's logs?
> -Sam
>
> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov  
> wrote:
>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>
>> Sometimes 'cur MB/s ' shows zero during rados bench, even if any 
>> debug
>> output disabled and log_file set to 

Re: Mysteriously poor write performance

2012-03-21 Thread Samuel Just
(CCing the list)

So, the problem isn't the bandwidth.  Before we respond to the client,
we write the operation to the journal.  In this case, that operation
is taking >1s per operation on osd.1.  Both rbd and rados bench will
only allow a limited number of ops in flight at a time, so this
latency is killing your throughput.  For comparison, the latency for
writing to the journal on osd.0 is < .3s.  Can you measure direct io
latency for writes to your osd.1 journal file?
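As a rough sketch of measuring that (writing to a scratch file on the journal's
filesystem rather than to the journal itself):

dd if=/dev/zero of=/ceph/meta/latency-test bs=4k count=1000 oflag=direct,dsync

then divide the elapsed time by the count for a per-write figure; an fio run
with direct=1 on the same file reports the equivalent as 'clat'.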
-Sam

On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov  wrote:
> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
> not Megabits.
>
> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov  wrote:
>> [global]
>>       log dir = /ceph/out
>>       log_file = ""
>>       logger dir = /ceph/log
>>       pid file = /ceph/out/$type$id.pid
>> [mds]
>>       pid file = /ceph/out/$name.pid
>>       lockdep = 1
>>       mds log max segments = 2
>> [osd]
>>       lockdep = 1
>>       filestore_xattr_use_omap = 1
>>       osd data = /ceph/dev/osd$id
>>       osd journal = /ceph/meta/journal
>>       osd journal size = 100
>> [mon]
>>       lockdep = 1
>>       mon data = /ceph/dev/mon$id
>> [mon.0]
>>       host = 172.20.1.32
>>       mon addr = 172.20.1.32:6789
>> [mon.1]
>>       host = 172.20.1.33
>>       mon addr = 172.20.1.33:6789
>> [mon.2]
>>       host = 172.20.1.35
>>       mon addr = 172.20.1.35:6789
>> [osd.0]
>>       host = 172.20.1.32
>> [osd.1]
>>       host = 172.20.1.33
>> [mds.a]
>>       host = 172.20.1.32
>>
>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
>> Simple performance tests on those fs shows ~133Mb/s for /ceph and
>> metadata/. Also both machines do not hold anything else which may
>> impact osd.
>>
>> Also please note of following:
>>
>> http://i.imgur.com/ZgFdO.png
>>
>> First two peaks are related to running rados bench, then goes cluster
>> recreation, automated debian install and final peaks are dd test.
>> Surely I can have more precise graphs, but current one probably enough
>> to state a situation - rbd utilizing about a quarter of possible
>> bandwidth(if we can count rados bench as 100%).
>>
>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just  wrote:
>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>>> with the osd.1 journal disk?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov  wrote:
 Oh, sorry - they probably inherited rights from log files, fixed.

 On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just  
 wrote:
> I get 403 Forbidden when I try to download any of the files.
> -Sam
>
> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov  wrote:
>> http://xdel.ru/downloads/ceph-logs/
>>
>> 1/ contains logs related to bench initiated at the osd0 machine and 2/
>> - at osd1.
>>
>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just  
>> wrote:
>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
>>> post osd.1's logs?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov  wrote:
 Here, please: http://xdel.ru/downloads/ceph.log.gz

 Sometimes 'cur MB/s ' shows zero during rados bench, even if any debug
 output disabled and log_file set to the empty value, hope it`s okay.

 On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just  
 wrote:
> Can you set osd and filestore debugging to 20, restart the osds, run
> rados bench as before, and post the logs?
> -Sam Just
>
> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov  
> wrote:
>> rados bench 60 write -p data
>> 
>> Total time run:        61.217676
>> Total writes made:     989
>> Write size:            4194304
>> Bandwidth (MB/sec):    64.622
>>
>> Average Latency:       0.989608
>> Max latency:           2.21701
>> Min latency:           0.255315
>>
>> Here a snip from osd log, seems write size is okay.
>>
>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>> active+clean]  removing repgather(0x31b5360 applying 10'83 
>> rep_tid=597
>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 
>> [write
>> 1220608~4096] 0.17eb9fd8) v4)
>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>> active+clean]    q front is repgather(0x31b5360 applying 10'83
>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
>> rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)
>>
>> So

Re: Mysteriously poor write performance

2012-03-20 Thread Samuel Just
Can you set osd and filestore debugging to 20, restart the osds, run
rados bench as before, and post the logs?
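A minimal sketch of that, assuming the settings go into the [osd] section of
ceph.conf before the restart:

[osd]
        debug osd = 20
        debug filestore = 20

then restart both osds and repeat the earlier run, e.g.:

rados bench 60 write -p data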
-Sam Just

On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov  wrote:
> rados bench 60 write -p data
> 
> Total time run:        61.217676
> Total writes made:     989
> Write size:            4194304
> Bandwidth (MB/sec):    64.622
>
> Average Latency:       0.989608
> Max latency:           2.21701
> Min latency:           0.255315
>
> Here a snip from osd log, seems write size is okay.
>
> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
> active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write
> 1220608~4096] 0.17eb9fd8) v4)
> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
> active+clean]    q front is repgather(0x31b5360 applying 10'83
> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
> rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)
>
> Sorry for my previous question about rbd chunks, it was really stupid :)
>
> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin  
> wrote:
>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>
>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>>> mentioned too small value and I`ve changed it to 64M before posting
>>> previous message with no success - both 8M and this value cause a
>>> performance drop. When I tried to wrote small amount of data that can
>>> be compared to writeback cache size(both on raw device and ext3 with
>>> sync option), following results were made:
>>
>>
>> I just want to clarify that the writeback window isn't a full writeback
>> cache - it doesn't affect reads, and does not help with request merging etc.
>> It simply allows a bunch of writes to be in flight while acking the write to
>> the guest immediately. We're working on a full-fledged writeback cache that
>> to replace the writeback window.
>>
>>
>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>>> same without oflag there and in the following samples)
>>> 10+0 records in
>>> 10+0 records out
>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>> 20+0 records in
>>> 20+0 records out
>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>> 30+0 records in
>>> 30+0 records out
>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>
>>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>>> results _with_ writeback cache than without, as I`ve mentioned before.
>>>  Here the bench results, they`re almost equal on both nodes:
>>>
>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>>
>>
>> One thing to check is the size of the writes that are actually being sent by
>> rbd. The guest is probably splitting them into relatively small (128 or
>> 256k) writes. Ideally it would be sending 4k writes, and this should be a
>> lot faster.
>>
>> You can see the writes being sent by adding debug_ms=1 to the client or osd.
>> The format is osd_op(.*[write OFFSET~LENGTH]).
>>
>>
>>> Also, because I`ve not mentioned it before, network performance is
>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>>> is not interrupt problem or something like it - even if ceph-osd,
>>> ethernet card queues and kvm instance pinned to different sets of
>>> cores, nothing changes.
>>>
>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>>   wrote:

 It sounds like maybe you're using Xen? The "rbd writeback window" option
 only works for userspace rbd implementations (eg, KVM).
 If you are using KVM, you probably want 8192 (~80MB) rather than
 8192000 (~8MB).

 What options are you running dd with? If you run a rados bench from both
 machines, what do the results look like?
 Also, can you do the ceph osd bench on each of your OSDs, please?
 (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
 -Greg


 On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:

> More strangely, writing speed drops down by fifteen percent when this
> option was set in vm` config(instead of result from
> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
> under heavy load.
>
> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net)>  wrote:
>>
>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>
>>> Hi,
>>>
>>> I`ve did some performance test

Re: Mysteriously poor write performance

2012-03-20 Thread Andrey Korolyov
rados bench 60 write -p data

Total time run:        61.217676
Total writes made:     989
Write size:            4194304
Bandwidth (MB/sec):    64.622

Average Latency:       0.989608
Max latency:           2.21701
Min latency:           0.255315

Here is a snippet from the osd log; the write size seems okay.

2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
(0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write
1220608~4096] 0.17eb9fd8) v4)
2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
(0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
active+clean]    q front is repgather(0x31b5360 applying 10'83
rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4)

Sorry for my previous question about rbd chunks, it was really stupid :)

On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin  wrote:
> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>
>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>> mentioned too small value and I`ve changed it to 64M before posting
>> previous message with no success - both 8M and this value cause a
>> performance drop. When I tried to wrote small amount of data that can
>> be compared to writeback cache size(both on raw device and ext3 with
>> sync option), following results were made:
>
>
> I just want to clarify that the writeback window isn't a full writeback
> cache - it doesn't affect reads, and does not help with request merging etc.
> It simply allows a bunch of writes to be in flight while acking the write to
> the guest immediately. We're working on a full-fledged writeback cache that
> to replace the writeback window.
>
>
>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>> same without oflag there and in the following samples)
>> 10+0 records in
>> 10+0 records out
>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>> 20+0 records in
>> 20+0 records out
>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>> 30+0 records in
>> 30+0 records out
>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>
>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>> results _with_ writeback cache than without, as I`ve mentioned before.
>>  Here the bench results, they`re almost equal on both nodes:
>>
>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>
>
> One thing to check is the size of the writes that are actually being sent by
> rbd. The guest is probably splitting them into relatively small (128 or
> 256k) writes. Ideally it would be sending 4k writes, and this should be a
> lot faster.
>
> You can see the writes being sent by adding debug_ms=1 to the client or osd.
> The format is osd_op(.*[write OFFSET~LENGTH]).
>
>
>> Also, because I`ve not mentioned it before, network performance is
>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>> is not interrupt problem or something like it - even if ceph-osd,
>> ethernet card queues and kvm instance pinned to different sets of
>> cores, nothing changes.
>>
>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>   wrote:
>>>
>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>> only works for userspace rbd implementations (eg, KVM).
>>> If you are using KVM, you probably want 8192 (~80MB) rather than
>>> 8192000 (~8MB).
>>>
>>> What options are you running dd with? If you run a rados bench from both
>>> machines, what do the results look like?
>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>> -Greg
>>>
>>>
>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>
 More strangely, writing speed drops down by fifteen percent when this
 option was set in vm` config(instead of result from
 http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
 As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
 recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
 under heavy load.

 On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil>>> (mailto:s...@newdream.net)>  wrote:
>
> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>
>> Hi,
>>
>> I`ve did some performance tests at the following configuration:
>>
>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>> disks on each r410 arranged into raid0 and holds osd data when fourth
>> holds os and osd` journal partition, all 

Re: Mysteriously poor write performance

2012-03-19 Thread Andrey Korolyov
Thanks to Greg, I have noticed a very strange thing - the data pool is filled
with a bunch of objects like rb.0.0.04db with a typical size of
4194304, while the original pool for the guest os has a size of only 112 (created as
40g). It seems that something went wrong, because on 0.42 I had more
impressive performance on cheaper hardware. At first I blamed the
recent crash and recreated the cluster from scratch about an hour ago, but
those objects were created in a bare data/ pool with only one vm.




On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin  wrote:
> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>
>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>> mentioned too small value and I`ve changed it to 64M before posting
>> previous message with no success - both 8M and this value cause a
>> performance drop. When I tried to wrote small amount of data that can
>> be compared to writeback cache size(both on raw device and ext3 with
>> sync option), following results were made:
>
>
> I just want to clarify that the writeback window isn't a full writeback
> cache - it doesn't affect reads, and does not help with request merging etc.
> It simply allows a bunch of writes to be in flight while acking the write to
> the guest immediately. We're working on a full-fledged writeback cache that
> to replace the writeback window.
>
>
>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>> same without oflag there and in the following samples)
>> 10+0 records in
>> 10+0 records out
>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>> 20+0 records in
>> 20+0 records out
>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>> 30+0 records in
>> 30+0 records out
>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>
>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>> results _with_ writeback cache than without, as I`ve mentioned before.
>>  Here the bench results, they`re almost equal on both nodes:
>>
>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>
>
> One thing to check is the size of the writes that are actually being sent by
> rbd. The guest is probably splitting them into relatively small (128 or
> 256k) writes. Ideally it would be sending 4k writes, and this should be a
> lot faster.
>
> You can see the writes being sent by adding debug_ms=1 to the client or osd.
> The format is osd_op(.*[write OFFSET~LENGTH]).
>
>
>> Also, because I`ve not mentioned it before, network performance is
>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>> is not interrupt problem or something like it - even if ceph-osd,
>> ethernet card queues and kvm instance pinned to different sets of
>> cores, nothing changes.
>>
>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>   wrote:
>>>
>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>> only works for userspace rbd implementations (eg, KVM).
>>> If you are using KVM, you probably want 8192 (~80MB) rather than
>>> 8192000 (~8MB).
>>>
>>> What options are you running dd with? If you run a rados bench from both
>>> machines, what do the results look like?
>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>> -Greg
>>>
>>>
>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>
 More strangely, writing speed drops down by fifteen percent when this
 option was set in vm` config(instead of result from
 http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
 As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
 recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
 under heavy load.

 On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil>>> (mailto:s...@newdream.net)>  wrote:
>
> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>
>> Hi,
>>
>> I`ve did some performance tests at the following configuration:
>>
>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>> disks on each r410 arranged into raid0 and holds osd data when fourth
>> holds os and osd` journal partition, all ceph-related stuff mounted on
>> the ext4 without barriers.
>>
>> Firstly, I`ve noticed about a difference of benchmark performance and
>> write speed through rbd from small kvm instance running on one of
>> first two machines - when bench gave me about 110Mb/s, writing zeros
>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>> Things get worse, when I`ve started second vm at second hos

Re: Mysteriously poor write performance

2012-03-19 Thread Josh Durgin

On 03/19/2012 11:13 AM, Andrey Korolyov wrote:

Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
mentioned too small value and I`ve changed it to 64M before posting
previous message with no success - both 8M and this value cause a
performance drop. When I tried to wrote small amount of data that can
be compared to writeback cache size(both on raw device and ext3 with
sync option), following results were made:


I just want to clarify that the writeback window isn't a full writeback 
cache - it doesn't affect reads, and does not help with request merging 
etc. It simply allows a bunch of writes to be in flight while acking the 
write to the guest immediately. We're working on a full-fledged 
writeback cache to replace the writeback window.



dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
same without oflag there and in the following samples)
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
20+0 records in
20+0 records out
209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
30+0 records in
30+0 records out
314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

and so on. Reference test with bs=1M and count=2000 has slightly worse
results _with_ writeback cache than without, as I`ve mentioned before.
  Here the bench results, they`re almost equal on both nodes:

bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec


One thing to check is the size of the writes that are actually being 
sent by rbd. The guest is probably splitting them into relatively small 
(128 or 256k) writes. Ideally it would be sending 4k writes, and this 
should be a lot faster.


You can see the writes being sent by adding debug_ms=1 to the client or 
osd. The format is osd_op(.*[write OFFSET~LENGTH]).
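As a sketch, assuming it is added to the relevant daemon's section of
ceph.conf (or [global]) and the daemon restarted:

[global]
        debug ms = 1

then grep that daemon's log for the write sizes, along the lines of:

grep -o 'write [0-9]*~[0-9]*' <client or osd log>

where the number after '~' is the length of each write in bytes.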



Also, because I`ve not mentioned it before, network performance is
enough to hold fair gigabit connectivity with MTU 1500. Seems that it
is not interrupt problem or something like it - even if ceph-osd,
ethernet card queues and kvm instance pinned to different sets of
cores, nothing changes.

On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
  wrote:

It sounds like maybe you're using Xen? The "rbd writeback window" option only 
works for userspace rbd implementations (eg, KVM).
If you are using KVM, you probably want 8192 (~80MB) rather than 8192000 
(~8MB).

What options are you running dd with? If you run a rados bench from both 
machines, what do the results look like?
Also, can you do the ceph osd bench on each of your OSDs, please? 
(http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
-Greg


On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:


More strangely, writing speed drops down by fifteen percent when this
option was set in vm` config(instead of result from
http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
under heavy load.

On Sun, Mar 18, 2012 at 10:22 PM, Sage Weilmailto:s...@newdream.net)>  wrote:

On Sat, 17 Mar 2012, Andrey Korolyov wrote:

Hi,

I`ve did some performance tests at the following configuration:

mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
dom0 with three dedicated cores and 1.5G, mostly idle. First three
disks on each r410 arranged into raid0 and holds osd data when fourth
holds os and osd` journal partition, all ceph-related stuff mounted on
the ext4 without barriers.

Firstly, I`ve noticed about a difference of benchmark performance and
write speed through rbd from small kvm instance running on one of
first two machines - when bench gave me about 110Mb/s, writing zeros
to raw block device inside vm with dd was at top speed about 45 mb/s,
for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
Things get worse, when I`ve started second vm at second host and tried
to continue same dd tests simultaneously - performance fairly divided
by half for each instance :). Enabling jumbo frames, playing with cpu
affinity for ceph and vm instances and trying different TCP congestion
protocols gave no effect at all - with DCTCP I have slightly smoother
network load graph and that`s all.

Can ml please suggest anything to try to improve performance?


Can you try setting

rbd writeback window = 8192000

or similar, and see what kind of effect that has? I suspect it'll speed
up dd; I'm less sure about ext3.

Thanks!
sage




ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org 
(mailto:majord...@vger.kernel.org)
More majordomo info at http://vger.kernel.org/majordomo-info.html





--
To unsubscribe fr

Re: Mysteriously poor write performance

2012-03-19 Thread Greg Farnum
On Monday, March 19, 2012 at 11:13 AM, Andrey Korolyov wrote:
> Nope, I`m using KVM for rbd guests.

Ah, okay — I'm not sure what your reference to dom0 and mon2 meant, then?
  
> Surely I`ve been noticed that Sage
> mentioned too small value and I`ve changed it to 64M before posting
> previous message with no success - both 8M and this value cause a
> performance drop. When I tried to wrote small amount of data that can
> be compared to writeback cache size(both on raw device and ext3 with
> sync option), following results were made:
> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
> same without oflag there and in the following samples)
> 10+0 records in
> 10+0 records out
> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
> 20+0 records in
> 20+0 records out
> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
> 30+0 records in
> 30+0 records out
> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>  
> and so on. Reference test with bs=1M and count=2000 has slightly worse
> results _with_ writeback cache than without, as I`ve mentioned before.
> Here the bench results, they`re almost equal on both nodes:
>  
> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
Okay, this is all a little odd to me. Can you send along your ceph.conf (along 
with any other pool config changes you've made) and the output from a rados 
bench (60 seconds or so)?
-Greg
  
>  
> Also, because I`ve not mentioned it before, network performance is
> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
> is not interrupt problem or something like it - even if ceph-osd,
> ethernet card queues and kvm instance pinned to different sets of
> cores, nothing changes.
>  
> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
> mailto:gregory.far...@dreamhost.com)> wrote:
> > It sounds like maybe you're using Xen? The "rbd writeback window" option 
> > only works for userspace rbd implementations (eg, KVM).
> > If you are using KVM, you probably want 8192 (~80MB) rather than 
> > 8192000 (~8MB).
> >  
> > What options are you running dd with? If you run a rados bench from both 
> > machines, what do the results look like?
> > Also, can you do the ceph osd bench on each of your OSDs, please? 
> > (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
> > -Greg
> >  
> >  
> > On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
> >  
> > > More strangely, writing speed drops down by fifteen percent when this
> > > option was set in vm` config(instead of result from
> > > http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
> > > As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
> > > recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
> > > 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
> > > under heavy load.
> > >  
> > > On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil  > > (mailto:s...@newdream.net)> wrote:
> > > > On Sat, 17 Mar 2012, Andrey Korolyov wrote:
> > > > > Hi,
> > > > >  
> > > > > I`ve did some performance tests at the following configuration:
> > > > >  
> > > > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
> > > > > dom0 with three dedicated cores and 1.5G, mostly idle. First three
> > > > > disks on each r410 arranged into raid0 and holds osd data when fourth
> > > > > holds os and osd` journal partition, all ceph-related stuff mounted on
> > > > > the ext4 without barriers.
> > > > >  
> > > > > Firstly, I`ve noticed about a difference of benchmark performance and
> > > > > write speed through rbd from small kvm instance running on one of
> > > > > first two machines - when bench gave me about 110Mb/s, writing zeros
> > > > > to raw block device inside vm with dd was at top speed about 45 mb/s,
> > > > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
> > > > > Things get worse, when I`ve started second vm at second host and tried
> > > > > to continue same dd tests simultaneously - performance fairly divided
> > > > > by half for each instance :). Enabling jumbo frames, playing with cpu
> > > > > affinity for ceph and vm instances and trying different TCP congestion
> > > > > protocols gave no effect at all - with DCTCP I have slightly smoother
> > > > > network load graph and that`s all.
> > > > >  
> > > > > Can ml please suggest anything to try to improve performance?
> > > >  
> > > > Can you try setting
> > > >  
> > > > rbd writeback window = 8192000
> > > >  
> > > > or similar, and see what kind of effect that has? I suspect it'll speed
> > > > up dd; I'm less sure about ext3.
> > > >  
> > > > Thanks!
> > > > sage
> > > >  
> > > >  
> > > > >  
> > > > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> 

Re: Mysteriously poor write performance

2012-03-19 Thread Andrey Korolyov
Nope, I`m using KVM for rbd guests. Certainly I noticed that Sage
mentioned too small a value, and I`ve changed it to 64M before posting
the previous message with no success - both 8M and this value cause a
performance drop. When I tried to write a small amount of data that can
be compared to the writeback cache size (both on raw device and ext3 with
sync option), the following results were obtained:
dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
same without oflag there and in the following samples)
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
20+0 records in
20+0 records out
209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
30+0 records in
30+0 records out
314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

and so on. Reference test with bs=1M and count=2000 has slightly worse
results _with_ writeback cache than without, as I`ve mentioned before.
 Here the bench results, they`re almost equal on both nodes:

bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

Also, because I`ve not mentioned it before, network performance is
enough to hold fair gigabit connectivity with MTU 1500. Seems that it
is not interrupt problem or something like it - even if ceph-osd,
ethernet card queues and kvm instance pinned to different sets of
cores, nothing changes.

On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
 wrote:
> It sounds like maybe you're using Xen? The "rbd writeback window" option only 
> works for userspace rbd implementations (eg, KVM).
> If you are using KVM, you probably want 8192 (~80MB) rather than 8192000 
> (~8MB).
>
> What options are you running dd with? If you run a rados bench from both 
> machines, what do the results look like?
> Also, can you do the ceph osd bench on each of your OSDs, please? 
> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
> -Greg
>
>
> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>
>> More strangely, writing speed drops down by fifteen percent when this
>> option was set in vm` config(instead of result from
>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>> under heavy load.
>>
>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil > (mailto:s...@newdream.net)> wrote:
>> > On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>> > > Hi,
>> > >
>> > > I`ve did some performance tests at the following configuration:
>> > >
>> > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>> > > dom0 with three dedicated cores and 1.5G, mostly idle. First three
>> > > disks on each r410 arranged into raid0 and holds osd data when fourth
>> > > holds os and osd` journal partition, all ceph-related stuff mounted on
>> > > the ext4 without barriers.
>> > >
>> > > Firstly, I`ve noticed about a difference of benchmark performance and
>> > > write speed through rbd from small kvm instance running on one of
>> > > first two machines - when bench gave me about 110Mb/s, writing zeros
>> > > to raw block device inside vm with dd was at top speed about 45 mb/s,
>> > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>> > > Things get worse, when I`ve started second vm at second host and tried
>> > > to continue same dd tests simultaneously - performance fairly divided
>> > > by half for each instance :). Enabling jumbo frames, playing with cpu
>> > > affinity for ceph and vm instances and trying different TCP congestion
>> > > protocols gave no effect at all - with DCTCP I have slightly smoother
>> > > network load graph and that`s all.
>> > >
>> > > Can ml please suggest anything to try to improve performance?
>> >
>> > Can you try setting
>> >
>> > rbd writeback window = 8192000
>> >
>> > or similar, and see what kind of effect that has? I suspect it'll speed
>> > up dd; I'm less sure about ext3.
>> >
>> > Thanks!
>> > sage
>> >
>> >
>> > >
>> > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>> > > --
>> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > > the body of a message to majord...@vger.kernel.org 
>> > > (mailto:majord...@vger.kernel.org)
>> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
>> >
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org 
>> (mailto:majord...@vger.kernel.org)
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mysteriously poor write performance

2012-03-19 Thread Greg Farnum
It sounds like maybe you're using Xen? The "rbd writeback window" option only 
works for userspace rbd implementations (e.g., KVM). 
If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000 
(~8MB).

What options are you running dd with? If you run a rados bench from both 
machines, what do the results look like?
Also, can you do the ceph osd bench on each of your OSDs, please? 
(http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
-Greg
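
For the rados bench comparison requested here, something along these lines,
run from each machine against the same pool, is the usual form (pool name,
duration, and concurrency are only examples):

# 60-second write benchmark with 16 concurrent 4 MB operations
rados -p rbd bench 60 write -t 16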


On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:

> More strangely, writing speed drops down by fifteen percent when this
> option was set in vm` config(instead of result from
> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
> under heavy load.
> 
> > On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net) wrote:
> > On Sat, 17 Mar 2012, Andrey Korolyov wrote:
> > > Hi,
> > > 
> > > I`ve did some performance tests at the following configuration:
> > > 
> > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
> > > dom0 with three dedicated cores and 1.5G, mostly idle. First three
> > > disks on each r410 arranged into raid0 and holds osd data when fourth
> > > holds os and osd` journal partition, all ceph-related stuff mounted on
> > > the ext4 without barriers.
> > > 
> > > Firstly, I`ve noticed about a difference of benchmark performance and
> > > write speed through rbd from small kvm instance running on one of
> > > first two machines - when bench gave me about 110Mb/s, writing zeros
> > > to raw block device inside vm with dd was at top speed about 45 mb/s,
> > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
> > > Things get worse, when I`ve started second vm at second host and tried
> > > to continue same dd tests simultaneously - performance fairly divided
> > > by half for each instance :). Enabling jumbo frames, playing with cpu
> > > affinity for ceph and vm instances and trying different TCP congestion
> > > protocols gave no effect at all - with DCTCP I have slightly smoother
> > > network load graph and that`s all.
> > > 
> > > Can ml please suggest anything to try to improve performance?
> > 
> > Can you try setting
> > 
> > rbd writeback window = 8192000
> > 
> > or similar, and see what kind of effect that has? I suspect it'll speed
> > up dd; I'm less sure about ext3.
> > 
> > Thanks!
> > sage
> > 
> > 
> > > 
> > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majord...@vger.kernel.org 
> > > (mailto:majord...@vger.kernel.org)
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org 
> (mailto:majord...@vger.kernel.org)
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mysteriously poor write performance

2012-03-19 Thread Andrey Korolyov
More strangely, write speed drops by fifteen percent when this
option is set in the vm's config (rather than the result reported in
http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
As I mentioned, I'm using 0.43, but due to crashed osds, ceph has been
recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
1468d95101adfad44247016a1399aab6b86708d2 - both caused crashes
under heavy load.
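
For a qemu/KVM guest of this era the option is usually carried on the rbd
drive string rather than in a separate file; a sketch, with pool and image
names as placeholders and the exact key handling to be confirmed against the
qemu rbd driver in use:

# extra key=value pairs on the rbd filename are handed to librbd's configuration
-drive format=raw,if=virtio,file=rbd:rbd/vm-disk:rbd_writeback_window=8192000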

On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil  wrote:
> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>> Hi,
>>
>> I`ve did some performance tests at the following configuration:
>>
>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>> disks on each r410 arranged into raid0 and holds osd data when fourth
>> holds os and osd` journal partition, all ceph-related stuff mounted on
>> the ext4 without barriers.
>>
>> Firstly, I`ve noticed about a difference of benchmark performance and
>> write speed through rbd from small kvm instance running on one of
>> first two machines - when bench gave me about 110Mb/s, writing zeros
>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>> Things get worse, when I`ve started second vm at second host and tried
>> to continue same dd tests simultaneously - performance fairly divided
>> by half for each instance :). Enabling jumbo frames, playing with cpu
>> affinity for ceph and vm instances and trying different TCP congestion
>> protocols gave no effect at all - with DCTCP I have slightly smoother
>> network load graph and that`s all.
>>
>> Can ml please suggest anything to try to improve performance?
>
> Can you try setting
>
>        rbd writeback window = 8192000
>
> or similar, and see what kind of effect that has?  I suspect it'll speed
> up dd; I'm less sure about ext3.
>
> Thanks!
> sage
>
>
>>
>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mysteriously poor write performance

2012-03-18 Thread Sage Weil
On Sat, 17 Mar 2012, Andrey Korolyov wrote:
> Hi,
> 
> I`ve did some performance tests at the following configuration:
> 
> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
> dom0 with three dedicated cores and 1.5G, mostly idle. First three
> disks on each r410 arranged into raid0 and holds osd data when fourth
> holds os and osd` journal partition, all ceph-related stuff mounted on
> the ext4 without barriers.
> 
> Firstly, I`ve noticed about a difference of benchmark performance and
> write speed through rbd from small kvm instance running on one of
> first two machines - when bench gave me about 110Mb/s, writing zeros
> to raw block device inside vm with dd was at top speed about 45 mb/s,
> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
> Things get worse, when I`ve started second vm at second host and tried
> to continue same dd tests simultaneously - performance fairly divided
> by half for each instance :). Enabling jumbo frames, playing with cpu
> affinity for ceph and vm instances and trying different TCP congestion
> protocols gave no effect at all - with DCTCP I have slightly smoother
> network load graph and that`s all.
> 
> Can ml please suggest anything to try to improve performance?

Can you try setting

rbd writeback window = 8192000

or similar, and see what kind of effect that has?  I suspect it'll speed 
up dd; I'm less sure about ext3.

Thanks!
sage
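
A minimal way to try this, assuming the guest's librbd reads the host's
ceph.conf, is a client-side entry on the kvm host (the section and value
below are only a sketch of the suggestion above, not a tested setting):

[client]
        rbd writeback window = 8192000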


> 
> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Mysteriously poor write performance

2012-03-17 Thread Andrey Korolyov
Hi,

I've run some performance tests on the following configuration:

mon0, osd0 and mon1, osd1 - two twelve-core r410s with 32G of RAM; mon2 -
a dom0 with three dedicated cores and 1.5G, mostly idle. The first three
disks on each r410 are arranged into a raid0 that holds the osd data, while
the fourth holds the OS and the osd's journal partition; everything
ceph-related is on ext4 mounted without barriers.
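
An fstab entry matching the barrier-less ext4 mounts described above would
look something like this (device and mount point are placeholders):

# osd data on the raid0 device, barriers disabled as described
/dev/md0  /data/osd.0  ext4  noatime,nobarrier  0 0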

First, I've noticed a gap between benchmark performance and the write speed
seen through rbd from a small kvm instance running on one of the first two
machines - while the bench gave me about 110MB/s, writing zeros to the raw
block device inside the vm with dd topped out at about 45 MB/s, and for the
vm's fs (ext4 with default options) performance drops to ~23MB/s.
Things get worse when I start a second vm on the second host and run the
same dd tests simultaneously - throughput is split roughly in half between
the two instances :). Enabling jumbo frames, playing with cpu affinity for
the ceph and vm processes, and trying different TCP congestion protocols
had no effect at all - with DCTCP I get a slightly smoother network load
graph and that's all.
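
The tuning attempts listed above correspond roughly to commands like these
(interface name, core list, and process selection are placeholders; DCTCP
availability depends on the kernel build):

# jumbo frames on the cluster-facing interface
ip link set dev eth0 mtu 9000
# pin the ceph-osd process to a fixed set of cores
taskset -cp 0-5 $(pidof ceph-osd)
# switch the TCP congestion control algorithm
sysctl -w net.ipv4.tcp_congestion_control=dctcp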

Can the ml please suggest anything to try to improve performance?

ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html