Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-23 Thread Gregory Farnum
On Thu, Aug 22, 2013 at 5:23 PM, Greg Poirier greg.poir...@opower.com wrote:
 On Thu, Aug 22, 2013 at 2:34 PM, Gregory Farnum g...@inktank.com wrote:

 You don't appear to have accounted for the 2x replication (where all
 writes go to two OSDs) in these calculations. I assume your pools have


 Ah. Right. So I should then be looking at:

 # OSDs * Throughput per disk / 2 / repl factor ?

 Which makes 300-400 MB/s aggregate throughput actually sort of reasonable.


 size 2 (or 3?) for these tests. 3 would explain the performance
 difference entirely; 2x replication leaves it still a bit low but
 takes the difference down to ~350/600 instead of ~350/1200. :)


 Yeah. We're doing 2x repl now, and haven't yet made the decision if we're
 going to move to 3x repl or not.


 You mentioned that your average osd bench throughput was ~50MB/s;
 what's the range?


 41.9 - 54.7 MB/s

 The actual average is 47.1 MB/s

Okay. It's important to realize that because Ceph distributes data
pseudorandomly, each OSD is going to end up with about the same amount
of data going to it. If one of your drives is slower than the others,
the fast ones can get backed up waiting on the slow one to acknowledge
writes, so they end up impacting the cluster throughput a
disproportionate amount. :(

Anyway, I'm guessing you have 24 OSDs from your math earlier?
47MB/s * 24 / 2 = 564MB/s
41MB/s * 24 / 2 = 492MB/s

So taking out or reducing the weight on the slow ones might improve
things a little. But that's still quite a ways off from what you're
seeing — there are a lot of things that could be impacting this but
there's probably something fairly obvious with that much of a gap.
What is the exact benchmark you're running? What do your nodes look like?
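(As a concrete sketch of the reweighting suggestion -- the osd id and weight below are only placeholders -- something like

    ceph osd reweight 12 0.8

temporarily shifts data away from a slow OSD without taking it out entirely.)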
-Greg


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-23 Thread Greg Poirier
Ah thanks, Brian. I will do that. I was going off the wiki instructions on
performing rados benchmarks. If I have the time later, I will change it
there.


On Fri, Aug 23, 2013 at 9:37 AM, Brian Andrus brian.and...@inktank.com wrote:

 Hi Greg,


 I haven't had any luck with the seq bench. It just errors every time.


 Can you confirm you are using the --no-cleanup flag with rados write? This
 will ensure there is actually data to read for subsequent seq tests.

 ~Brian
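
For reference, the write-then-read sequence Brian describes would look roughly like this (pool name and 60-second duration are only placeholders):

    rados bench -p rbd 60 write --no-cleanup
    rados bench -p rbd 60 seq

Without --no-cleanup, the write bench deletes its objects when it finishes, so a later seq run has nothing to read back and errors out.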



Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-23 Thread Greg Poirier
On Fri, Aug 23, 2013 at 9:53 AM, Gregory Farnum g...@inktank.com wrote:


 Okay. It's important to realize that because Ceph distributes data
 pseudorandomly, each OSD is going to end up with about the same amount
 of data going to it. If one of your drives is slower than the others,
 the fast ones can get backed up waiting on the slow one to acknowledge
 writes, so they end up impacting the cluster throughput a
 disproportionate amount. :(

 Anyway, I'm guessing you have 24 OSDs from your math earlier?
 47MB/s * 24 / 2 = 564MB/s
 41MB/s * 24 / 2 = 492MB/s


33 OSDs and 3 hosts in the cluster.
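
(For reference, redoing that estimate with 33 OSDs instead of 24: 47 MB/s * 33 / 2 ≈ 775 MB/s expected, which makes the gap to the ~350-400 MB/s I'm seeing even larger.)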


 So taking out or reducing the weight on the slow ones might improve
 things a little. But that's still quite a ways off from what you're
 seeing — there are a lot of things that could be impacting this but
 there's probably something fairly obvious with that much of a gap.
 What is the exact benchmark you're running? What do your nodes look like?


The write benchmark I am running is Fio with the following configuration:

  ioengine: libaio
  iodepth: 16
  runtime: 180
  numjobs: 16
  - name: 128k-500M-write
    description: 128K block 500M write
    bs: 128K
    size: 500M
    rw: write

Sorry for the weird yaml formatting but I'm copying it from the config file
of my automation stuff.
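
Roughly translated into a plain fio job file (my reconstruction of what the automation generates, so the exact layout is a guess), that is:

    [global]
    ioengine=libaio
    iodepth=16
    runtime=180
    numjobs=16

    [128k-500M-write]
    description=128K block 500M write
    bs=128k
    size=500M
    rw=write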

I run that on power-of-two numbers of VMs, up to 32. Each VM is qemu-kvm with a 50 GB
RBD-backed Cinder volume attached; each VM has 2 VCPUs and 4 GB RAM.

The host machines are Dell C6220s: 16 cores (hyperthreaded), 128 GB RAM,
with bonded 10 Gbps NICs (mode 4, 20 Gbps aggregate -- tested and verified
to be working correctly). There are 2 host machines with 16 VMs each.

The Ceph cluster is made up of Dell C6220s with the same NIC setup and CPUs,
256 GB RAM, and 12 disks each (one for the OS, 11 for OSDs).


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Oliver Daudey
Hey Greg,

I encountered a similar problem and we're just in the process of
tracking it down here on the list.  Try downgrading your OSD-binaries to
0.61.8 Cuttlefish and re-test.  If it's significantly faster on RBD,
you're probably experiencing the same problem I have with Dumpling.

PS: Only downgrade your OSDs.  Cuttlefish-monitors don't seem to want to
start with a database that has been touched by a Dumpling-monitor and
don't talk to them, either.

PPS: I've also had OSDs no longer start with an assert while processing
the journal during these upgrade/downgrade-tests, mostly when coming
down from Dumpling to Cuttlefish.  If you encounter those, delete your
journal and re-create with `ceph-osd -i OSD-ID --mkjournal'.  Your
data-store will be OK, as far as I can tell.


   Regards,

 Oliver

On do, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
 I have been benchmarking our Ceph installation for the last week or
 so, and I've come across an issue that I'm having some difficulty
 with.
 
 
 Ceph bench reports reasonable write throughput at the OSD level:
 
 
 ceph tell osd.0 bench
 { bytes_written: 1073741824,
   blocksize: 4194304,
   bytes_per_sec: 47288267.00}
 
 
 Running this across all OSDs produces on average 50-55 MB/s, which is
 fine with us. We were expecting around 100 MB/s / 2 (journal and OSD
 on same disk, separate partitions).
 
 
 What I wasn't expecting was the following:
 
 
 I tested 1, 2, 4, 8, 16, 24, and 32 VMs simultaneously writing
 against 33 OSDs. Aggregate write throughput peaked under 400 MB/s:
 
 
 1  196.013671875
 2  285.8759765625
 4  351.9169921875
 8  386.455078125
 16 363.8583984375
 24 353.6298828125
 32 348.9697265625
 
 
 
 I was hoping to see something closer to # OSDs * Average value for
 ceph bench (approximately 1.2 GB/s peak aggregate write throughput).
 
 
 We're seeing excellent read, randread performance, but writes are a
 bit of a bother.
 
 
 Does anyone have any suggestions?
 
 
 We have 20 Gb/s network
 I used Fio w/ 16 thread concurrency
 We're running Scientific Linux 6.4
 2.6.32 kernel
 Ceph Dumpling 0.67.1-0.el6
 OpenStack Grizzly
 Libvirt 0.10.2
 qemu-kvm 0.12.1.2-2.355.el6.2.cuttlefish
 
 (I'm using qemu-kvm from the ceph-extras repository, which doesn't
 appear to have a -.dumpling version yet).
 
 
 Thanks very much for any assistance.
 
 
 Greg




Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Gregory Farnum
On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey oli...@xs4all.nl wrote:
 Hey Greg,

 I encountered a similar problem and we're just in the process of
 tracking it down here on the list.  Try downgrading your OSD-binaries to
 0.61.8 Cuttlefish and re-test.  If it's significantly faster on RBD,
 you're probably experiencing the same problem I have with Dumpling.

 PS: Only downgrade your OSDs.  Cuttlefish-monitors don't seem to want to
 start with a database that has been touched by a Dumpling-monitor and
 don't talk to them, either.

 PPS: I've also had OSDs no longer start with an assert while processing
 the journal during these upgrade/downgrade-tests, mostly when coming
 down from Dumpling to Cuttlefish.  If you encounter those, delete your
 journal and re-create with `ceph-osd -i OSD-ID --mkjournal'.  Your
 data-store will be OK, as far as I can tell.

Careful — deleting the journal is potentially throwing away updates to
your data store! If this is a problem you should flush the journal
with the dumpling binary before downgrading.
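
A rough sketch of that sequence for a single OSD (the osd id and sysvinit service invocation here are assumptions; the package downgrade step depends on your distro):

    service ceph stop osd.0
    ceph-osd -i 0 --flush-journal    # run this while the dumpling binary is still installed
    # downgrade the ceph packages to 0.61.8, then:
    service ceph start osd.0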



Regards,

  Oliver

 On do, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
 I have been benchmarking our Ceph installation for the last week or
 so, and I've come across an issue that I'm having some difficulty
 with.


 Ceph bench reports reasonable write throughput at the OSD level:


 ceph tell osd.0 bench
 { bytes_written: 1073741824,
   blocksize: 4194304,
   bytes_per_sec: 47288267.00}


 Running this across all OSDs produces on average 50-55 MB/s, which is
 fine with us. We were expecting around 100 MB/s / 2 (journal and OSD
 on same disk, separate partitions).


 What I wasn't expecting was the following:


 I tested 1, 2, 4, 8, 16, 24, and 32 VMs simultaneously writing
 against 33 OSDs. Aggregate write throughput peaked under 400 MB/s:


 1  196.013671875
 2  285.8759765625
 4  351.9169921875
 8  386.455078125
 16 363.8583984375
 24 353.6298828125
 32 348.9697265625



 I was hoping to see something closer to # OSDs * Average value for
 ceph bench (approximately 1.2 GB/s peak aggregate write throughput).


 We're seeing excellent read, randread performance, but writes are a
 bit of a bother.


 Does anyone have any suggestions?
You don't appear to have accounted for the 2x replication (where all
writes go to two OSDs) in these calculations. I assume your pools have
size 2 (or 3?) for these tests. 3 would explain the performance
difference entirely; 2x replication leaves it still a bit low but
takes the difference down to ~350/600 instead of ~350/1200. :)
You mentioned that your average osd bench throughput was ~50MB/s;
what's the range? Have you run any rados bench tests? What is your PG
count across the cluster?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Oliver Daudey
Hey Greg,

Thanks for the tip!  I was assuming a clean shutdown of the OSD would
flush the journal for you and have the OSD exit with its
data-store in a clean state?  Otherwise, I would first have to stop
updates to that particular OSD, then flush the journal, then stop it?


   Regards,

  Oliver

On do, 2013-08-22 at 14:34 -0700, Gregory Farnum wrote:
 On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey oli...@xs4all.nl wrote:
  Hey Greg,
 
  I encountered a similar problem and we're just in the process of
  tracking it down here on the list.  Try downgrading your OSD-binaries to
  0.61.8 Cuttlefish and re-test.  If it's significantly faster on RBD,
  you're probably experiencing the same problem I have with Dumpling.
 
  PS: Only downgrade your OSDs.  Cuttlefish-monitors don't seem to want to
  start with a database that has been touched by a Dumpling-monitor and
  don't talk to them, either.
 
  PPS: I've also had OSDs no longer start with an assert while processing
  the journal during these upgrade/downgrade-tests, mostly when coming
  down from Dumpling to Cuttlefish.  If you encounter those, delete your
  journal and re-create with `ceph-osd -i OSD-ID --mkjournal'.  Your
  data-store will be OK, as far as I can tell.
 
 Careful — deleting the journal is potentially throwing away updates to
 your data store! If this is a problem you should flush the journal
 with the dumpling binary before downgrading.
 
 
 
 Regards,
 
   Oliver
 
  On do, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
  I have been benchmarking our Ceph installation for the last week or
  so, and I've come across an issue that I'm having some difficulty
  with.
 
 
  Ceph bench reports reasonable write throughput at the OSD level:
 
 
  ceph tell osd.0 bench
  { bytes_written: 1073741824,
blocksize: 4194304,
bytes_per_sec: 47288267.00}
 
 
  Running this across all OSDs produces on average 50-55 MB/s, which is
  fine with us. We were expecting around 100 MB/s / 2 (journal and OSD
  on same disk, separate partitions).
 
 
  What I wasn't expecting was the following:
 
 
  I tested 1, 2, 4, 8, 16, 24, and 32 VMs simultaneously writing
  against 33 OSDs. Aggregate write throughput peaked under 400 MB/s:
 
 
  1  196.013671875
  2  285.8759765625
  4  351.9169921875
  8  386.455078125
  16 363.8583984375
  24 353.6298828125
  32 348.9697265625
 
 
 
  I was hoping to see something closer to # OSDs * Average value for
  ceph bench (approximately 1.2 GB/s peak aggregate write throughput).
 
 
  We're seeing excellent read, randread performance, but writes are a
  bit of a bother.
 
 
  Does anyone have any suggestions?
 You don't appear to have accounted for the 2x replication (where all
 writes go to two OSDs) in these calculations. I assume your pools have
 size 2 (or 3?) for these tests. 3 would explain the performance
 difference entirely; 2x replication leaves it still a bit low but
 takes the difference down to ~350/600 instead of ~350/1200. :)
 You mentioned that your average osd bench throughput was ~50MB/s;
 what's the range? Have you run any rados bench tests? What is your PG
 count across the cluster?
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 




Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Oliver Daudey
Hey Greg,

I didn't know that option, but I'm always careful to downgrade and
upgrade the OSDs one by one and wait for the cluster to report healthy
again before proceeding to the next, so, as you said, chances of losing
data should have been minimal.  Will flush the journals too next time.
Thanks!


   Regards,

 Oliver

On do, 2013-08-22 at 14:52 -0700, Gregory Farnum wrote:
 On Thu, Aug 22, 2013 at 2:47 PM, Oliver Daudey oli...@xs4all.nl wrote:
  Hey Greg,
 
  Thanks for the tip!  I was assuming a clean shutdown of the OSD would
  flush the journal for you and have the OSD exit with its
  data-store in a clean state?  Otherwise, I would first have to stop
  updates to that particular OSD, then flush the journal, then stop it?
 
 Nope, clean shutdown doesn't force a flush as it could potentially
 block on the filesystem. --flush-journal is a CLI option, so you would
 turn off the OSD, then run it with that option (it won't join the
 cluster or anything, just look at and update local disk state), then
 downgrade the binary.
 In all likelihood this won't have caused you to lose any data because
 in many/most situations the OSD actually will have written out
 everything in the journal to the local FS before you tell it to shut
 down, and as long as one of the other OSDs either did that or turned
 back on without crashing then it will propagate the newer updates to
 everybody. But wiping the journal without flushing is certainly not
 the sort of thing you should get in the habit of doing.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 




Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Greg Poirier
On Thu, Aug 22, 2013 at 2:34 PM, Gregory Farnum g...@inktank.com wrote:

 You don't appear to have accounted for the 2x replication (where all
  writes go to two OSDs) in these calculations. I assume your pools have


Ah. Right. So I should then be looking at:

# OSDs * Throughput per disk / 2 / repl factor ?

Which makes 300-400 MB/s aggregate throughput actually sort of reasonable.
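
(Filling that in with the numbers from this cluster -- 33 OSDs, ~47 MB/s per-OSD bench -- gives 33 * 47 / 2 / 2 ≈ 388 MB/s, which is roughly where my aggregate numbers landed.)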


 size 2 (or 3?) for these tests. 3 would explain the performance
 difference entirely; 2x replication leaves it still a bit low but
 takes the difference down to ~350/600 instead of ~350/1200. :)


Yeah. We're doing 2x repl now, and haven't yet made the decision if we're
going to move to 3x repl or not.


 You mentioned that your average osd bench throughput was ~50MB/s;
 what's the range?


41.9 - 54.7 MB/s

The actual average is 47.1 MB/s
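
For reference, one way to collect those per-OSD figures in a single pass (osd ids 0 through 32 assumed here) is a loop like:

    for i in $(seq 0 32); do echo osd.$i; ceph tell osd.$i bench; done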


 Have you run any rados bench tests?


Yessir.

rados bench write:

2013-08-23 00:18:51.933594 min lat: 0.071682 max lat: 1.77006 avg lat: 0.196411
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   900      14     73322     73308   325.764       316   0.13978  0.196411
Total time run:     900.239317
Total writes made:  73322
Write size: 4194304
Bandwidth (MB/sec): 325.789

Stddev Bandwidth:   35.102
Max bandwidth (MB/sec): 440
Min bandwidth (MB/sec): 0
Average Latency:0.196436
Stddev Latency: 0.121463
Max latency:1.77006
Min latency:0.071682

I haven't had any luck with the seq bench. It just errors every time.



 What is your PG count across the cluster?


pgmap v18263: 1650 pgs: 1650 active+clean; 946 GB data, 1894 GB used,
28523 GB / 30417 GB avail; 498MB/s wr, 124op/s
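
(If a per-pool breakdown is useful, `ceph osd dump | grep pg_num' lists pg_num for each pool.)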

Thanks again.