speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG
Hello list, I've done some further testing and have the problem that ceph doesn't scale for me. I added a 4th OSD server to my existing 3-node OSD cluster. I also reformatted everything to be able to start with a clean system. While doing random 4k writes from two VMs I see about 8% idle on the osd
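
For reference, a 4k random-write load like the one described could be generated with a fio job along these lines; fio is only an assumption here (the post does not say which benchmark tool was used), and /dev/vdb is a placeholder for the guest's RBD-backed disk:

  # hypothetical fio job for 4k random writes inside a guest;
  # /dev/vdb is a placeholder for the RBD-backed virtio disk
  fio --name=randwrite-4k --filename=/dev/vdb --rw=randwrite --bs=4k \
      --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 \
      --time_based --group_reporting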

Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Alexandre DERUMIER
I see something strange with my tests: 3 nodes (8-core E5420 @ 2.50GHz), 5 OSDs (xfs) per node with 15k drives, journal on tmpfs. KVM guest with cache=writeback or cache=none (same result): random write test with 4k blocks: 5000 iop/s, cpu idle 20%; sequential write test with 4k blocks: 2iop/
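
For context, the cache mode mentioned here is set on the guest's drive definition. A minimal sketch, assuming a qemu-kvm command line with an RBD-backed virtio disk (pool and image names are placeholders, and the rest of the VM definition is omitted):

  # hypothetical qemu-kvm drive options; the 'rbd' pool and 'vm1-disk' image are placeholders
  qemu-system-x86_64 ... \
      -drive file=rbd:rbd/vm1-disk,if=virtio,cache=writeback
  # or, for the other case tested above: cache=none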

Re: OSD Hardware questions

2012-06-29 Thread Mark Nelson
On 6/28/12 4:25 PM, Stefan Priebe wrote: On 28.06.2012 17:33, Sage Weil wrote: Have you tried adjusting 'osd op threads'? The default is 2, but bumping that to, say, 8, might give you better concurrency and throughput. For me this doesn't change anything. I believe the ceph-osd processes are
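
The knob Sage mentions lives in the [osd] section of ceph.conf; a minimal sketch of bumping it to the suggested value (this assumes the setting is applied on every OSD host and the OSDs are restarted afterwards):

  # ceph.conf fragment
  [osd]
      osd op threads = 8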

Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Mark Nelson
On 6/29/12 5:46 AM, Stefan Priebe - Profihost AG wrote: Hello list, I've done some further testing and have the problem that ceph doesn't scale for me. I added a 4th OSD server to my existing 3-node OSD cluster. I also reformatted everything to be able to start with a clean system. While doing random 4

Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG
Some more testing / results:
= lowering CPU cores =
1.) Disabling CPUs via echo 0 > /sys/devices/system/cpu/cpuX/online for cores 4-7 does not change anything.
2.) When only 50% of the CPUs are available, each ceph-osd process takes only half of the CPU load it uses when all are usable.
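
For reference, the core-offlining test described in 1.) can be scripted as follows (a sketch; note that cpu0 usually cannot be taken offline, so only cores 4-7 are touched):

  # take cores 4-7 offline, run the benchmark, then bring them back
  for c in 4 5 6 7; do echo 0 > /sys/devices/system/cpu/cpu$c/online; done
  # ... run the 4k random-write test ...
  for c in 4 5 6 7; do echo 1 > /sys/devices/system/cpu/cpu$c/online; done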

Re: OSD Hardware questions

2012-06-29 Thread Stefan Priebe - Profihost AG
On 29.06.2012 13:37, Mark Nelson wrote: On 6/28/12 4:25 PM, Stefan Priebe wrote: On 28.06.2012 17:33, Sage Weil wrote: Have you tried adjusting 'osd op threads'? The default is 2, but bumping that to, say, 8, might give you better concurrency and throughput. For me this doesn't change any

Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG
On 29.06.2012 13:49, Mark Nelson wrote: I'll try to replicate your findings in house. I've got some other things I have to do today, but hopefully I can take a look next week. If I recall correctly, in the other thread you said that sequential writes are using much less CPU time on your system

Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG
Another BIG hint. While doing random 4k I/O from one VM I achieve 14k I/Os. This is around 54MB/s. But EACH ceph-osd machine is writing between 500MB/s and 750MB/s. What do they write?!?! Just an idea: do they completely rewrite EACH 4MB block for each 4k write? Stefan On 29.06.2012 15:02

Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG
Big sorry. ceph was scrubbing during my last test. Didn't recognize this. When I redo the test I see writes between 20MB/s and 100MB/s. That is OK. Sorry. Stefan On 29.06.2012 15:11, Stefan Priebe - Profihost AG wrote: Another BIG hint. While doing random 4k I/O from one VM I achieve 14
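
A quick way to rule out scrubbing before a benchmark run is to check the PG states in the cluster status; a sketch, assuming the standard ceph CLI on a node with admin access:

  # look for 'scrubbing' among the PG states before starting the test
  ceph -s
  # or watch cluster activity live while the benchmark runs
  ceph -w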

Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG
iostat output via iostat -x -t 5 while doing 4k random writes:
06/29/2012 03:20:55 PM
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
          31,63   0,00   52,64    0,78    0,00  14,95
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %u

Re: Interesting results

2012-06-29 Thread Jim Schutt
On 06/28/2012 04:53 PM, Mark Nelson wrote: On 06/28/2012 05:37 PM, Jim Schutt wrote: Hi, Lots of trouble reports go by on the list - I thought it would be useful to report a success. Using a patch (https://lkml.org/lkml/2012/6/28/446) on top of 3.5-rc4 for my OSD servers, the same kernel for m

Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Sage Weil
On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote: > Am 29.06.2012 13:49, schrieb Mark Nelson: > > I'll try to replicate your findings in house. I've got some other > > things I have to do today, but hopefully I can take a look next week. If > > I recall correctly, in the other thread you sa

RBD support for primary storage in Apache CloudStack

2012-06-29 Thread Wido den Hollander
Hi, I'm cross-posting this to the ceph-devel list since there might be people around here running CloudStack who are interested in this. After a couple of months' worth of work I'm happy to announce that the RBD support for primary storage in CloudStack seems to be reaching a point where it's

Re: Designing a cluster guide

2012-06-29 Thread Gregory Farnum
On Thu, May 17, 2012 at 2:27 PM, Gregory Farnum wrote: > Sorry this got left for so long... > > On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG > wrote: >> Hi, >> >> the "Designing a cluster guide" >> http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it >> still leave

Re: Designing a cluster guide

2012-06-29 Thread Brian Edmonds
On Fri, Jun 29, 2012 at 11:07 AM, Gregory Farnum wrote: >>> the "Designing a cluster guide" >>> http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it >>> still leaves some questions unanswered. Oh, thank you. I've been poking through the Ceph docs, but somehow had not managed to tu

Re: Designing a cluster guide

2012-06-29 Thread Gregory Farnum
On Fri, Jun 29, 2012 at 11:42 AM, Brian Edmonds wrote: > What are the likely and worst case scenarios if the OSD journal were > to simply be on a garden variety ramdisk, no battery backing?  In the > case of a single node losing power, and thus losing some data, surely > Ceph can recognize this, a
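
For context, the journal placement being debated is just a path in ceph.conf; a minimal sketch of pointing it at a tmpfs-backed file (paths and size are hypothetical, and as the thread notes, a journal on tmpfs disappears on power loss):

  # ceph.conf fragment with hypothetical values
  [osd]
      osd journal = /dev/shm/osd.$id.journal
      osd journal size = 1024        # in MB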

Re: ceph performance under xen?

2012-06-29 Thread Gregory Farnum
On Thu, Jun 28, 2012 at 7:27 AM, Brian Edmonds wrote: > I've installed a little, four node Ceph (0.47.2) cluster using Xen > virtual machines for testing, and when I run bonnie against a (kernel > driver) mount of it, it seems to be somewhat flaky (disturbing log > messages, occasional binary deat
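
For reference, a sketch of the setup being described: mounting CephFS with the kernel client and pointing bonnie++ at it (the monitor address, secret, and mount point are placeholders):

  # hypothetical kernel-client mount of CephFS
  mount -t ceph 192.168.0.1:6789:/ /mnt/ceph -o name=admin,secret=<key>
  # run bonnie++ against the mount (bonnie++ requires -u when run as root)
  bonnie++ -d /mnt/ceph -u root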

Re: ceph performance under xen?

2012-06-29 Thread Brian Edmonds
On Fri, Jun 29, 2012 at 11:55 AM, Gregory Farnum wrote: > So right now you're using the Ceph filesystem, rather than RBD, right? Right, CephFS. I'm actually not even very clear on what RBD is, and how one might use it, but I'm sure I'll understand that in the fullness of time. I came to Ceph fr
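
For anyone else wondering: RBD exposes block devices striped over RADOS objects, rather than a POSIX filesystem like CephFS. A minimal sketch of creating and mapping an image with the standard rbd tool (image name and size are placeholders):

  # create a 10 GB image in the default 'rbd' pool and map it via the kernel driver
  modprobe rbd
  rbd create mydisk --size 10240
  rbd map mydisk
  # the image then shows up as /dev/rbd0 (or similar) and can be used like any block device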

Re: Designing a cluster guide

2012-06-29 Thread Brian Edmonds
On Fri, Jun 29, 2012 at 11:50 AM, Gregory Farnum wrote: > If you lose a journal, you lose the OSD. Really? Everything? Not just recent commits? I would have hoped it would just come back up in an old state. Replication should have already been taking care of regaining redundancy for the stuff

Re: ceph performance under xen?

2012-06-29 Thread Gregory Farnum
On Fri, Jun 29, 2012 at 1:54 PM, Brian Edmonds wrote: > On Fri, Jun 29, 2012 at 11:55 AM, Gregory Farnum wrote: >> So right now you're using the Ceph filesystem, rather than RBD, right? > > Right, CephFS.  I'm actually not even very clear on what RBD is, and > how one might use it, but I'm sure I

Re: Designing a cluster guide

2012-06-29 Thread Gregory Farnum
On Fri, Jun 29, 2012 at 1:59 PM, Brian Edmonds wrote: > On Fri, Jun 29, 2012 at 11:50 AM, Gregory Farnum wrote: >> If you lose a journal, you lose the OSD. > > Really?  Everything?  Not just recent commits?  I would have hoped it > would just come back up in an old state.  Replication should have
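
For completeness, recreating a lost journal for a single OSD looks roughly like the sketch below (osd.0 is a placeholder); whether the OSD's data can still be trusted afterwards is exactly what this thread is discussing, so this is not a recommendation:

  # hypothetical recovery attempt for osd.0 after its journal device is gone
  service ceph stop osd.0
  ceph-osd -i 0 --mkjournal
  service ceph start osd.0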

Re: ceph performance under xen?

2012-06-29 Thread Brian Edmonds
On Fri, Jun 29, 2012 at 2:06 PM, Gregory Farnum wrote: > Okay, there's two things I'd do here. First, create a cluster that > only has one MDS — the multi-MDS system is significantly less stable. Ok, will do. > Second, you've got 3 monitors doing frequent fsyncs, and 4 OSDs doing > frequent sync
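
Greg's first suggestion comes down to the ceph.conf layout: define only one MDS section instead of one per node. A minimal sketch (the section name and host are placeholders):

  # ceph.conf fragment: a single MDS
  [mds.a]
      host = node1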

Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe
On 29.06.2012 17:28, Sage Weil wrote: On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote: On 29.06.2012 13:49, Mark Nelson wrote: I'll try to replicate your findings in house. I've got some other things I have to do today, but hopefully I can take a look next week. If I recall correct

Re: Designing a cluster guide

2012-06-29 Thread Brian Edmonds
On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum wrote: > Well, actually this depends on the filesystem you're using. With > btrfs, the OSD will roll back to a consistent state, but you don't > know how out-of-date that state is. Ok, so assuming btrfs, then a single machine failure with a ramdisk

Re: Designing a cluster guide

2012-06-29 Thread Gregory Farnum
On Fri, Jun 29, 2012 at 2:18 PM, Brian Edmonds wrote: > On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum wrote: >> Well, actually this depends on the filesystem you're using. With >> btrfs, the OSD will roll back to a consistent state, but you don't >> know how out-of-date that state is. > > Ok, s

Re: Designing a cluster guide

2012-06-29 Thread Sage Weil
On Fri, 29 Jun 2012, Brian Edmonds wrote: > On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum wrote: > > Well, actually this depends on the filesystem you're using. With > > btrfs, the OSD will roll back to a consistent state, but you don't > > know how out-of-date that state is. > > Ok, so assumin