Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-07 Thread Christian Balzer
On Thu, 08 May 2014 08:41:54 +0200 (CEST) Alexandre DERUMIER wrote:

> Stupid question: Is your Areca 4GB cache shared between the SSD journals and
> the OSDs?
> 
Not a stupid question. 
I made that mistake about 3 years ago in a DRBD setup, OS and activity log
SSDs on the same controller as the storage disks.

> or only used by the OSDs?
>
Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
I keep repeating myself, neither the journal devices nor the OSDs seem to
be under any particular load or pressure (utilization) according to iostat
and atop during the tests.

Christian
  
> 
> 
> - Mail original - 
> 
> De: "Christian Balzer"  
> À: ceph-users@lists.ceph.com 
> Envoyé: Jeudi 8 Mai 2014 08:26:33 
> Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
> devices 
> 
> 
> Hello, 
> 
> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: 
> 
> > Oh, I didn't notice that. I bet you aren't getting the expected 
> > throughput on the RAID array with OSD access patterns, and that's 
> > applying back pressure on the journal. 
> > 
> I doubt that based on what I see in terms of local performance and
> actual utilization figures according to iostat and atop during the
> tests. 
> 
> But if that were to be true, how would one see if that's the case, as in 
> where in the plethora of data from: 
> 
> ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump 
> 
> is the data I'd be looking for? 
> 
> > When I suggested other tests, I meant with and without Ceph. One 
> > particular one is OSD bench. That should be interesting to try at a 
> > variety of block sizes. You could also try runnin RADOS bench and 
> > smalliobench at a few different sizes. 
> > 
> I already did the local tests, as in w/o Ceph, see the original mail
> below. 
> 
> And you might recall me doing rados benches as well in another thread 2 
> weeks ago or so. 
> 
> In either case, osd benching gives me: 
> --- 
> # time ceph tell osd.0 bench 
> { "bytes_written": 1073741824, 
> "blocksize": 4194304, 
> "bytes_per_sec": "247102026.00"} 
> 
> 
> real 0m4.483s 
> --- 
> This is quite a bit slower than this particular SSD (200GB DC 3700)
> should be able to write, but I will let that slide. 
> Note that it is the journal SSD that gets under pressure here (nearly
> 900% util) while the OSD is bored at around 15%. Which is no surprise,
> as it can write data at up to 1600MB/s. 
> 
> at 4k blocks we see: 
> --- 
> # time ceph tell osd.0 bench 1073741824 4096 
> { "bytes_written": 1073741824, 
> "blocksize": 4096, 
> "bytes_per_sec": "9004316.00"} 
> 
> 
> real 1m59.368s 
> --- 
> Here we get a more balanced picture between journal and storage 
> utilization, hovering around 40-50%. 
> So clearly not overtaxing either component. 
> But yet, this looks like 2100 IOPS to me, if my math is half right. 
> 
> Rados at 4k gives us this: 
> --- 
> Total time run: 30.912786 
> Total writes made: 44490 
> Write size: 4096 
> Bandwidth (MB/sec): 5.622 
> 
> Stddev Bandwidth: 3.31452 
> Max bandwidth (MB/sec): 9.92578 
> Min bandwidth (MB/sec): 0 
> Average Latency: 0.0444653 
> Stddev Latency: 0.121887 
> Max latency: 2.80917 
> Min latency: 0.001958 
> --- 
> So this is even worse, just about 1500 IOPS. 
> 
> Regards, 
> 
> Christian 
> 
> > -Greg 
> > 
> > On Wednesday, May 7, 2014, Alexandre DERUMIER  
> > wrote: 
> > 
> > > Hi Christian, 
> > > 
> > > Do you have tried without raid6, to have more osd ? 
> > > (how many disks do you have begin the raid6 ?) 
> > > 
> > > 
> > > Aslo, I known that direct ios can be quite slow with ceph, 
> > > 
> > > maybe can you try without --direct=1 
> > > 
> > > and also enable rbd_cache 
> > > 
> > > ceph.conf 
> > > [client] 
> > > rbd cache = true 
> > > 
> > > 
> > > 
> > > 
> > > - Mail original - 
> > > 
> > > De: "Christian Balzer" > 
> > > À: "Gregory Farnum" >, 
> > > ceph-users@lists.ceph.com  
> > > Envoyé: Jeudi 8 Mai 2014 04:49:16 
> > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and 
> > > backing devices 
> > > 
> > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: 
> > > 
> > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer 
> > > > > 
> > > wrote: 
> > > > > 
> > > > > Hello, 
> > > > > 
> > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
> > > > > The journals are on (separate) DC 3700s, the actual OSDs are
> > > > > RAID6 behind an Areca 1882 with 4GB of cache. 
> > > > > 
> > > > > Running this fio: 
> > > > > 
> > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k 
> > > > > --iodepth=128 
> > > > > 
> > > > > results in: 
> > > > > 
> > > > > 30k IOPS on the journal SSD (as expected) 
> > > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise 
> > > > > there) 3200 IOPS from a VM using userspace RBD 
> > > > > 2900 IOPS from a host kernelspace mounted RBD 
> > > > > 
> > > > > When running the fio from the VM RBD t

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-07 Thread Alexandre DERUMIER
Stupid question: Is your Areca 4GB cache shared between the SSD journals and the OSDs?

or only used by the OSDs?



- Mail original - 

De: "Christian Balzer"  
À: ceph-users@lists.ceph.com 
Envoyé: Jeudi 8 Mai 2014 08:26:33 
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
devices 


Hello, 

On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: 

> Oh, I didn't notice that. I bet you aren't getting the expected 
> throughput on the RAID array with OSD access patterns, and that's 
> applying back pressure on the journal. 
> 
I doubt that based on what I see in terms of local performance and actual 
utilization figures according to iostat and atop during the tests. 

But if that were to be true, how would one see if that's the case, as in 
where in the plethora of data from: 

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump 

is the data I'd be looking for? 

> When I suggested other tests, I meant with and without Ceph. One 
> particular one is OSD bench. That should be interesting to try at a 
> variety of block sizes. You could also try runnin RADOS bench and 
> smalliobench at a few different sizes. 
> 
I already did the local tests, as in w/o Ceph, see the original mail below. 

And you might recall me doing rados benches as well in another thread 2 
weeks ago or so. 

In either case, osd benching gives me: 
--- 
# time ceph tell osd.0 bench 
{ "bytes_written": 1073741824, 
"blocksize": 4194304, 
"bytes_per_sec": "247102026.00"} 


real 0m4.483s 
--- 
This is quite a bit slower than this particular SSD (200GB DC 3700) should 
be able to write, but I will let that slide. 
Note that it is the journal SSD that gets under pressure here (nearly 900% 
util) while the OSD is bored at around 15%. Which is no surprise, as it 
can write data at up to 1600MB/s. 

at 4k blocks we see: 
--- 
# time ceph tell osd.0 bench 1073741824 4096 
{ "bytes_written": 1073741824, 
"blocksize": 4096, 
"bytes_per_sec": "9004316.00"} 


real 1m59.368s 
--- 
Here we get a more balanced picture between journal and storage 
utilization, hovering around 40-50%. 
So clearly not overtaxing either component. 
But yet, this looks like 2100 IOPS to me, if my math is half right. 

Rados at 4k gives us this: 
--- 
Total time run: 30.912786 
Total writes made: 44490 
Write size: 4096 
Bandwidth (MB/sec): 5.622 

Stddev Bandwidth: 3.31452 
Max bandwidth (MB/sec): 9.92578 
Min bandwidth (MB/sec): 0 
Average Latency: 0.0444653 
Stddev Latency: 0.121887 
Max latency: 2.80917 
Min latency: 0.001958 
--- 
So this is even worse, just about 1500 IOPS. 

Regards, 

Christian 

> -Greg 
> 
> On Wednesday, May 7, 2014, Alexandre DERUMIER  
> wrote: 
> 
> > Hi Christian, 
> > 
> > Do you have tried without raid6, to have more osd ? 
> > (how many disks do you have begin the raid6 ?) 
> > 
> > 
> > Aslo, I known that direct ios can be quite slow with ceph, 
> > 
> > maybe can you try without --direct=1 
> > 
> > and also enable rbd_cache 
> > 
> > ceph.conf 
> > [client] 
> > rbd cache = true 
> > 
> > 
> > 
> > 
> > - Mail original - 
> > 
> > De: "Christian Balzer" > 
> > À: "Gregory Farnum" >, 
> > ceph-users@lists.ceph.com  
> > Envoyé: Jeudi 8 Mai 2014 04:49:16 
> > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and 
> > backing devices 
> > 
> > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: 
> > 
> > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer 
> > > > 
> > wrote: 
> > > > 
> > > > Hello, 
> > > > 
> > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The 
> > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 
> > > > behind an Areca 1882 with 4GB of cache. 
> > > > 
> > > > Running this fio: 
> > > > 
> > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k 
> > > > --iodepth=128 
> > > > 
> > > > results in: 
> > > > 
> > > > 30k IOPS on the journal SSD (as expected) 
> > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise 
> > > > there) 3200 IOPS from a VM using userspace RBD 
> > > > 2900 IOPS from a host kernelspace mounted RBD 
> > > > 
> > > > When running the fio from the VM RBD the utilization of the 
> > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2% 
> > > > (1500 IOPS after some obvious merging). 
> > > > The OSD processes are quite busy, reading well over 200% on atop, 
> > > > but the system is not CPU or otherwise resource starved at that 
> > > > moment. 
> > > > 
> > > > Running multiple instances of this test from several VMs on 
> > > > different hosts changes nothing, as in the aggregated IOPS for the 
> > > > whole cluster will still be around 3200 IOPS. 
> > > > 
> > > > Now clearly RBD has to deal with latency here, but the network is 
> > > > IPoIB with the associated low latency and the journal SSDs are the 
> > > > (consistently) fasted ones around. 
> > > > 
> > > > I guess what I am wondering ab

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-07 Thread Christian Balzer

Hello,

On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:

> Oh, I didn't notice that. I bet you aren't getting the expected
> throughput on the RAID array with OSD access patterns, and that's
> applying back pressure on the journal.
>
I doubt that based on what I see in terms of local performance and actual
utilization figures according to iostat and atop during the tests.

But if that were true, how would one actually see it? That is, where in
the plethora of data from: 

 ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

is the data I'd be looking for?
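For reference, a minimal sketch of where one could look first (jq and the 
exact counter names are assumptions here -- they vary between releases -- but 
on a 0.72 FileStore OSD the journal and apply latencies are the obvious places 
to spot back pressure from the backing store): 
--- 
# per-daemon counters: journal write latency vs. filestore apply latency 
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | \ 
  jq '.filestore | {journal_latency, apply_latency}' 

# coarser cluster-wide view, commit/apply latency per OSD in ms 
ceph osd perf 
--- 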

> When I suggested other tests, I meant with and without Ceph. One
> particular one is OSD bench. That should be interesting to try at a
> variety of block sizes. You could also try runnin RADOS bench and
> smalliobench at a few different sizes.
>
I already did the local tests, as in w/o Ceph, see the original mail below.

And you might recall me doing rados benches as well in another thread 2
weeks ago or so.

In either case, osd benching gives me:
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": "247102026.00"}


real    0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700) should
be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly 900%
util) while the OSD is bored at around 15%. Which is no surprise, as it
can write data at up to 1600MB/s. 

at 4k blocks we see:
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
  "blocksize": 4096,
  "bytes_per_sec": "9004316.00"}


real    1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%. 
So clearly not overtaxing either component. 
And yet, this looks like only about 2100 IOPS to me, if my math is half right.

Rados at 4k gives us this:
---
Total time run:         30.912786
Total writes made:      44490
Write size:             4096
Bandwidth (MB/sec):     5.622

Stddev Bandwidth:       3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency:        0.0444653
Stddev Latency:         0.121887
Max latency:            2.80917
Min latency:            0.001958
--- 
So this is even worse, just about 1500 IOPS. 
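For reference, a back-of-the-envelope check of the two figures above, plus a 
guess at the rados bench invocation (only the results were posted, so the pool 
name and thread count below are assumptions): 
--- 
# osd bench at 4k: bytes_per_sec / blocksize 
echo $((9004316 / 4096))          # ~2198 IOPS 

# rados bench at 4k: writes / runtime 
echo "44490 / 30.912786" | bc     # ~1439 IOPS 

# an invocation along these lines would produce the 4k rados numbers above 
rados -p rbd bench 30 write -b 4096 -t 16 
--- 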

Regards,

Christian

> -Greg
> 
> On Wednesday, May 7, 2014, Alexandre DERUMIER 
> wrote:
> 
> > Hi Christian,
> >
> > Do you have tried without raid6, to have more osd ?
> > (how many disks do you have begin the raid6 ?)
> >
> >
> > Aslo, I known that direct ios can be quite slow with ceph,
> >
> > maybe can you try without --direct=1
> >
> > and also enable rbd_cache
> >
> > ceph.conf
> > [client]
> > rbd cache = true
> >
> >
> >
> >
> > - Mail original -
> >
> > De: "Christian Balzer" >
> > À: "Gregory Farnum" >,
> > ceph-users@lists.ceph.com 
> > Envoyé: Jeudi 8 Mai 2014 04:49:16
> > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
> > backing devices
> >
> > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
> >
> > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
> > > >
> > wrote:
> > > >
> > > > Hello,
> > > >
> > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
> > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6
> > > > behind an Areca 1882 with 4GB of cache.
> > > >
> > > > Running this fio:
> > > >
> > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
> > > > --iodepth=128
> > > >
> > > > results in:
> > > >
> > > > 30k IOPS on the journal SSD (as expected)
> > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise
> > > > there) 3200 IOPS from a VM using userspace RBD
> > > > 2900 IOPS from a host kernelspace mounted RBD
> > > >
> > > > When running the fio from the VM RBD the utilization of the
> > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
> > > > (1500 IOPS after some obvious merging).
> > > > The OSD processes are quite busy, reading well over 200% on atop,
> > > > but the system is not CPU or otherwise resource starved at that
> > > > moment.
> > > >
> > > > Running multiple instances of this test from several VMs on
> > > > different hosts changes nothing, as in the aggregated IOPS for the
> > > > whole cluster will still be around 3200 IOPS.
> > > >
> > > > Now clearly RBD has to deal with latency here, but the network is
> > > > IPoIB with the associated low latency and the journal SSDs are the
> > > > (consistently) fasted ones around.
> > > >
> > > > I guess what I am wondering about is if this is normal and to be
> > > > expected or if not where all that potential performance got lost.
> > >
> > > Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
> > Yes, but going down to 32 doesn't change things one iota.
> > Also note the multiple instances I mention up there, so that would be
> > 256 IOs at a time

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-07 Thread Christian Balzer

Hello,

On Thu, 08 May 2014 06:33:51 +0200 (CEST) Alexandre DERUMIER wrote:

> Hi Christian,
> 
> Have you tried without RAID6, to have more OSDs?
No and that is neither an option nor the reason for any performance issues
here.
If you re-read my original mail it clearly states that the same fio can
achieve 110k IOPS on that RAID and that it is not busy at all during the
test.

> (how many disks do you have behind the RAID6?)
> 
11 per OSD. 
This will affect the amount of sustainable IOPS of course, but in this
test case every last bit should (and does) fit into the caches.

From the RBD client the transaction should be finished once the primary
and secondary OSD for the PG in question have ACK'ed things.

> 
> Also, I know that direct I/Os can be quite slow with Ceph,
> 
> maybe you can try without --direct=1 
> 
I can, but that is not the test case here. 
For the record that pushes it to 12k IOPS, with the journal SSDs reaching
about 30% utilization and the actual OSDs up to 5%. 
So, much better, but still quite some room for improvement.

> and also enable rbd_cache
> 
> ceph.conf
> [client]
> rbd cache = true
> 
I have that set of course, as well as specifically "writeback" for the KVM
instance in question.

Interestingly I see no difference at all with a KVM instance that is set
explicitly to "none", but that's not part of this particular inquiry
either.
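
For reference, a sketch of the relevant libvirt disk stanza for an rbd-backed
KVM guest with writeback caching (pool/image and monitor names are
placeholders, not taken from this cluster):
---
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='rbd/vm-disk-1'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
---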

Christian
> 
> 
> 
> - Mail original - 
> 
> De: "Christian Balzer"  
> À: "Gregory Farnum" , ceph-users@lists.ceph.com 
> Envoyé: Jeudi 8 Mai 2014 04:49:16 
> Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
> devices 
> 
> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: 
> 
> > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer 
> > wrote: 
> > > 
> > > Hello, 
> > > 
> > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The 
> > > journals are on (separate) DC 3700s, the actual OSDs are RAID6
> > > behind an Areca 1882 with 4GB of cache. 
> > > 
> > > Running this fio: 
> > > 
> > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
> > > --iodepth=128 
> > > 
> > > results in: 
> > > 
> > > 30k IOPS on the journal SSD (as expected) 
> > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise 
> > > there) 3200 IOPS from a VM using userspace RBD 
> > > 2900 IOPS from a host kernelspace mounted RBD 
> > > 
> > > When running the fio from the VM RBD the utilization of the journals
> > > is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS
> > > after some obvious merging). 
> > > The OSD processes are quite busy, reading well over 200% on atop,
> > > but the system is not CPU or otherwise resource starved at that
> > > moment. 
> > > 
> > > Running multiple instances of this test from several VMs on
> > > different hosts changes nothing, as in the aggregated IOPS for the
> > > whole cluster will still be around 3200 IOPS. 
> > > 
> > > Now clearly RBD has to deal with latency here, but the network is
> > > IPoIB with the associated low latency and the journal SSDs are the 
> > > (consistently) fasted ones around. 
> > > 
> > > I guess what I am wondering about is if this is normal and to be 
> > > expected or if not where all that potential performance got lost. 
> > 
> > Hmm, with 128 IOs at a time (I believe I'm reading that correctly?) 
> Yes, but going down to 32 doesn't change things one iota. 
> Also note the multiple instances I mention up there, so that would be
> 256 IOs at a time, coming from different hosts over different links and 
> nothing changes. 
> 
> > that's about 40ms of latency per op (for userspace RBD), which seems 
> > awfully long. You should check what your client-side objecter settings 
> > are; it might be limiting you to fewer outstanding ops than that. 
> 
> Googling for client-side objecter gives a few hits on ceph devel and
> bugs and nothing at all as far as configuration options are concerned. 
> Care to enlighten me where one can find those? 
> 
> Also note the kernelspace (3.13 if it matters) speed, which is very much 
> in the same (junior league) ballpark. 
> 
> > If 
> > it's available to you, testing with Firefly or even master would be 
> > interesting — there's some performance work that should reduce 
> > latencies. 
> > 
> Not an option, this is going into production next week. 
> 
> > But a well-tuned (or even default-tuned, I thought) Ceph cluster 
> > certainly doesn't require 40ms/op, so you should probably run a wider 
> > array of experiments to try and figure out where it's coming from. 
> 
> I think we can rule out the network, NPtcp gives me: 
> --- 
> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec 
> --- 
> 
> For comparison at about 512KB it reaches maximum throughput and still 
> isn't that laggy: 
> --- 
> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec 
> --- 
> 
> So with the network performing as well as my lengthy exp

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-07 Thread Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected throughput
on the RAID array with OSD access patterns, and that's applying back
pressure on the journal.

When I suggested other tests, I meant with and without Ceph. One particular
one is OSD bench. That should be interesting to try at a variety of block
sizes. You could also try running RADOS bench and smalliobench at a few
different sizes.
-Greg

On Wednesday, May 7, 2014, Alexandre DERUMIER  wrote:

> Hi Christian,
>
> Do you have tried without raid6, to have more osd ?
> (how many disks do you have begin the raid6 ?)
>
>
> Aslo, I known that direct ios can be quite slow with ceph,
>
> maybe can you try without --direct=1
>
> and also enable rbd_cache
>
> ceph.conf
> [client]
> rbd cache = true
>
>
>
>
> - Mail original -
>
> De: "Christian Balzer" >
> À: "Gregory Farnum" >,
> ceph-users@lists.ceph.com 
> Envoyé: Jeudi 8 Mai 2014 04:49:16
> Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
> devices
>
> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
>
> > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer 
> > >
> wrote:
> > >
> > > Hello,
> > >
> > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
> > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
> > > an Areca 1882 with 4GB of cache.
> > >
> > > Running this fio:
> > >
> > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
> > >
> > > results in:
> > >
> > > 30k IOPS on the journal SSD (as expected)
> > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise
> > > there) 3200 IOPS from a VM using userspace RBD
> > > 2900 IOPS from a host kernelspace mounted RBD
> > >
> > > When running the fio from the VM RBD the utilization of the journals is
> > > about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
> > > some obvious merging).
> > > The OSD processes are quite busy, reading well over 200% on atop, but
> > > the system is not CPU or otherwise resource starved at that moment.
> > >
> > > Running multiple instances of this test from several VMs on different
> > > hosts changes nothing, as in the aggregated IOPS for the whole cluster
> > > will still be around 3200 IOPS.
> > >
> > > Now clearly RBD has to deal with latency here, but the network is IPoIB
> > > with the associated low latency and the journal SSDs are the
> > > (consistently) fasted ones around.
> > >
> > > I guess what I am wondering about is if this is normal and to be
> > > expected or if not where all that potential performance got lost.
> >
> > Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
> Yes, but going down to 32 doesn't change things one iota.
> Also note the multiple instances I mention up there, so that would be 256
> IOs at a time, coming from different hosts over different links and
> nothing changes.
>
> > that's about 40ms of latency per op (for userspace RBD), which seems
> > awfully long. You should check what your client-side objecter settings
> > are; it might be limiting you to fewer outstanding ops than that.
>
> Googling for client-side objecter gives a few hits on ceph devel and bugs
> and nothing at all as far as configuration options are concerned.
> Care to enlighten me where one can find those?
>
> Also note the kernelspace (3.13 if it matters) speed, which is very much
> in the same (junior league) ballpark.
>
> > If
> > it's available to you, testing with Firefly or even master would be
> > interesting — there's some performance work that should reduce
> > latencies.
> >
> Not an option, this is going into production next week.
>
> > But a well-tuned (or even default-tuned, I thought) Ceph cluster
> > certainly doesn't require 40ms/op, so you should probably run a wider
> > array of experiments to try and figure out where it's coming from.
>
> I think we can rule out the network, NPtcp gives me:
> ---
> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
> ---
>
> For comparison at about 512KB it reaches maximum throughput and still
> isn't that laggy:
> ---
> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
> ---
>
> So with the network performing as well as my lengthy experience with IPoIB
> led me to believe, what else is there to look at?
> The storage nodes perform just as expected, indicated by the local fio
> tests.
>
> That pretty much leaves only Ceph/RBD to look at and I'm not really sure
> what experiments I should run on that. ^o^
>
> Regards,
>
> Christian
>
> > -Greg
> > Software Engineer #42 @ http://inktank.com | http://ceph.com
> >
>
>
> --
> Christian Balzer Network/Systems Engineer
> ch...@gol.com  Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Software Engineer #42 @

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-07 Thread Alexandre DERUMIER
Hi Christian,

Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)


Also, I know that direct I/Os can be quite slow with Ceph,

maybe you can try without --direct=1 

and also enable rbd_cache

ceph.conf
[client]
rbd cache = true




- Mail original - 

De: "Christian Balzer"  
À: "Gregory Farnum" , ceph-users@lists.ceph.com 
Envoyé: Jeudi 8 Mai 2014 04:49:16 
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
devices 

On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: 

> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer  wrote: 
> > 
> > Hello, 
> > 
> > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The 
> > journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind 
> > an Areca 1882 with 4GB of cache. 
> > 
> > Running this fio: 
> > 
> > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 
> > 
> > results in: 
> > 
> > 30k IOPS on the journal SSD (as expected) 
> > 110k IOPS on the OSD (it fits neatly into the cache, no surprise 
> > there) 3200 IOPS from a VM using userspace RBD 
> > 2900 IOPS from a host kernelspace mounted RBD 
> > 
> > When running the fio from the VM RBD the utilization of the journals is 
> > about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after 
> > some obvious merging). 
> > The OSD processes are quite busy, reading well over 200% on atop, but 
> > the system is not CPU or otherwise resource starved at that moment. 
> > 
> > Running multiple instances of this test from several VMs on different 
> > hosts changes nothing, as in the aggregated IOPS for the whole cluster 
> > will still be around 3200 IOPS. 
> > 
> > Now clearly RBD has to deal with latency here, but the network is IPoIB 
> > with the associated low latency and the journal SSDs are the 
> > (consistently) fasted ones around. 
> > 
> > I guess what I am wondering about is if this is normal and to be 
> > expected or if not where all that potential performance got lost. 
> 
> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?) 
Yes, but going down to 32 doesn't change things one iota. 
Also note the multiple instances I mention up there, so that would be 256 
IOs at a time, coming from different hosts over different links and 
nothing changes. 

> that's about 40ms of latency per op (for userspace RBD), which seems 
> awfully long. You should check what your client-side objecter settings 
> are; it might be limiting you to fewer outstanding ops than that. 

Googling for client-side objecter gives a few hits on ceph devel and bugs 
and nothing at all as far as configuration options are concerned. 
Care to enlighten me where one can find those? 

Also note the kernelspace (3.13 if it matters) speed, which is very much 
in the same (junior league) ballpark. 

> If 
> it's available to you, testing with Firefly or even master would be 
> interesting — there's some performance work that should reduce 
> latencies. 
> 
Not an option, this is going into production next week. 

> But a well-tuned (or even default-tuned, I thought) Ceph cluster 
> certainly doesn't require 40ms/op, so you should probably run a wider 
> array of experiments to try and figure out where it's coming from. 

I think we can rule out the network, NPtcp gives me: 
--- 
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec 
--- 

For comparison at about 512KB it reaches maximum throughput and still 
isn't that laggy: 
--- 
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec 
--- 

So with the network performing as well as my lengthy experience with IPoIB 
led me to believe, what else is there to look at? 
The storage nodes perform just as expected, indicated by the local fio 
tests. 

That pretty much leaves only Ceph/RBD to look at and I'm not really sure 
what experiments I should run on that. ^o^ 

Regards, 

Christian 

> -Greg 
> Software Engineer #42 @ http://inktank.com | http://ceph.com 
> 


-- 
Christian Balzer Network/Systems Engineer 
ch...@gol.com Global OnLine Japan/Fusion Communications 
http://www.gol.com/ 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help -Ceph deployment in Single node Like Devstack

2014-05-07 Thread Neil Levine
Loic's micro-osd.sh script is as close to single push button as it gets:

http://dachary.org/?p=2374

Not exactly a production cluster but it at least allows you to start
experimenting on the CLI.

Neil

On Wed, May 7, 2014 at 7:56 PM, Patrick McGarry  wrote:
> Hey,
>
> Sorry for the delay, I have been traveling in Asia.  This question
> should probably go to the ceph-user list (cc'd).
>
> Right now there is no single push-button deployment for Ceph like
> devstack (that I'm aware of)...but we have several options in terms of
> orchestration and deployment (including our own ceph-deploy featured
> in the doc).
>
> A good place to see the package options is http://ceph.com/get
>
> Sorry I couldn't give you an exact answer, but I think Ceph is pretty
> approachable in terms of deployment for experimentation.  Hope that
> helps.
>
>
>
> Best Regards,
>
> Patrick McGarry
> Director, Community || Inktank
> http://ceph.com  ||  http://inktank.com
> @scuttlemonkey || @ceph || @inktank
>
>
> On Wed, Apr 30, 2014 at 2:05 AM, Pandiyan M  wrote:
>>
>> Hi,
>>
>> I am looking for a simple Ceph installation like devstack (for OpenStack,
>> one package contains it all); it should support Ceph and Puppet and provide
>> the functionality of a full Ceph cluster. Please help me out.
>>
>> Thanks in Advance !!
>> --
>> PANDIYAN MUTHURAMAN
>>
>> Mobile : + 91 9600-963-436   (Personal)
>>   +91 7259-031-872  (Official)
>>
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep-Scrub Scheduling

2014-05-07 Thread Aaron Ten Clay
Mike,

You can find the last scrub info for a given PG with "ceph pg x.yy query".
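
For example (the PG id is a placeholder; the stamp field names may differ
slightly between releases):
---
ceph pg 2.1f query | grep -E 'last_(deep_)?scrub_stamp'
---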

-Aaron


On Wed, May 7, 2014 at 8:47 PM, Mike Dawson wrote:

> Perhaps, but if that were the case, would you expect the max concurrent
> number of deep-scrubs to approach the number of OSDs in the cluster?
>
> I have 72 OSDs in this cluster and concurrent deep-scrubs seem to peak at
> a max of 12. Do pools (two in use) and replication settings (3 copies in
> both pools) factor in?
>
> 72 OSDs / (2 pools * 3 copies) = 12 max concurrent deep-scrubs
>
> That seems plausible (without looking at the code).
>
> But, if I 'ceph osd set nodeep-scrub' then 'ceph osd unset nodeep-scrub',
> the count of concurrent deep-scrubs doesn't resume the high level, but
> rather stays low seemingly for days at a time, until the next onslaught. If
> driven by the max scrub interval, shouldn't it jump quickly back up?
>
> Is there way to find the last scrub time for a given PG via the CLI to
> know for sure?
>
> Thanks,
> Mike Dawson
>
>
> On 5/7/2014 10:59 PM, Gregory Farnum wrote:
>
>> Is it possible you're running into the max scrub intervals and jumping
>> up to one-per-OSD from a much lower normal rate?
>>
>> On Wednesday, May 7, 2014, Mike Dawson > > wrote:
>>
>> My write-heavy cluster struggles under the additional load created
>> by deep-scrub from time to time. As I have instrumented the cluster
>> more, it has become clear that there is something I cannot explain
>> happening in the scheduling of PGs to undergo deep-scrub.
>>
>> Please refer to these images [0][1] to see two graphical
>> representations of how deep-scrub goes awry in my cluster. These
>> were two separate incidents. Both show a period of "happy" scrub and
>> deep-scrubs and stable writes/second across the cluster, then an
>> approximately 5x jump in concurrent deep-scrubs where client IO is
>> cut by nearly 50%.
>>
>> The first image (deep-scrub-issue1.jpg) shows a happy cluster with
>> low numbers of scrub and deep-scrub running until about 10pm, then
>> something triggers deep-scrubs to increase about 5x and remain high
>> until I manually 'ceph osd set nodeep-scrub' at approx 10am. During
>> the time of higher concurrent deep-scrubs, IOPS drop significantly
>> due to OSD spindle contention preventing qemu/rbd clients from
>> writing like normal.
>>
>> The second image (deep-scrub-issue2.jpg) shows a similar approx 5x
>> jump in concurrent deep-scrubs and associated drop in writes/second.
>> This image also adds a summary of the 'dump historic ops' which show
>> the to be expected jump in the slowest ops in the cluster.
>>
>> Does anyone have an idea of what is happening when the spike in
>> concurrent deep-scrub occurs and how to prevent the adverse effects,
>> outside of disabling deep-scrub permanently?
>>
>> 0: http://www.mikedawson.com/deep-scrub-issue1.jpg
>> 1: http://www.mikedawson.com/deep-scrub-issue2.jpg
>>
>> Thanks,
>> Mike Dawson
>> _
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep-Scrub Scheduling

2014-05-07 Thread Mike Dawson
Perhaps, but if that were the case, would you expect the max concurrent 
number of deep-scrubs to approach the number of OSDs in the cluster?


I have 72 OSDs in this cluster and concurrent deep-scrubs seem to peak 
at a max of 12. Do pools (two in use) and replication settings (3 copies 
in both pools) factor in?


72 OSDs / (2 pools * 3 copies) = 12 max concurrent deep-scrubs

That seems plausible (without looking at the code).

But, if I 'ceph osd set nodeep-scrub' then 'ceph osd unset 
nodeep-scrub', the count of concurrent deep-scrubs doesn't resume the 
high level, but rather stays low seemingly for days at a time, until the 
next onslaught. If driven by the max scrub interval, shouldn't it jump 
quickly back up?


Is there a way to find the last scrub time for a given PG via the CLI to 
know for sure?


Thanks,
Mike Dawson

On 5/7/2014 10:59 PM, Gregory Farnum wrote:

Is it possible you're running into the max scrub intervals and jumping
up to one-per-OSD from a much lower normal rate?

On Wednesday, May 7, 2014, Mike Dawson wrote:

My write-heavy cluster struggles under the additional load created
by deep-scrub from time to time. As I have instrumented the cluster
more, it has become clear that there is something I cannot explain
happening in the scheduling of PGs to undergo deep-scrub.

Please refer to these images [0][1] to see two graphical
representations of how deep-scrub goes awry in my cluster. These
were two separate incidents. Both show a period of "happy" scrub and
deep-scrubs and stable writes/second across the cluster, then an
approximately 5x jump in concurrent deep-scrubs where client IO is
cut by nearly 50%.

The first image (deep-scrub-issue1.jpg) shows a happy cluster with
low numbers of scrub and deep-scrub running until about 10pm, then
something triggers deep-scrubs to increase about 5x and remain high
until I manually 'ceph osd set nodeep-scrub' at approx 10am. During
the time of higher concurrent deep-scrubs, IOPS drop significantly
due to OSD spindle contention preventing qemu/rbd clients from
writing like normal.

The second image (deep-scrub-issue2.jpg) shows a similar approx 5x
jump in concurrent deep-scrubs and associated drop in writes/second.
This image also adds a summary of the 'dump historic ops' which show
the to be expected jump in the slowest ops in the cluster.

Does anyone have an idea of what is happening when the spike in
concurrent deep-scrub occurs and how to prevent the adverse effects,
outside of disabling deep-scrub permanently?

0: http://www.mikedawson.com/deep-scrub-issue1.jpg
1: http://www.mikedawson.com/deep-scrub-issue2.jpg


Thanks,
Mike Dawson
_
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Software Engineer #42 @ http://inktank.com | http://ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep-Scrub Scheduling

2014-05-07 Thread Gregory Farnum
Is it possible you're running into the max scrub intervals and jumping up
to one-per-OSD from a much lower normal rate?
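
For reference, a sketch of the knobs involved (option names and defaults as of
the emperor era -- worth double-checking against the running release):
---
# osd_max_scrubs = 1               # concurrent scrubs per OSD
# osd_scrub_min_interval = 86400   # scrub once a day if load permits
# osd_scrub_max_interval = 604800  # ... and once a week regardless of load
# osd_deep_scrub_interval = 604800 # deep-scrub once a week
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep scrub
---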

On Wednesday, May 7, 2014, Mike Dawson  wrote:

> My write-heavy cluster struggles under the additional load created by
> deep-scrub from time to time. As I have instrumented the cluster more, it
> has become clear that there is something I cannot explain happening in the
> scheduling of PGs to undergo deep-scrub.
>
> Please refer to these images [0][1] to see two graphical representations
> of how deep-scrub goes awry in my cluster. These were two separate
> incidents. Both show a period of "happy" scrub and deep-scrubs and stable
> writes/second across the cluster, then an approximately 5x jump in
> concurrent deep-scrubs where client IO is cut by nearly 50%.
>
> The first image (deep-scrub-issue1.jpg) shows a happy cluster with low
> numbers of scrub and deep-scrub running until about 10pm, then something
> triggers deep-scrubs to increase about 5x and remain high until I manually
> 'ceph osd set nodeep-scrub' at approx 10am. During the time of higher
> concurrent deep-scrubs, IOPS drop significantly due to OSD spindle
> contention preventing qemu/rbd clients from writing like normal.
>
> The second image (deep-scrub-issue2.jpg) shows a similar approx 5x jump in
> concurrent deep-scrubs and associated drop in writes/second. This image
> also adds a summary of the 'dump historic ops' which show the to be
> expected jump in the slowest ops in the cluster.
>
> Does anyone have an idea of what is happening when the spike in concurrent
> deep-scrub occurs and how to prevent the adverse effects, outside of
> disabling deep-scrub permanently?
>
> 0: http://www.mikedawson.com/deep-scrub-issue1.jpg
> 1: http://www.mikedawson.com/deep-scrub-issue2.jpg
>
> Thanks,
> Mike Dawson
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help -Ceph deployment in Single node Like Devstack

2014-05-07 Thread Patrick McGarry
Hey,

Sorry for the delay, I have been traveling in Asia.  This question
should probably go to the ceph-user list (cc'd).

Right now there is no single push-button deployment for Ceph like
devstack (that I'm aware of)...but we have several options in terms of
orchestration and deployment (including our own ceph-deploy featured
in the doc).

A good place to see the package options is http://ceph.com/get

Sorry I couldn't give you an exact answer, but I think Ceph is pretty
approachable in terms of deployment for experimentation.  Hope that
helps.



Best Regards,

Patrick McGarry
Director, Community || Inktank
http://ceph.com  ||  http://inktank.com
@scuttlemonkey || @ceph || @inktank


On Wed, Apr 30, 2014 at 2:05 AM, Pandiyan M  wrote:
>
> Hi,
>
> I am looking for a simple Ceph installation like devstack (for OpenStack,
> one package contains it all); it should support Ceph and Puppet and provide
> the functionality of a full Ceph cluster. Please help me out.
>
> Thanks in Advance !!
> --
> PANDIYAN MUTHURAMAN
>
> Mobile : + 91 9600-963-436   (Personal)
>   +91  7259-031-872  (Official)
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-07 Thread Christian Balzer
On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:

> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer  wrote:
> >
> > Hello,
> >
> > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
> > journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
> > an Areca 1882 with 4GB of cache.
> >
> > Running this fio:
> >
> > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
> >
> > results in:
> >
> >   30k  IOPS on the journal SSD (as expected)
> >  110k  IOPS on the OSD (it fits neatly into the cache, no surprise
> > there) 3200   IOPS from a VM using userspace RBD
> > 2900   IOPS from a host kernelspace mounted RBD
> >
> > When running the fio from the VM RBD the utilization of the journals is
> > about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
> > some obvious merging).
> > The OSD processes are quite busy, reading well over 200% on atop, but
> > the system is not CPU or otherwise resource starved at that moment.
> >
> > Running multiple instances of this test from several VMs on different
> > hosts changes nothing, as in the aggregated IOPS for the whole cluster
> > will still be around 3200 IOPS.
> >
> > Now clearly RBD has to deal with latency here, but the network is IPoIB
> > with the associated low latency and the journal SSDs are the
> > (consistently) fasted ones around.
> >
> > I guess what I am wondering about is if this is normal and to be
> > expected or if not where all that potential performance got lost.
> 
> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota. 
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.

> that's about 40ms of latency per op (for userspace RBD), which seems
> awfully long. You should check what your client-side objecter settings
> are; it might be limiting you to fewer outstanding ops than that. 

Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned. 
Care to enlighten me where one can find those?

Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.

> If
> it's available to you, testing with Firefly or even master would be
> interesting — there's some performance work that should reduce
> latencies.
> 
Not an option, this is going into production next week.

> But a well-tuned (or even default-tuned, I thought) Ceph cluster
> certainly doesn't require 40ms/op, so you should probably run a wider
> array of experiments to try and figure out where it's coming from.

I think we can rule out the network, NPtcp gives me:
---
 56:    4096 bytes   1546 times -->   979.22 Mbps in  31.91 usec
---

For comparison at about 512KB it reaches maximum throughput and still
isn't that laggy:
---
 98:  524288 bytes    121 times -->  9700.57 Mbps in 412.35 usec
---

So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.

That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^

Regards,

Christian

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-07 Thread Gregory Farnum
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer  wrote:
>
> Hello,
>
> ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals
> are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882
> with 4GB of cache.
>
> Running this fio:
>
> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
> --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
>
> results in:
>
>   30k  IOPS on the journal SSD (as expected)
>  110k  IOPS on the OSD (it fits neatly into the cache, no surprise there)
> 3200   IOPS from a VM using userspace RBD
> 2900   IOPS from a host kernelspace mounted RBD
>
> When running the fio from the VM RBD the utilization of the journals is
> about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some
> obvious merging).
> The OSD processes are quite busy, reading well over 200% on atop, but
> the system is not CPU or otherwise resource starved at that moment.
>
> Running multiple instances of this test from several VMs on different hosts
> changes nothing, as in the aggregated IOPS for the whole cluster will
> still be around 3200 IOPS.
>
> Now clearly RBD has to deal with latency here, but the network is IPoIB
> with the associated low latency and the journal SSDs are the
> (consistently) fasted ones around.
>
> I guess what I am wondering about is if this is normal and to be expected
> or if not where all that potential performance got lost.

Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that. If
it's available to you, testing with Firefly or even master would be
interesting — there's some performance work that should reduce
latencies.
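
For reference, that figure is just the queue depth divided by the observed
rate (Little's law), e.g.:
---
echo "scale=1; 128 * 1000 / 3200" | bc    # -> 40.0 ms per op
---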

But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deep-Scrub Scheduling

2014-05-07 Thread Mike Dawson
My write-heavy cluster struggles under the additional load created by 
deep-scrub from time to time. As I have instrumented the cluster more, 
it has become clear that there is something I cannot explain happening 
in the scheduling of PGs to undergo deep-scrub.


Please refer to these images [0][1] to see two graphical representations 
of how deep-scrub goes awry in my cluster. These were two separate 
incidents. Both show a period of "happy" scrub and deep-scrubs and 
stable writes/second across the cluster, then an approximately 5x jump 
in concurrent deep-scrubs where client IO is cut by nearly 50%.


The first image (deep-scrub-issue1.jpg) shows a happy cluster with low 
numbers of scrub and deep-scrub running until about 10pm, then something 
triggers deep-scrubs to increase about 5x and remain high until I 
manually 'ceph osd set nodeep-scrub' at approx 10am. During the time of 
higher concurrent deep-scrubs, IOPS drop significantly due to OSD 
spindle contention preventing qemu/rbd clients from writing like normal.


The second image (deep-scrub-issue2.jpg) shows a similar approx 5x jump 
in concurrent deep-scrubs and associated drop in writes/second. This 
image also adds a summary of the 'dump historic ops' output, which shows 
the expected jump in the slowest ops in the cluster.


Does anyone have an idea of what is happening when the spike in 
concurrent deep-scrub occurs and how to prevent the adverse effects, 
outside of disabling deep-scrub permanently?


0: http://www.mikedawson.com/deep-scrub-issue1.jpg
1: http://www.mikedawson.com/deep-scrub-issue2.jpg

Thanks,
Mike Dawson
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-07 Thread Christian Balzer

Hello,

ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals
are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882
with 4GB of cache.

Running this fio:

fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=randwrite --name=fiojob --blocksize=4k --iodepth=128

results in:

  30k  IOPS on the journal SSD (as expected)
 110k  IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200   IOPS from a VM using userspace RBD
2900   IOPS from a host kernelspace mounted RBD
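
(For reference, the kernelspace case would be exercised along these lines --
the image name, filesystem and mountpoint are placeholders, not the actual
test setup:)
---
rbd create --size 4096 rbd/fiotest            # 4 GB test image
rbd map rbd/fiotest                           # appears as e.g. /dev/rbd0
mkfs.xfs /dev/rbd0 && mkdir -p /mnt/fiotest && mount /dev/rbd0 /mnt/fiotest
cd /mnt/fiotest && fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 \
  --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
---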

When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some
obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.

Running multiple instances of this test from several VMs on different hosts
changes nothing, as in the aggregated IOPS for the whole cluster will
still be around 3200 IOPS.

Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fastest ones around. 

I guess what I am wondering about is if this is normal and to be expected
or if not where all that potential performance got lost.

Regards,

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Craig Lewis

On 5/7/14 15:33 , Dimitri Maziuk wrote:

On 05/07/2014 04:11 PM, Craig Lewis wrote:

On 5/7/14 13:40 , Sergey Malinin wrote:

Check dmesg and SMART data on both nodes. This behaviour is similar to
failing hdd.



It does sound like a failing disk... but there's nothing in dmesg, and
smartmontools hasn't emailed me about a failing disk.  The same thing is
happening to more than 50% of my OSDs, in both nodes.

check 'iostat -dmx 5 5' (or some other numbers) -- if you see 100%+ disk
utilization, that could be the dying one.




About an hour after I applied the osd_recovery_max_active=1, things 
settled down.  Looking at the graphs, it looks like most of the OSDs 
crashed one more time, then started working correctly.


Because of the very low recovery parameters, there's only a single 
backfill running.  `iostat -dmx 5 5` did report 100% util on the osd 
that is backfilling, but I expected that.  Once backfilling moves on to 
a new osd, the 100% util follows the backfill operation.



There's a lot of recovery to finish.  Hopefully this will last until it 
completes.  If so, I'm adding osd_recovery_max_active=1 to ceph.conf.
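
For reference, a sketch of both the runtime and the persistent form (the
runtime change is lost when the OSDs restart):
---
ceph tell osd.* injectargs '--osd-recovery-max-active 1'

# /etc/ceph/ceph.conf
[osd]
osd recovery max active = 1
---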


--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Gilles Mocellin

On 07/05/2014 15:23, Vlad Gorbunov wrote:

It's easy to install tgtd with ceph support. ubuntu 12.04 for example:

Connect ceph-extras repo:
echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release 
-sc) main | sudo tee /etc/apt/sources.list.d/ceph-extras.list


Install tgtd with rbd support:
apt-get update
apt-get install tgt

It's important to disable the rbd cache on tgtd host. Set in 
/etc/ceph/ceph.conf:

[client]
rbd_cache = false

[...]

Hello,

Without cache on the tgtd side, it should be possible to have failover 
and load balancing (active/active) multipathing.

Have you tested multipath load balancing in this scenario?

If it's reliable, it opens a new way for me to do HA storage with iSCSI!
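
A sketch of the client side of such a setup, assuming two tgtd gateways at
10.0.0.1 and 10.0.0.2 exporting the same rbd-backed LUN (the addresses are
placeholders):
---
iscsiadm -m discovery -t sendtargets -p 10.0.0.1
iscsiadm -m discovery -t sendtargets -p 10.0.0.2
iscsiadm -m node --login
multipath -ll     # should show one LUN with two active paths
---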


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Vladislav Gorbunov
>Should this be done on the iscsi target server? I have a default option to 
>enable rbd caching as it speeds things up on the vms.
Yes, only on the iscsi target servers.

2014-05-08 1:29 GMT+12:00 Andrei Mikhailovsky :
>> It's important to disable the rbd cache on tgtd host. Set in
>> /etc/ceph/ceph.conf:
>
>
> Should this be done on the iscsi target server? I have a default option to
> enable rbd caching as it speeds things up on the vms.
>
> Thanks
>
> Andrei
>
>
>
> 
> From: "Vlad Gorbunov" 
> To: "Sergey Malinin" 
> Cc: "Andrei Mikhailovsky" , ceph-users@lists.ceph.com
> Sent: Wednesday, 7 May, 2014 2:23:52 PM
>
> Subject: Re: [ceph-users] NFS over CEPH - best practice
>
> It's easy to install tgtd with ceph support. ubuntu 12.04 for example:
>
> Connect ceph-extras repo:
> echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main
> | sudo tee /etc/apt/sources.list.d/ceph-extras.list
>
> Install tgtd with rbd support:
> apt-get update
> apt-get install tgt
>
> It's important to disable the rbd cache on tgtd host. Set in
> /etc/ceph/ceph.conf:
> [client]
> rbd_cache = false
>
> Define a permanent rbd export over iSCSI in /etc/tgt/targets.conf:
>
> <target ...>
>     driver iscsi
>     bs-type rbd
>     backing-store iscsi/volume512
>     initiator-address 10.166.18.87
> </target>
> service tgt reload
>
> Or use commands:
> tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1
> --backing-store iscsi/volume512 --bstype rbd
> tgtadm -C 0 --lld iscsi --op bind --mode target --tid 1 -I 10.166.18.87
>
> tgt-admin -s
> shows the current iSCSI settings and sessions.
>
>
>
> You can install tgtd on multiple osd/monitor hosts and connect the iSCSI
> initiator to these servers with multipath enabled.  iSCSI proxy servers are not
> needed with tgtd.
>
> On Thu, May 8, 2014 at 12:20 AM, Sergey Malinin, wrote:
>>
>>
>> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
>>
>> On Wednesday, May 7, 2014 at 15:06, Andrei Mikhailovsky wrote:
>>
>>
>> Vlad, is there a howto somewhere describing the steps on how to setup
>> iscsi multipathing over ceph? It looks like a good alternative to nfs
>>
>> Thanks
>>
>> 
>> From: "Vlad Gorbunov" 
>> To: "Andrei Mikhailovsky" 
>> Cc: ceph-users@lists.ceph.com
>> Sent: Wednesday, 7 May, 2014 12:02:09 PM
>> Subject: Re: [ceph-users] NFS over CEPH - best practice
>>
>> For XenServer or VMware it is better to use an iSCSI client against tgtd with
>> ceph support. You can install tgtd on an osd or monitor server and use
>> multipath for failover.
>>
>> On Wed, May 7, 2014 at 9:47 PM, Andrei Mikhailovsky 
>> wrote:
>>
>> Hello guys,
>>
>> I would like to offer NFS service to the XenServer and VMWare hypervisors
>> for storing vm images. I am currently running ceph rbd with kvm, which is
>> working reasonably well.
>>
>> What would be the best way of running NFS services over CEPH, so that the
>> XenServer and VMWare's vm disk images are stored in ceph storage over NFS?
>>
>> Many thanks
>>
>> Andrei
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Dimitri Maziuk
On 05/07/2014 04:11 PM, Craig Lewis wrote:
> On 5/7/14 13:40 , Sergey Malinin wrote:
>> Check dmesg and SMART data on both nodes. This behaviour is similar to
>> failing hdd.
>>
>>
> 
> It does sound like a failing disk... but there's nothing in dmesg, and
> smartmontools hasn't emailed me about a failing disk.  The same thing is
> happening to more than 50% of my OSDs, in both nodes.

check 'iostat -dmx 5 5' (or some other numbers) -- if you see 100%+ disk
utilization, that could be the dying one.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Craig Lewis

On 5/7/14 13:40 , Sergey Malinin wrote:
Check dmesg and SMART data on both nodes. This behaviour is similar to 
failing hdd.





It does sound like a failing disk... but there's nothing in dmesg, and 
smartmontools hasn't emailed me about a failing disk.  The same thing is 
happening to more than 50% of my OSDs, in both nodes.




smartctl for osd.5 says:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       81
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       606
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   119   119   020    Pre-fail  Offline      -       35
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       4028
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       166
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       166
194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       36 (Min/Max 21/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0


The weekly scheduled tests have all completed successfully:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%             3922  -
# 2  Short offline       Completed without error       00%             3754  -
...



--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us: Website | Twitter | Facebook | LinkedIn | Blog



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Sergey Malinin
Check dmesg and SMART data on both nodes. This behaviour is similar to failing 
hdd. 


On Wednesday, May 7, 2014 at 23:28, Craig Lewis wrote:

> On 5/7/14 13:15 , Sergey Malinin wrote:
> > Is there anything unusual in dmesg at osd.5? 
> 
> Nothing in dmesg, but ceph-osd.5.log has plenty.  I've attached the log after 
> the restart.  Logging levels are normal.
> 
> What jumps out at me is:
> 2014-05-07 12:48:02.640164 7ff65d439700 -1 osd.5 38870 heartbeat_check: no 
> reply from osd.8 ever on either front or back, first ping sent 2014-05-07 
> 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 2014-05-07 12:48:02.640174 7ff65d439700 -1 osd.5 38870 heartbeat_check: no 
> reply from osd.11 ever on either front or back, first ping sent 2014-05-07 
> 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 2014-05-07 12:48:02.640180 7ff65d439700 -1 osd.5 38870 heartbeat_check: no 
> reply from osd.12 ever on either front or back, first ping sent 2014-05-07 
> 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 2014-05-07 12:48:02.640186 7ff65d439700 -1 osd.5 38870 heartbeat_check: no 
> reply from osd.13 ever on either front or back, first ping sent 2014-05-07 
> 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 
> osd.5 is on host ceph0c.
> osd.8, osd.11, osd.12, and osd.13 are all on ceph1c (along with 4 other 
> OSDs).  Both the front and back network are working fine, and I can connect 
> to osd.8 from host ceph0 just fine.  These 4 OSDs are not reporting problems.
> 
> The other OSDs that are flapping are osd.6 and osd.15.
> 
> osd.15 says:
> 2014-05-07 13:25:44.626840 7fe312c9d700 -1 osd.15 38891 heartbeat_check: no 
> reply from osd.5 since back 2014-05-07 13:10:01.239883 front 2014-05-07 
> 13:10:01.239883 (cutoff 2014-05-07 13:25:24.626838)
> 2014-05-07 13:25:44.626849 7fe312c9d700 -1 osd.15 38891 heartbeat_check: no 
> reply from osd.11 since back 2014-05-07 13:22:48.592121 front 2014-05-07 
> 13:22:48.592121 (cutoff 2014-05-07 13:25:24.626838)
> 
> osd.6 says:
> 2014-05-07 13:26:15.409217 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no 
> reply from osd.5 since back 2014-05-07 13:09:57.440713 front 2014-05-07 
> 13:09:57.440713 (cutoff 2014-05-07 13:25:55.409216)
> 2014-05-07 13:26:15.409227 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no 
> reply from osd.11 since back 2014-05-07 13:22:50.353671 front 2014-05-07 
> 13:22:50.353671 (cutoff 2014-05-07 13:25:55.409216)
> 2014-05-07 13:26:15.409235 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no 
> reply from osd.13 since back 2014-05-07 13:11:26.959761 front 2014-05-07 
> 13:11:26.959761 (cutoff 2014-05-07 13:25:55.409216)
> 2014-05-07 13:26:15.409306 7f35e13e5700  0 -- 10.194.0.6:0/17586 >> 
> 10.194.0.7:6803/19641 pipe(0x1c4d7500 sd=79 :56788 s=1 pgs=0 cs=0 l=1 
> c=0x1c646840).connect claims to be 10.194.0.7:6803/1019705 not 
> 10.194.0.7:6803/19641 - wrong node!
> 
> osd.11 and osd.13 have been kicked out for being unresponsive, but they don't 
> have any heartbeat_check entries in their logs.
> 
> 
> 
> 
> -- 
> Craig Lewis 
> Senior Systems Engineer
> Office +1.714.602.1309
> Email cle...@centraldesktop.com (mailto:cle...@centraldesktop.com) 
> Central Desktop. Work together in ways you never thought possible. 
> Connect with us   Website (http://www.centraldesktop.com/)  |  Twitter 
> (http://www.twitter.com/centraldesktop)  |  Facebook 
> (http://www.facebook.com/CentralDesktop)  |  LinkedIn 
> (http://www.linkedin.com/groups?gid=147417)  |  Blog 
> (http://cdblog.centraldesktop.com/) 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> Attachments: 
> - ceph-osd.5.log
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Sergey Malinin
Is there anything unusual in dmesg at osd.5?


On Wednesday, May 7, 2014 at 23:09, Craig Lewis wrote:

> I already have osd_max_backfill = 1, and osd_recovery_op_priority = 1.  
> 
> osd_recovery_max_active is the default 15, so I'll give that a try...  some 
> OSDs timed out during the injectargs.  I added it to ceph.conf, and restarted 
> them all.  
> 
> I was running RadosGW-Agent, but it's down now.  I disabled scrub and 
> deep-scrub as well.  All the Disk I/O is dedicated to recovery now.
> 
> 15 minutes after the restart:
> 2014-05-07 13:03:19.249179 mon.0 [INF] osd.5 marked down after no pg stats 
> for 901.601323seconds
> 
> One of the OSDs (osd.5) didn't complete the peering process.  It's like the 
> OSD locked up immediately after restart.  It looks like it too.  As soon as 
> osd.5 started peering, it went to exactly 100% CPU, and other OSDs start 
> complaining that it wasn't responding to subops.
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bulk storage use case

2014-05-07 Thread Cedric Lemarchand
Some more details: the io pattern will be around 90% write / 10% read,
mainly sequential.
Recent posts show that the max_backfills, recovery_max_active and
recovery_op_priority settings will be helpful in case of
backfilling/rebalancing.

Any thoughts on such a hardware setup?

On 07/05/2014 11:43, Cedric Lemarchand wrote:
> Hello,
>
> This build is only intended for archiving purposes; what matters here is
> lowering the $/TB/W ratio.
> Access to the storage would be via radosgw, installed on each node. I
> need each node to sustain an average write rate of 1Gb/s, which
> I think should not be a problem. Erasure coding will be used with
> something like k=12 m=3.
>
> A typical node would be :
>
> - Supermicro 36 bays
> - 2x Xeon E5-2630Lv2
> - 96GB RAM (the recommended 1GB/TB ratio for OSDs is lowered a bit ... )
> - LSI HBA adapters in JBOD mode, could be 2x 9207-8i
> - 36x 4TB HDD with the default journal config
> - dedicated bonded 2Gb links for public/private networks (backfilling
> will take ages if a full node is lost ...)
>
>
> I think in an *optimal* state (ceph healthy), it could handle the job.
> Waiting for your comment.
>
> What bothers me more are OSD maintenance operations like
> backfilling and cluster rebalancing, where nodes will be put under
> very high IO/memory and CPU load for hours/days. Will the latency
> *just* grow, or will everything fly apart? (OOM killer spawning,
> OSD suicides because of latency, nodes pushed out of the cluster, etc ...)
>
> As you can see, I am trying to design the cluster with a sweet spot in
> mind like "things become slow, latency grows, but the nodes
> stay stable/usable and aren't pushed out of the cluster".
>
> This is my first jump into Ceph, so any inputs will be greatly
> appreciated ;-)
>
> Cheers,
>
> --
> Cédric
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Cédric

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Craig Lewis

I already have osd_max_backfill = 1, and osd_recovery_op_priority = 1.

osd_recovery_max_active is the default 15, so I'll give that a try...  
some OSDs timed out during the injectargs.  I added it to ceph.conf, and 
restarted them all.


I was running RadosGW-Agent, but it's down now.  I disabled scrub and 
deep-scrub as well.  All the Disk I/O is dedicated to recovery now.
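
For the record, the throttles ended up in ceph.conf roughly like this (a
paraphrased sketch, not a copy of the actual config), plus the usual scrub
flags:

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1

ceph osd set noscrub
ceph osd set nodeep-scrub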


15 minutes after the restart:
2014-05-07 13:03:19.249179 mon.0 [INF] osd.5 marked down after no pg 
stats for 901.601323seconds


One of the OSDs (osd.5) didn't complete the peering process.  It's like 
the OSD locked up immediately after restart.  It looks like it too.  As 
soon as osd.5 started peering, it went to exactly 100% CPU, and other 
OSDs start complaining that it wasn't responding to subops.









On 5/7/14 11:45 , Mike Dawson wrote:

Craig,

I suspect the disks in question are seeking constantly and the spindle 
contention is causing significant latency. A strategy of throttling 
backfill/recovery and reducing client traffic tends to work for me.


1) You should make sure recovery and backfill are throttled:
ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'

2) We run a not-particularly critical service with a constant stream 
of 95% write/5% read small, random IO. During recovery/backfill, we 
are heavily bound by IOPS. It often times feels like a net win to 
throttle unessential client traffic in an effort to get spindle 
contention under control if Step 1 wasn't enough.


If that all fails, you can try "ceph osd set nodown" which will 
prevent OSDs from being marked down (with or without proper cause), 
but that tends to cause me more trouble than it's worth.


Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 5/7/2014 1:28 PM, Craig Lewis wrote:

The 5 OSDs that are down have all been kicked out for being
unresponsive.  The 5 OSDs are getting kicked faster than they can
complete the recovery+backfill.  The number of degraded PGs is growing
over time.

root@ceph0c:~# ceph -w
 cluster 1604ec7a-6ceb-42fc-8c68-0a7896c4e120
  health HEALTH_WARN 49 pgs backfill; 926 pgs degraded; 252 pgs
down; 30 pgs incomplete; 291 pgs peering; 1 pgs recovery_wait; 175 pgs
stale; 255 pgs stuck inactive; 175 pgs stuck stale; 1234 pgs stuck
unclean; 66 requests are blocked > 32 sec; recovery 6820014/3806
objects degraded (17.921%); 4/16 in osds are down; noout flag(s) set
  monmap e2: 2 mons at
{ceph0c=10.193.0.6:6789/0,ceph1c=10.193.0.7:6789/0}, election epoch 238,
quorum 0,1 ceph0c,ceph1c
  osdmap e38673: 16 osds: 12 up, 16 in
 flags noout
   pgmap v7325233: 2560 pgs, 17 pools, 14090 GB data, 18581 kobjects
 28456 GB used, 31132 GB / 59588 GB avail
 6820014/3806 objects degraded (17.921%)
1 stale+active+clean+scrubbing+deep
   15 active
 1247 active+clean
1 active+recovery_wait
   45 stale+active+clean
   39 peering
   29 stale+active+degraded+wait_backfill
  252 down+peering
  827 active+degraded
   50 stale+active+degraded
   20 stale+active+degraded+remapped+wait_backfill
   30 stale+incomplete
4 active+clean+scrubbing+deep

Here's a snippet of ceph.log for one of these OSDs:
2014-05-07 09:22:46.747036 mon.0 10.193.0.6:6789/0 39981 : [INF] osd.3
marked down after no pg stats for 901.212859seconds
2014-05-07 09:47:17.930251 mon.0 10.193.0.6:6789/0 40561 : [INF] osd.3
10.193.0.6:6812/2830 boot
2014-05-07 09:47:16.914519 osd.3 10.193.0.6:6812/2830 823 : [WRN] map
e38649 wrongly marked me down

root@ceph0c:~# uname -a
Linux ceph0c 3.5.0-46-generic #70~precise1-Ubuntu SMP Thu Jan 9 23:55:12
UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@ceph0c:~# lsb_release -a
No LSB modules are available.
Distributor ID:Ubuntu
Description:Ubuntu 12.04.4 LTS
Release:12.04
Codename:precise
root@ceph0c:~# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)


Any ideas what I can do to make these OSDs stop dying after 15 minutes?




--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us: Website | Twitter | Facebook | LinkedIn | Blog




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Ovirt

2014-05-07 Thread Wido den Hollander

On 05/07/2014 08:14 PM, Neil Levine wrote:

We were actually talking to Red Hat about oVirt support before the
acquisition. It's on the To Do list but no dates yet.
Of course, someone from the community is welcome to step up and do the work.



I looked at it some time ago. I noticed that oVirt relies on libvirt for 
its storage, just like CloudStack. Since there is already RBD storage 
pool support in libvirt, it shouldn't be that much work I think.


I however never used oVirt, so I can't tell for sure.


Neil

On Wed, May 7, 2014 at 9:49 AM, Nathan Stratton  wrote:

Now that everyone will be one big happy family, any news on ceph support for
ovirt?


<>

nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 |
www.broadsoft.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [ANN] ceph-deploy 1.5.2 released

2014-05-07 Thread Alfredo Deza
Hi All,

There is a new bug-fix release of ceph-deploy, the easy deployment
tool for Ceph.

This release comes with two important changes:

* fix usage of `--` when removing packages in Debian/Ubuntu
* Default to Firefly when installing Ceph.

Make sure you upgrade!
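
For example, on a Debian/Ubuntu admin node that installed ceph-deploy from the
ceph.com repos, the upgrade should just be (or the pip equivalent if you
installed it that way):

sudo apt-get update && sudo apt-get install --only-upgrade ceph-deploy
# or: sudo pip install --upgrade ceph-deploy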


-Alfredo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Mike Dawson

Craig,

I suspect the disks in question are seeking constantly and the spindle 
contention is causing significant latency. A strategy of throttling 
backfill/recovery and reducing client traffic tends to work for me.


1) You should make sure recovery and backfill are throttled:
ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'

2) We run a not-particularly critical service with a constant stream of 
95% write/5% read small, random IO. During recovery/backfill, we are 
heavily bound by IOPS. It often times feels like a net win to throttle 
unessential client traffic in an effort to get spindle contention under 
control if Step 1 wasn't enough.


If that all fails, you can try "ceph osd set nodown" which will prevent 
OSDs from being marked down (with or without proper cause), but that 
tends to cause me more trouble than it's worth.


Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 5/7/2014 1:28 PM, Craig Lewis wrote:

The 5 OSDs that are down have all been kicked out for being
unresponsive.  The 5 OSDs are getting kicked faster than they can
complete the recovery+backfill.  The number of degraded PGs is growing
over time.

root@ceph0c:~# ceph -w
 cluster 1604ec7a-6ceb-42fc-8c68-0a7896c4e120
  health HEALTH_WARN 49 pgs backfill; 926 pgs degraded; 252 pgs
down; 30 pgs incomplete; 291 pgs peering; 1 pgs recovery_wait; 175 pgs
stale; 255 pgs stuck inactive; 175 pgs stuck stale; 1234 pgs stuck
unclean; 66 requests are blocked > 32 sec; recovery 6820014/3806
objects degraded (17.921%); 4/16 in osds are down; noout flag(s) set
  monmap e2: 2 mons at
{ceph0c=10.193.0.6:6789/0,ceph1c=10.193.0.7:6789/0}, election epoch 238,
quorum 0,1 ceph0c,ceph1c
  osdmap e38673: 16 osds: 12 up, 16 in
 flags noout
   pgmap v7325233: 2560 pgs, 17 pools, 14090 GB data, 18581 kobjects
 28456 GB used, 31132 GB / 59588 GB avail
 6820014/3806 objects degraded (17.921%)
1 stale+active+clean+scrubbing+deep
   15 active
 1247 active+clean
1 active+recovery_wait
   45 stale+active+clean
   39 peering
   29 stale+active+degraded+wait_backfill
  252 down+peering
  827 active+degraded
   50 stale+active+degraded
   20 stale+active+degraded+remapped+wait_backfill
   30 stale+incomplete
4 active+clean+scrubbing+deep

Here's a snippet of ceph.log for one of these OSDs:
2014-05-07 09:22:46.747036 mon.0 10.193.0.6:6789/0 39981 : [INF] osd.3
marked down after no pg stats for 901.212859seconds
2014-05-07 09:47:17.930251 mon.0 10.193.0.6:6789/0 40561 : [INF] osd.3
10.193.0.6:6812/2830 boot
2014-05-07 09:47:16.914519 osd.3 10.193.0.6:6812/2830 823 : [WRN] map
e38649 wrongly marked me down

root@ceph0c:~# uname -a
Linux ceph0c 3.5.0-46-generic #70~precise1-Ubuntu SMP Thu Jan 9 23:55:12
UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@ceph0c:~# lsb_release -a
No LSB modules are available.
Distributor ID:Ubuntu
Description:Ubuntu 12.04.4 LTS
Release:12.04
Codename:precise
root@ceph0c:~# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)


Any ideas what I can do to make these OSDs stop dying after 15 minutes?




--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us: Website | Twitter | Facebook | LinkedIn | Blog




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80 Firefly released

2014-05-07 Thread Gregory Farnum
On Wed, May 7, 2014 at 11:18 AM, Mike Dawson  wrote:
>
> On 5/7/2014 11:53 AM, Gregory Farnum wrote:
>>
>> On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster
>>  wrote:
>>>
>>> Hi,
>>>
>>>
>>> Sage Weil wrote:
>>>
>>> * *Primary affinity*: Ceph now has the ability to skew selection of
>>>OSDs as the "primary" copy, which allows the read workload to be
>>>cheaply skewed away from parts of the cluster without migrating any
>>>data.
>>>
>>>
>>> Can you please elaborate a bit on this one? I found the blueprint [1] but
>>> still don't quite understand how it works. Does this only change the
>>> crush
>>> calculation for reads? i.e writes still go to the usual primary, but
>>> reads
>>> are distributed across the replicas? If so, does this change the
>>> consistency
>>> model in any way.
>>
>>
>> It changes the calculation of who becomes the primary, and that
>> primary serves both reads and writes. In slightly more depth:
>> Previously, the primary has always been the first OSD chosen as a
>> member of the PG.
>> For erasure coding, we added the ability to specify a primary
>> independent of the selection ordering. This was part of a broad set of
>> changes to prevent moving the EC "shards" around between different
>> members of the PG, and means that the primary might be the second OSD
>> in the PG, or the fourth.
>> Once this work existed, we realized that it might be useful in other
>> cases, because primaries get more of the work for their PG (serving
>> all reads, coordinating writes).
>> So we added the ability to specify a "primary affinity", which is like
>> the CRUSH weights but only impacts whether you become the primary. So
>> if you have 3 OSDs that each have primary affinity = 1, it will behave
>> as normal. If two have primary affinity = 0, the remaining OSD will be
>> the primary. Etc.
>
>
> Is it possible (and/or advisable) to set primary affinity low while
> backfilling / recovering an OSD in an effort to prevent unnecessary slow
> reads that could be directed to less busy replicas?

I have no experimental data and haven't thought about it in the past,
but that sounds like it might be helpful, yeah!
Your clients will need to support this feature, so if you're using
kernel clients you need a very new kernel (I don't remember exactly
which one).
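
For anyone who wants to try it, the knobs involved are roughly the following
(a sketch; if I remember right, on Firefly the monitors also have to be told
to accept non-default primary affinity values first):

ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity true'
ceph osd primary-affinity osd.0 0    # divert primary duties away from osd.0
ceph osd primary-affinity osd.0 1    # restore the default afterwards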
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

> I suppose if the cost of
> setting/unsetting primary affinity is low and clients are starved for reads
> during backfill/recovery from the osd in question, it could be a win.
>
> Perhaps the workflow for maintenance on osd.0 would be something like:
>
> - Stop osd.0, do some maintenance on osd.0
> - Read primary affinity of osd.0, store it for later
> - Set primary affinity on osd.0 to 0
> - Start osd.0
> - Enjoy a better backfill/recovery experience. RBD clients happier.
> - Reset primary affinity on osd.0 to previous value
>
> If the cost of setting primary affinity is low enough, perhaps this strategy
> could be automated by the ceph daemons.
>
> Thanks,
> Mike Dawson
>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cannot revert lost objects

2014-05-07 Thread Kevin Horan


It is still "querying", after 6 days now. I have not tried any 
scrubbing options, I'll try them just to see. My next idea was to 
clobber osd 8, the one it is supposedly "querying".




   I ran into this problem too.  I don't know what I did to fix it.

   I tried ceph pg scrub , ceph pg deep-scrub , and ceph
   osd scrub .  None of them had an immediate effect.  In the
   end, it finally cleared several days later in the middle of the
   night.  I can't even say what or when it finally cleared.  A
   different OSD got kicked out, then rejoined.  While everything was
   moving from degraded to active+clean, it finally finished probing.

   If it's still happening tomorrow, I'd try to find a Geeks on IRC
   Duty (http://ceph.com/help/community/).


   On 5/3/14 09:43 , Kevin Horan wrote:

Craig,
Thanks for your response. I have already marked osd.6 as lost,
as you suggested. The problem is that it is still querying osd.8
which is not lost. I don't know why it is stuck there. It has been
querying osd.8 for 4 days now.
I also tried deleting the broken RBD image but the operation
just hangs.

Kevin


On 5/1/14 10:11 , kevin horan wrote:

Here is how I got into this state. I have only 6 OSDs total,
3 on one host (vashti) and 3 on another (zadok). I set the
noout flag so I could reboot zadok. Zadok was down for 2
minutes. When it came up ceph began recovering the objects
that had not been replicated yet. Before recovery finished,
osd.6, on vashti, died (IO errors on disk, whole drive
un-recoverable). Since osd.6 had objects that had not yet had
a chance to replicate to any OSD on zadok, they were lost. I
cannot recover anything further from osd.6.



I'm pretty far out of my element here, but if osd.6 is gone,
it might help to mark it lost:
ceph osd lost 6

I had similar issues when I lost some PGs.  I don't think
that it actually fixed my issue, but marking osds as lost did
help Ceph move forward.


You could also try deleting the broken RBD image, and see if
that helps.



--
Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis-04jk9tcbggyp2ihm84uzcnbpr1lh4...@public.gmane.org

Central Desktop. Work together in ways you never thought possible.
Connect with us: Website | Twitter | Facebook | LinkedIn | Blog



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] How to install CEPH on CentOS 6.3

2014-05-07 Thread Aaron Ten Clay
On Tue, May 6, 2014 at 7:35 PM, Ease Lu  wrote:

> Hi All,
>  Following the CEPH online documentation, I tried to install CEPH on
> CentOS 6.3:
>
>  The step: ADD CEPH
>   I cannot find a centos distro, so I used el6. When I reach the
> "INSTALL VIRTUALIZATION FOR BLOCK DEVICE" step, I got:
>
> Error: Package: 2:qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
>Requires: libusbredirparser.so.1()(64bit)
> Error: Package: 2:qemu-img-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
>Requires: libusbredirparser.so.1()(64bit)
> Error: Package: 2:qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
>Requires: libspice-server.so.1(SPICE_SERVER_0.11.2)(64bit)
> Error: Package: 2:qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
>Requires: libspice-server.so.1(SPICE_SERVER_0.12.4)(64bit)
> Error: Package: 2:qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
>Requires: seabios >= 0.6.1.2-20.el6
>Available: seabios-0.6.1.2-19.el6.x86_64 (qa_os_centos6.3_64)
>seabios = 0.6.1.2-19.el6
> Error: Package: 2:qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
>
>
>  Would you please tell me how to resolve the issue?
>
> Best Regards,
> Jack
>
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
>
Hi,

You will probably get more help from the ceph-users list. I've CC'd your
message.

-Aaron
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80 Firefly released

2014-05-07 Thread Mike Dawson


On 5/7/2014 11:53 AM, Gregory Farnum wrote:

On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster
 wrote:

Hi,


Sage Weil wrote:

* *Primary affinity*: Ceph now has the ability to skew selection of
   OSDs as the "primary" copy, which allows the read workload to be
   cheaply skewed away from parts of the cluster without migrating any
   data.


Can you please elaborate a bit on this one? I found the blueprint [1] but
still don't quite understand how it works. Does this only change the crush
calculation for reads? i.e writes still go to the usual primary, but reads
are distributed across the replicas? If so, does this change the consistency
model in any way.


It changes the calculation of who becomes the primary, and that
primary serves both reads and writes. In slightly more depth:
Previously, the primary has always been the first OSD chosen as a
member of the PG.
For erasure coding, we added the ability to specify a primary
independent of the selection ordering. This was part of a broad set of
changes to prevent moving the EC "shards" around between different
members of the PG, and means that the primary might be the second OSD
in the PG, or the fourth.
Once this work existed, we realized that it might be useful in other
cases, because primaries get more of the work for their PG (serving
all reads, coordinating writes).
So we added the ability to specify a "primary affinity", which is like
the CRUSH weights but only impacts whether you become the primary. So
if you have 3 OSDs that each have primary affinity = 1, it will behave
as normal. If two have primary affinity = 0, the remaining OSD will be
the primary. Etc.


Is it possible (and/or advisable) to set primary affinity low while 
backfilling / recovering an OSD in an effort to prevent unnecessary slow 
reads that could be directed to less busy replicas? I suppose if the 
cost of setting/unsetting primary affinity is low and clients are 
starved for reads during backfill/recovery from the osd in question, it 
could be a win.


Perhaps the workflow for maintenance on osd.0 would be something like:

- Stop osd.0, do some maintenance on osd.0
- Read primary affinity of osd.0, store it for later
- Set primary affinity on osd.0 to 0
- Start osd.0
- Enjoy a better backfill/recovery experience. RBD clients happier.
- Reset primary affinity on osd.0 to previous value

If the cost of setting primary affinity is low enough, perhaps this 
strategy could be automated by the ceph daemons.


Thanks,
Mike Dawson


-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ovirt

2014-05-07 Thread Neil Levine
We were actually talking to Red Hat about oVirt support before the
acquisition. It's on the To Do list but no dates yet.
Of course, someone from the community is welcome to step up and do the work.

Neil

On Wed, May 7, 2014 at 9:49 AM, Nathan Stratton  wrote:
> Now that everyone will be one big happy family, any news on ceph support for
> ovirt?
>
>><>
> nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 |
> www.broadsoft.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] health HEALTH_WARN too few pgs per osd (16 < min 20)

2014-05-07 Thread Sergey Malinin
On Wednesday, May 7, 2014 at 20:28, *sm1Ly wrote:
> 
> [sm1ly@salt1 ceph]$ sudo ceph -s
> cluster 0b2c9c20-985a-4a39-af8e-ef2325234744
>  health HEALTH_WARN 19 pgs degraded; 192 pgs stuck unclean; recovery 
> 21/42 objects degraded (50.000%); too few pgs per osd (16 < min 20)
> 

You might need to adjust default number of PGs per pool and recreate pools.
http://ceph.com/docs/master/rados/operations/placement-groups/
http://ceph.com/docs/master/rados/operations/pools/#createpool
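
For example, with 12 OSDs and 2- or 3-way replication the usual rule of thumb
(OSDs * 100 / replicas, rounded up to a power of two) lands around 512 PGs, so
roughly (pool names are just examples):

ceph osd pool create testpool 512 512
# or, for an existing pool:
ceph osd pool set rbd pg_num 512
ceph osd pool set rbd pgp_num 512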

>  monmap e1: 3 mons at 
> {mon1=10.60.0.110:6789/0,mon2=10.60.0.111:6789/0,mon3=10.60.0.112:6789/0}, 
> election epoch 6, quorum 0,1,2 mon1,mon2,mon3
>  mdsmap e6: 1/1/1 up {0=mds1=up:active}, 2 up:standby
>  osdmap e61: 12 osds: 12 up, 12 in
>   pgmap v103: 192 pgs, 3 pools, 9470 bytes data, 21 objects
> 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] health HEALTH_WARN too few pgs per osd (16 < min 20)

2014-05-07 Thread Henrik Korkuc
On 2014.05.07 20:28, *sm1Ly wrote:
> I deployed my cluster with these commands.
>
> mkdir "clustername"
>  
> cd "clustername"
>  
> ceph-deploy install mon1 mon2 mon3 mds1 mds2 mds3 osd200
>  
> ceph-deploy  new  mon1 mon2 mon3
>  
> ceph-deploy mon create  mon1 mon2 mon3
>  
> ceph-deploy gatherkeys  mon1 mon2 mon3
>  
> ceph-deploy osd prepare --fs-type ext4 osd200:/osd/osd1
> osd200:/osd/osd2 osd200:/osd/osd3 osd200:/osd/osd4 osd200:/osd/osd5
> osd200:/osd/osd6 osd200:/osd/osd7 osd200:/osd/osd8 osd200:/osd/osd9
> osd200:/osd/osd10 osd200:/osd/osd11 osd200:/osd/osd12
>
> ceph-deploy osd activate osd200:/osd/osd1 osd200:/osd/osd2
> osd200:/osd/osd3 osd200:/osd/osd4 osd200:/osd/osd5 osd200:/osd/osd6
> osd200:/osd/osd7 osd200:/osd/osd8 osd200:/osd/osd9 osd200:/osd/osd10
> osd200:/osd/osd11 osd200:/osd/osd12
>
>  
> ceph-deploy admin mon1 mon2 mon3 mds1 mds2 mds3 osd200 salt1
>  
> ceph-deploy mds create mds1 mds2 mds3
>
> but in the end...:
>  
> sudo ceph -s
>
> [sm1ly@salt1 ceph]$ sudo ceph -s
> cluster 0b2c9c20-985a-4a39-af8e-ef2325234744
>  health HEALTH_WARN 19 pgs degraded; 192 pgs stuck unclean;
> recovery 21/42 objects degraded (50.000%); too few pgs per osd (16 <
> min 20)
>  monmap e1: 3 mons at
> {mon1=10.60.0.110:6789/0,mon2=10.60.0.111:6789/0,mon3=10.60.0.112:6789/0
> },
> election epoch 6, quorum 0,1,2 mon1,mon2,mon3
>  mdsmap e6: 1/1/1 up {0=mds1=up:active}, 2 up:standby
>  osdmap e61: 12 osds: 12 up, 12 in
>   pgmap v103: 192 pgs, 3 pools, 9470 bytes data, 21 objects
> 63751 MB used, 3069 GB / 3299 GB avail
> 21/42 objects degraded (50.000%)
>  159 active
>   14 active+remapped
>   19 active+degraded
>
>
> mon[123] and mds[123] are vms; osd200 is a hardware server, because on vms
> it shows bad performance.
>
> Some searching tells me that the problem is that I have only one osd
> node. Can I ignore it for tests?
"19 pgs degraded; 192 pgs stuck unclean; recovery 21/42 objects degraded
(50.000%)" you can ignore it, or edit the crush map so that the failure
domain is the osd, not the host (see the sketch below).
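
For a single-node test cluster, making the failure domain the OSD instead of
the host looks roughly like this (a rough sketch, names depend on your setup):

# either set this in ceph.conf before creating the cluster:
[global]
osd crush chooseleaf type = 0

# or edit the existing crush map:
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# change "step chooseleaf firstn 0 type host" to "... type osd", then:
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new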

> Another search tells me about pg groups, but I can't find how to get the pgid.
"too few pgs per osd (16 < min 20)" - increase pg_num and pgp_num
>
>
> -- 
> yours respectfully, Alexander Vasin.
>
> 8 926 1437200
> icq: 9906064
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] health HEALTH_WARN too few pgs per osd (16 < min 20)

2014-05-07 Thread *sm1Ly
I deployed my cluster with these commands.

mkdir "clustername"

cd "clustername"

ceph-deploy install mon1 mon2 mon3 mds1 mds2 mds3 osd200

ceph-deploy  new  mon1 mon2 mon3

ceph-deploy mon create  mon1 mon2 mon3

ceph-deploy gatherkeys  mon1 mon2 mon3

ceph-deploy osd prepare --fs-type ext4 osd200:/osd/osd1 osd200:/osd/osd2
osd200:/osd/osd3 osd200:/osd/osd4 osd200:/osd/osd5 osd200:/osd/osd6
osd200:/osd/osd7 osd200:/osd/osd8 osd200:/osd/osd9 osd200:/osd/osd10
osd200:/osd/osd11 osd200:/osd/osd12

ceph-deploy osd activate osd200:/osd/osd1 osd200:/osd/osd2 osd200:/osd/osd3
osd200:/osd/osd4 osd200:/osd/osd5 osd200:/osd/osd6 osd200:/osd/osd7
osd200:/osd/osd8 osd200:/osd/osd9 osd200:/osd/osd10 osd200:/osd/osd11
osd200:/osd/osd12


ceph-deploy admin mon1 mon2 mon3 mds1 mds2 mds3 osd200 salt1

ceph-deploy mds create mds1 mds2 mds3

but in the end...:

sudo ceph -s

[sm1ly@salt1 ceph]$ sudo ceph -s
cluster 0b2c9c20-985a-4a39-af8e-ef2325234744
 health HEALTH_WARN 19 pgs degraded; 192 pgs stuck unclean; recovery
21/42 objects degraded (50.000%); too few pgs per osd (16 < min 20)
 monmap e1: 3 mons at {mon1=
10.60.0.110:6789/0,mon2=10.60.0.111:6789/0,mon3=10.60.0.112:6789/0},
election epoch 6, quorum 0,1,2 mon1,mon2,mon3
 mdsmap e6: 1/1/1 up {0=mds1=up:active}, 2 up:standby
 osdmap e61: 12 osds: 12 up, 12 in
  pgmap v103: 192 pgs, 3 pools, 9470 bytes data, 21 objects
63751 MB used, 3069 GB / 3299 GB avail
21/42 objects degraded (50.000%)
 159 active
  14 active+remapped
  19 active+degraded


mon[123] and mds[123] are vms; osd200 is a hardware server, because on vms it
shows bad performance.

Some searching tells me that the problem is that I have only one osd node. Can
I ignore it for tests?
Another search tells me about pg groups, but I can't find how to get the pgid.


-- 
yours respectfully, Alexander Vasin.

8 926 1437200
icq: 9906064
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Craig Lewis
The 5 OSDs that are down have all been kicked out for being 
unresponsive.  The 5 OSDs are getting kicked faster than they can 
complete the recovery+backfill.  The number of degraded PGs is growing 
over time.


root@ceph0c:~# ceph -w
cluster 1604ec7a-6ceb-42fc-8c68-0a7896c4e120
 health HEALTH_WARN 49 pgs backfill; 926 pgs degraded; 252 pgs 
down; 30 pgs incomplete; 291 pgs peering; 1 pgs recovery_wait; 175 pgs 
stale; 255 pgs stuck inactive; 175 pgs stuck stale; 1234 pgs stuck 
unclean; 66 requests are blocked > 32 sec; recovery 6820014/3806 
objects degraded (17.921%); 4/16 in osds are down; noout flag(s) set
 monmap e2: 2 mons at 
{ceph0c=10.193.0.6:6789/0,ceph1c=10.193.0.7:6789/0}, election epoch 238, 
quorum 0,1 ceph0c,ceph1c

 osdmap e38673: 16 osds: 12 up, 16 in
flags noout
  pgmap v7325233: 2560 pgs, 17 pools, 14090 GB data, 18581 kobjects
28456 GB used, 31132 GB / 59588 GB avail
6820014/3806 objects degraded (17.921%)
   1 stale+active+clean+scrubbing+deep
  15 active
1247 active+clean
   1 active+recovery_wait
  45 stale+active+clean
  39 peering
  29 stale+active+degraded+wait_backfill
 252 down+peering
 827 active+degraded
  50 stale+active+degraded
  20 stale+active+degraded+remapped+wait_backfill
  30 stale+incomplete
   4 active+clean+scrubbing+deep

Here's a snippet of ceph.log for one of these OSDs:
2014-05-07 09:22:46.747036 mon.0 10.193.0.6:6789/0 39981 : [INF] osd.3 
marked down after no pg stats for 901.212859seconds
2014-05-07 09:47:17.930251 mon.0 10.193.0.6:6789/0 40561 : [INF] osd.3 
10.193.0.6:6812/2830 boot
2014-05-07 09:47:16.914519 osd.3 10.193.0.6:6812/2830 823 : [WRN] map 
e38649 wrongly marked me down


root@ceph0c:~# uname -a
Linux ceph0c 3.5.0-46-generic #70~precise1-Ubuntu SMP Thu Jan 9 23:55:12 
UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

root@ceph0c:~# lsb_release -a
No LSB modules are available.
Distributor ID:Ubuntu
Description:Ubuntu 12.04.4 LTS
Release:12.04
Codename:precise
root@ceph0c:~# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)


Any ideas what I can do to make these OSDs stop dying after 15 minutes?




--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us: Website | Twitter | Facebook | LinkedIn | Blog



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete pool .rgw.bucket and objects within it

2014-05-07 Thread Thanh Tran
Thanks Irek, it is correct as you said.

Best regards,
Thanh Tran


On Wed, May 7, 2014 at 2:15 PM, Irek Fasikhov  wrote:

> Yes, delete all the objects stored in the pool.
>
>
> 2014-05-07 6:58 GMT+04:00 Thanh Tran :
>
>> Hi,
>>
>> If i use command "ceph osd pool delete .rgw.bucket .rgw.bucket
>> --yes-i-really-really-mean-it" to delete the pool .rgw.bucket, will this
>> delete the pool, its objects and clean the data on osds?
>>
>> Best regards,
>> Thanh Tran
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Best regards, Irek Fasikhov
> Mobile: +79229045757
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ovirt

2014-05-07 Thread Nathan Stratton
Now that everyone will be one big happy family, any news on ceph support for
ovirt?

><>
nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 |
www.broadsoft.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80 Firefly released

2014-05-07 Thread Gregory Farnum
On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster
 wrote:
> Hi,
>
>
> Sage Weil wrote:
>
> * *Primary affinity*: Ceph now has the ability to skew selection of
>   OSDs as the "primary" copy, which allows the read workload to be
>   cheaply skewed away from parts of the cluster without migrating any
>   data.
>
>
> Can you please elaborate a bit on this one? I found the blueprint [1] but
> still don't quite understand how it works. Does this only change the crush
> calculation for reads? i.e writes still go to the usual primary, but reads
> are distributed across the replicas? If so, does this change the consistency
> model in any way.

It changes the calculation of who becomes the primary, and that
primary serves both reads and writes. In slightly more depth:
Previously, the primary has always been the first OSD chosen as a
member of the PG.
For erasure coding, we added the ability to specify a primary
independent of the selection ordering. This was part of a broad set of
changes to prevent moving the EC "shards" around between different
members of the PG, and means that the primary might be the second OSD
in the PG, or the fourth.
Once this work existed, we realized that it might be useful in other
cases, because primaries get more of the work for their PG (serving
all reads, coordinating writes).
So we added the ability to specify a "primary affinity", which is like
the CRUSH weights but only impacts whether you become the primary. So
if you have 3 OSDs that each have primary affinity = 1, it will behave
as normal. If two have primary affinity = 0, the remaining OSD will be
the primary. Etc.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80 Firefly released

2014-05-07 Thread Sage Weil
On Wed, 7 May 2014, Dan van der Ster wrote:
> Hi,
> 
> Sage Weil wrote:
> 
> * *Primary affinity*: Ceph now has the ability to skew selection of
>   OSDs as the "primary" copy, which allows the read workload to be
>   cheaply skewed away from parts of the cluster without migrating any
>   data.
> 
> 
> Can you please elaborate a bit on this one? I found the blueprint [1] but
> still don't quite understand how it works. Does this only change the crush
> calculation for reads? i.e writes still go to the usual primary, but reads
> are distributed across the replicas? If so, does this change the consistency
> model in any way.

It basically just skews the choice of which replica is the primary.  No 
data has to move, but the read workload and write overhead associated with 
being the primary (driving recovery and forwarding writes) is diverted 
away from the nodes whose 'affinity' is reduced from the default/baseline.

sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tiering

2014-05-07 Thread Mark Nelson

On 05/07/2014 10:38 AM, Gregory Farnum wrote:

On Wed, May 7, 2014 at 8:13 AM, Dan van der Ster
 wrote:

Hi,


Gregory Farnum wrote:

3) The cost of a cache miss is pretty high, so they should only be
used when the active set fits within the cache and doesn't change too
frequently.


Can you roughly quantify how long a cache miss would take? Naively I'd
assume it would turn one read into a read from the backing pool, a write
into the cache pool, then the read from the cache. Is that right?


Yes, that's roughly it. The part you're leaving out is that a write
may also require promotion, and if it does and the cache is full then
it requires an eviction, and that requires writes to the backing
pool...
Also, doubling the latency on a read can cross a lot of "I don't
notice it" boundaries.


So, Ceph will not automatically redirect to the base pool in case of
failures; in the general case it *can't*, but you could set up
monitoring to remove a read-only pool if that happens. But in general,
I would only explore cache pools if you expect to periodically pull in
working data sets out of much larger sets of cold data (e.g., jobs run
against a particular bit of scientific data out of your entire
archive).


That's a pity. What would be your hesitation about using WB caching with RBD
images, assuming the cache pool is sized large enough to match the working
set.


Just a general lack of data indicating it performs well. It will
certainly function, and if you have e.g. 1/4 of your RBD volumes in
use at a time according to time of day, I would expect it to do just
fine.


From what we've seen so far, there are definitely some tradeoffs with 
tiering.  Using it in the wrong way (IE for pools that have little hot 
data) can actually decrease overall performance.  We're working on doing 
some RBD tests with different skewed distributions now.



-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80 Firefly released

2014-05-07 Thread Dan van der Ster

Hi,

Sage Weil wrote:

* *Primary affinity*: Ceph now has the ability to skew selection of
   OSDs as the "primary" copy, which allows the read workload to be
   cheaply skewed away from parts of the cluster without migrating any
   data.


Can you please elaborate a bit on this one? I found the blueprint [1] 
but still don't quite understand how it works. Does this only change the 
crush calculation for reads? i.e writes still go to the usual primary, 
but reads are distributed across the replicas? If so, does this change 
the consistency model in any way.


Cheers, Dan



[1] 
http://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tiering

2014-05-07 Thread Gregory Farnum
On Wed, May 7, 2014 at 8:13 AM, Dan van der Ster
 wrote:
> Hi,
>
>
> Gregory Farnum wrote:
>
> 3) The cost of a cache miss is pretty high, so they should only be
> used when the active set fits within the cache and doesn't change too
> frequently.
>
>
> Can you roughly quantify how long a cache miss would take? Naively I'd
> assume it would turn one read into a read from the backing pool, a write
> into the cache pool, then the read from the cache. Is that right?

Yes, that's roughly it. The part you're leaving out is that a write
may also require promotion, and if it does and the cache is full then
it requires an eviction, and that requires writes to the backing
pool...
Also, doubling the latency on a read can cross a lot of "I don't
notice it" boundaries.

> So, Ceph will not automatically redirect to the base pool in case of
> failures; in the general case it *can't*, but you could set up
> monitoring to remove a read-only pool if that happens. But in general,
> I would only explore cache pools if you expect to periodically pull in
> working data sets out of much larger sets of cold data (e.g., jobs run
> against a particular bit of scientific data out of your entire
> archive).
>
>
> That's a pity. What would be your hesitation about using WB caching with RBD
> images, assuming the cache pool is sized large enough to match the working
> set.

Just a general lack of data indicating it performs well. It will
certainly function, and if you have e.g. 1/4 of your RBD volumes in
use at a time according to time of day, I would expect it to do just
fine.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80 Firefly released

2014-05-07 Thread Sage Weil
On Wed, 7 May 2014, Kenneth Waegeman wrote:
> - Message from Sage Weil  -
>   Date: Tue, 6 May 2014 18:05:19 -0700 (PDT)
>   From: Sage Weil 
> Subject: [ceph-users] v0.80 Firefly released
> To: ceph-de...@vger.kernel.org, ceph-us...@ceph.com
> 
> 
> > We did it!  Firefly v0.80 is built and pushed out to the ceph.com
> > repositories.
> > 
> > This release will form the basis for our long-term supported release
> > Firefly, v0.80.x.  The big new features are support for erasure coding
> > and cache tiering, although a broad range of other features, fixes,
> > and improvements have been made across the code base.  Highlights include:
> > 
> > * *Erasure coding*: support for a broad range of erasure codes for lower
> >  storage overhead and better data durability.
> > * *Cache tiering*: support for creating 'cache pools' that store hot,
> >  recently accessed objects with automatic demotion of colder data to
> >  a base tier.  Typically the cache pool is backed by faster storage
> >  devices like SSDs.
> > * *Primary affinity*: Ceph now has the ability to skew selection of
> >  OSDs as the "primary" copy, which allows the read workload to be
> >  cheaply skewed away from parts of the cluster without migrating any
> >  data.
> > * *Key/value OSD backend* (experimental): An alternative storage backend
> >  for Ceph OSD processes that puts all data in a key/value database like
> >  leveldb.  This provides better performance for workloads dominated by
> >  key/value operations (like radosgw bucket indices).
> 
> Nice!
> Is there already some documentation about this Key/value OSD back-end topic,
> like how to use, restrictions, ..?

Not yet!  To get started playing with it, you can put 'osd objectstore = 
keyvaluestore-dev' in your ceph.conf.
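
For example (in the [osd] section):

[osd]
osd objectstore = keyvaluestore-dev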

> A question (referring to
> http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/):
> Do we need a journal when using this back-end?

There is no journal needed because leveldb (and, soon, rocksdb) are 
transactional.

> And is this compatible for use with CephFS?

Yes. However, rbd and cephfs workloads include partial object updates 
which are likely to be pretty inefficient on top of these backends, so 
this is likely not a panacea.  And it is still early days!  Most of this 
work is being done by Haomai Wang at Unitedstack, CCed.

sage


> 
> Thanks!
> 
> 
> > * *Standalone radosgw* (experimental): The radosgw process can now run
> >  in a standalone mode without an apache (or similar) web server or
> >  fastcgi.  This simplifies deployment and can improve performance.
> > 
> > We expect to maintain a series of stable releases based on v0.80
> > Firefly for as much as a year.  In the meantime, development of Ceph
> > continues with the next release, Giant, which will feature work on the
> > CephFS distributed file system, more alternative storage backends
> > (like RocksDB and f2fs), RDMA support, support for pyramid erasure
> > codes, and additional functionality in the block device (RBD) like
> > copy-on-read and multisite mirroring.
> > 
> > This release is the culmination of a huge collective effort by about 100
> > different contributors.  Thank you everyone who has helped to make this
> > possible!
> > 
> > Upgrade Sequencing
> > --
> > 
> > * If your existing cluster is running a version older than v0.67
> >  Dumpling, please first upgrade to the latest Dumpling release before
> >  upgrading to v0.80 Firefly.  Please refer to the :ref:`Dumpling upgrade`
> >  documentation.
> > 
> > * Upgrade daemons in the following order:
> > 
> >1. Monitors
> >2. OSDs
> >3. MDSs and/or radosgw
> > 
> >  If the ceph-mds daemon is restarted first, it will wait until all
> >  OSDs have been upgraded before finishing its startup sequence.  If
> >  the ceph-mon daemons are not restarted prior to the ceph-osd
> >  daemons, they will not correctly register their new capabilities
> >  with the cluster and new features may not be usable until they are
> >  restarted a second time.
> > 
> > * Upgrade radosgw daemons together.  There is a subtle change in behavior
> >  for multipart uploads that prevents a multipart request that was initiated
> >  with a new radosgw from being completed by an old radosgw.
> > 
> > Notable changes since v0.79
> > ---
> > 
> > * ceph-fuse, libcephfs: fix several caching bugs (Yan, Zheng)
> > * ceph-fuse: trim inodes in response to mds memory pressure (Yan, Zheng)
> > * librados: fix inconsistencies in API error values (David Zafman)
> > * librados: fix watch operations with cache pools (Sage Weil)
> > * librados: new snap rollback operation (David Zafman)
> > * mds: fix respawn (John Spray)
> > * mds: misc bugs (Yan, Zheng)
> > * mds: misc multi-mds fixes (Yan, Zheng)
> > * mds: use shared_ptr for requests (Greg Farnum)
> > * mon: fix peer feature checks (Sage Weil)
> > * mon: require 'x' mon caps for auth operations (Joao Luis)
* mon: shutdown when removed from mon cluster (Joao Luis)

Re: [ceph-users] Cache tiering

2014-05-07 Thread Sage Weil
On Wed, 7 May 2014, Gandalf Corvotempesta wrote:
> Very simple question: what happen if server bound to the cache pool goes down?
> For example, a read-only cache could be archived by using a single
> server with no redudancy.
> Is ceph smart enough to detect that cache is unavailable and
> transparently redirect all request to the main pool as usual ?

This would make sense only for the readonly cache mode, where the cache 
version will be identical to the base pool version.  Right now the answer 
is no--it's not smart enough to do that.

In general, you want to have redundancy in the cache pool, too, so 
that you can tolerate disk and node failures in the cache.
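
For reference, wiring a cache pool in front of a base pool is just a few
commands (pool names below are placeholders):

ceph osd tier add base-pool cache-pool
ceph osd tier cache-mode cache-pool readonly
# (for writeback mode you would also run: ceph osd tier set-overlay base-pool cache-pool)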

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80 Firefly released

2014-05-07 Thread Kenneth Waegeman


- Message from Alexandre DERUMIER  -
   Date: Wed, 07 May 2014 15:21:55 +0200 (CEST)
   From: Alexandre DERUMIER 
Subject: Re: [ceph-users] v0.80 Firefly released
 To: Kenneth Waegeman 
 Cc: ceph-us...@ceph.com, Sage Weil 



Do we need a journal when using this back-end?


No, there is no journal with the key/value store.


Thanks. And how can I activate this?


- Original Message -

From: "Kenneth Waegeman" 
To: "Sage Weil" 
Cc: ceph-us...@ceph.com
Sent: Wednesday 7 May 2014 15:06:50
Subject: Re: [ceph-users] v0.80 Firefly released


- Message from Sage Weil  -
Date: Tue, 6 May 2014 18:05:19 -0700 (PDT)
From: Sage Weil 
Subject: [ceph-users] v0.80 Firefly released
To: ceph-de...@vger.kernel.org, ceph-us...@ceph.com



We did it! Firefly v0.80 is built and pushed out to the ceph.com
repositories.

This release will form the basis for our long-term supported release
Firefly, v0.80.x. The big new features are support for erasure coding
and cache tiering, although a broad range of other features, fixes,
and improvements have been made across the code base. Highlights include:

* *Erasure coding*: support for a broad range of erasure codes for lower
storage overhead and better data durability.
* *Cache tiering*: support for creating 'cache pools' that store hot,
recently accessed objects with automatic demotion of colder data to
a base tier. Typically the cache pool is backed by faster storage
devices like SSDs.
* *Primary affinity*: Ceph now has the ability to skew selection of
OSDs as the "primary" copy, which allows the read workload to be
cheaply skewed away from parts of the cluster without migrating any
data.
* *Key/value OSD backend* (experimental): An alternative storage backend
for Ceph OSD processes that puts all data in a key/value database like
leveldb. This provides better performance for workloads dominated by
key/value operations (like radosgw bucket indices).


Nice!
Is there already some documentation about this Key/value OSD back-end
topic, like how to use, restrictions, ..?
A question (referring to
http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/):
Do we need a journal when using this back-end?
And is this compatible for use with CephFS?

Thanks!



* *Standalone radosgw* (experimental): The radosgw process can now run
in a standalone mode without an apache (or similar) web server or
fastcgi. This simplifies deployment and can improve performance.

We expect to maintain a series of stable releases based on v0.80
Firefly for as much as a year. In the meantime, development of Ceph
continues with the next release, Giant, which will feature work on the
CephFS distributed file system, more alternative storage backends
(like RocksDB and f2fs), RDMA support, support for pyramid erasure
codes, and additional functionality in the block device (RBD) like
copy-on-read and multisite mirroring.

This release is the culmination of a huge collective effort by about 100
different contributors. Thank you everyone who has helped to make this
possible!

Upgrade Sequencing
--

* If your existing cluster is running a version older than v0.67
Dumpling, please first upgrade to the latest Dumpling release before
upgrading to v0.80 Firefly. Please refer to the :ref:`Dumpling upgrade`
documentation.

* Upgrade daemons in the following order:

1. Monitors
2. OSDs
3. MDSs and/or radosgw

If the ceph-mds daemon is restarted first, it will wait until all
OSDs have been upgraded before finishing its startup sequence. If
the ceph-mon daemons are not restarted prior to the ceph-osd
daemons, they will not correctly register their new capabilities
with the cluster and new features may not be usable until they are
restarted a second time.

* Upgrade radosgw daemons together. There is a subtle change in behavior
for multipart uploads that prevents a multipart request that was initiated
with a new radosgw from being completed by an old radosgw.
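
As an aside, a hedged sketch for verifying that the restart order took effect, i.e.
that every daemon reports the new version (the wildcard tell and the admin-socket path
are assumptions about this firefly-era CLI):

ceph tell osd.* version
ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok version   # per monitor; socket name assumed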

Notable changes since v0.79
---

* ceph-fuse, libcephfs: fix several caching bugs (Yan, Zheng)
* ceph-fuse: trim inodes in response to mds memory pressure (Yan, Zheng)
* librados: fix inconsistencies in API error values (David Zafman)
* librados: fix watch operations with cache pools (Sage Weil)
* librados: new snap rollback operation (David Zafman)
* mds: fix respawn (John Spray)
* mds: misc bugs (Yan, Zheng)
* mds: misc multi-mds fixes (Yan, Zheng)
* mds: use shared_ptr for requests (Greg Farnum)
* mon: fix peer feature checks (Sage Weil)
* mon: require 'x' mon caps for auth operations (Joao Luis)
* mon: shutdown when removed from mon cluster (Joao Luis)
* msgr: fix locking bug in authentication (Josh Durgin)
* osd: fix bug in journal replay/restart (Sage Weil)
* osd: many many many bug fixes with cache tiering (Samuel Just)
* osd: track omap and hit_set objects in pg stats (Samuel Just)
* osd: warn if agent cannot enable due to invalid (post-split) stats
(Sage Weil)
* rados bench: track metadata for multiple runs separately (Guang Yang)

Re: [ceph-users] Cache tiering

2014-05-07 Thread Dan van der Ster

Hi,

Gregory Farnum wrote:

3) The cost of a cache miss is pretty high, so they should only be
used when the active set fits within the cache and doesn't change too
frequently.


Can you roughly quantify how long a cache miss would take? Naively I'd 
assume it would turn one read into a read from the backing pool, a write 
into the cache pool, then the read from the cache. Is that right?



So, Ceph will not automatically redirect to the base pool in case of
failures; in the general case it*can't*, but you could set up
monitoring to remove a read-only pool if that happens. But in general,
I would only explore cache pools if you expect to periodically pull in
working data sets out of much larger sets of cold data (e.g., jobs run
against a particular bit of scientific data out of your entire
archive).


That's a pity. What would be your hesitation about using WB caching with 
RBD images, assuming the cache pool is sized large enough to match the 
working set?


Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Explicit F2FS support (was: v0.80 Firefly released)

2014-05-07 Thread Sage Weil
On Wed, 7 May 2014, Andrey Korolyov wrote:
> Hello,
> 
> first of all, congratulations to Inktank and thank you for your awesome work!
> 
> Although exploiting native f2fs abilities, as with btrfs, sounds
> awesome performance-wise, I wonder: if the kv backend can in practice
> give users of 'legacy' file systems CoW operations as fast as on a
> log-based fs, with little or no performance impact, what is the primary
> idea behind introducing an interface bound to a specific filesystem at
> the same time? Of course I believe that f2fs will outperform almost
> every competitor in its field - non-rotating media operations - but I
> would be grateful if someone could shed light on this development choice.

There are multiple directions to pursue, and they may be more or less 
suitable for different types of workloads.  f2fs is obviously targeted at 
SSD-backed nodes and more general purpose workloads, the main win being 
that we should be able to eliminate the ceph journal entirely.  The 
RocksDB support is a drop-in replacement for leveldb, which we already use 
for lots of OSD metadata.  There is also work in replacing the file-based 
strategy for storing objects entirely in a kv store (whether it is leveldb 
or rocksdb or something else like kinetic or NVMKV).  Some of this will be 
generally useful, and some will work better for specific types of 
workloads.

sage

> 
> On Wed, May 7, 2014 at 5:05 AM, Sage Weil  wrote:
> > We did it!  Firefly v0.80 is built and pushed out to the ceph.com
> > repositories.
> >
> > This release will form the basis for our long-term supported release
> > Firefly, v0.80.x.  The big new features are support for erasure coding
> > and cache tiering, although a broad range of other features, fixes,
> > and improvements have been made across the code base.  Highlights include:
> >
> > * *Erasure coding*: support for a broad range of erasure codes for lower
> >   storage overhead and better data durability.
> > * *Cache tiering*: support for creating 'cache pools' that store hot,
> >   recently accessed objects with automatic demotion of colder data to
> >   a base tier.  Typically the cache pool is backed by faster storage
> >   devices like SSDs.
> > * *Primary affinity*: Ceph now has the ability to skew selection of
> >   OSDs as the "primary" copy, which allows the read workload to be
> >   cheaply skewed away from parts of the cluster without migrating any
> >   data.
> > * *Key/value OSD backend* (experimental): An alternative storage backend
> >   for Ceph OSD processes that puts all data in a key/value database like
> >   leveldb.  This provides better performance for workloads dominated by
> >   key/value operations (like radosgw bucket indices).
> > * *Standalone radosgw* (experimental): The radosgw process can now run
> >   in a standalone mode without an apache (or similar) web server or
> >   fastcgi.  This simplifies deployment and can improve performance.
> >
> > We expect to maintain a series of stable releases based on v0.80
> > Firefly for as much as a year.  In the meantime, development of Ceph
> > continues with the next release, Giant, which will feature work on the
> > CephFS distributed file system, more alternative storage backends
> > (like RocksDB and f2fs), RDMA support, support for pyramid erasure
> > codes, and additional functionality in the block device (RBD) like
> > copy-on-read and multisite mirroring.
> >
> > This release is the culmination of a huge collective effort by about 100
> > different contributors.  Thank you everyone who has helped to make this
> > possible!
> >
> > Upgrade Sequencing
> > --
> >
> > * If your existing cluster is running a version older than v0.67
> >   Dumpling, please first upgrade to the latest Dumpling release before
> >   upgrading to v0.80 Firefly.  Please refer to the :ref:`Dumpling upgrade`
> >   documentation.
> >
> > * Upgrade daemons in the following order:
> >
> > 1. Monitors
> > 2. OSDs
> > 3. MDSs and/or radosgw
> >
> >   If the ceph-mds daemon is restarted first, it will wait until all
> >   OSDs have been upgraded before finishing its startup sequence.  If
> >   the ceph-mon daemons are not restarted prior to the ceph-osd
> >   daemons, they will not correctly register their new capabilities
> >   with the cluster and new features may not be usable until they are
> >   restarted a second time.
> >
> > * Upgrade radosgw daemons together.  There is a subtle change in behavior
> >   for multipart uploads that prevents a multipart request that was initiated
> >   with a new radosgw from being completed by an old radosgw.
> >
> > Notable changes since v0.79
> > ---
> >
> > * ceph-fuse, libcephfs: fix several caching bugs (Yan, Zheng)
> > * ceph-fuse: trim inodes in response to mds memory pressure (Yan, Zheng)
> > * librados: fix inconsistencies in API error values (David Zafman)
> > * librados: fix watch operations with cache pools

Re: [ceph-users] Cache tiering

2014-05-07 Thread Gregory Farnum
On Wed, May 7, 2014 at 5:05 AM, Gandalf Corvotempesta
 wrote:
> Very simple question: what happens if the server bound to the cache pool goes down?
> For example, a read-only cache could be achieved by using a single
> server with no redundancy.
> Is ceph smart enough to detect that the cache is unavailable and
> transparently redirect all requests to the main pool as usual?
>
> This would allow the usage of one very big server as cache-only. No need for
> redundancy.
>
> Second question: how can I set the cache pool to be on a defined list of OSDs
> and not distributed across all OSDs? Should I change the crushmap? I
> would like to set a single server (and all of its OSDs) for the cache
> pool.

At present, the cache pools are fairly limited in their real-world usefulness.
1) When used in writeback mode, they are the authoritative source for
data, so they must be redundant.
2) When used in readonly mode, they aren't consistent if the
underlying data gets modified.
3) The cost of a cache miss is pretty high, so they should only be
used when the active set fits within the cache and doesn't change too
frequently.

So, Ceph will not automatically redirect to the base pool in case of
failures; in the general case it *can't*, but you could set up
monitoring to remove a read-only pool if that happens. But in general,
I would only explore cache pools if you expect to periodically pull in
working data sets out of much larger sets of cold data (e.g., jobs run
against a particular bit of scientific data out of your entire
archive).
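
A sketch of what such a monitoring action could run to drop a failed read-only tier
(firefly-era syntax; the pool names are assumptions):

ceph osd tier cache-mode cachepool none
ceph osd tier remove-overlay basepool
ceph osd tier remove basepool cachepool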
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [rados-java] Hi, I am a newer for ceph . And I found rados-java in github, but there are some problems for me .

2014-05-07 Thread Wido den Hollander

On 05/05/2014 05:39 AM, peng wrote:

I have installed the latest JDK and set the ant target and source to 1.7,
but I always encounter the same error message.




You also need jna-platform.jar to compile rados-java.

I suggest you place that in /usr/share/java as well.
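
For example, a hedged sketch of the steps, assuming jna-platform.jar is taken from
your distro package or a JNA release download:

cp jna-platform.jar /usr/share/java/
ls /usr/share/java/jna*.jar      # both jna.jar and jna-platform.jar should be listed
ant jar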

Wido


-- Original --
*From: * "peng";;
*Date: * Sun, May 4, 2014 06:48 PM
*To: * "wido";
*Subject: * [rados-java] Hi, I am a newer for ceph . And I found
rados-java in github,but there are some problems for me .

Hi,
  Firstly, I have to say rados-java is exactly what I need. It's
very much in demand.
  After I downloaded the source zip and unzipped it, I ran the
command "ant jar".
  First, I saw it needs JDK 1.7, but I only have JDK 1.6 on my
machine, so I edited build.properties and changed the version to 1.6.
  Second, I tried "ant jar" again and got the following
output:

( To make sure jna.jar is right there,  I type :

[root@mon rados-java-master]# ls /usr/share/java/jna.jar
/usr/share/java/jna.jar

  )

 build:
 [javac] Compiling 15 source files to
/root/baipeng/rados-java-master/target/classes
 [javac]
/root/baipeng/rados-java-master/src/main/java/com/ceph/rados/Library.java:39:
cannot find symbol
 [javac] symbol  : method nativeValue(com.sun.jna.Pointer,long)
 [javac] location: class com.sun.jna.Pointer
 [javac] Pointer.nativeValue(ptr, 0L);
 [javac]^
 [javac]
/root/baipeng/rados-java-master/src/main/java/com/ceph/rbd/Library.java:39:
cannot find symbol
 [javac] symbol  : method nativeValue(com.sun.jna.Pointer,long)
 [javac] location: class com.sun.jna.Pointer
 [javac] Pointer.nativeValue(ptr, 0L);
 [javac]^
 [javac] 2 errors
It seems there is something wrong with jna. Appreciate any help :)
Thanks a lot.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Andrei Mikhailovsky
> It's important to disable the rbd cache on tgtd host. Set in 
> /etc/ceph/ceph.conf: 


Should this be done on the iscsi target server? I have a default option to 
enable rbd caching as it speeds things up on the vms. 

Thanks 

Andrei 




- Original Message -

From: "Vlad Gorbunov"  
To: "Sergey Malinin"  
Cc: "Andrei Mikhailovsky" , ceph-users@lists.ceph.com 
Sent: Wednesday, 7 May, 2014 2:23:52 PM 
Subject: Re: [ceph-users] NFS over CEPH - best practice 


It's easy to install tgtd with ceph support. ubuntu 12.04 for example: 
Connect ceph-extras repo: 
echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main | 
sudo tee /etc/apt/sources.list.d/ceph-extras.list 


Install tgtd with rbd support: 
apt-get update 
apt-get install tgt 

It's important to disable the rbd cache on tgtd host. Set in 
/etc/ceph/ceph.conf: 
[client] 
rbd_cache = false 

Define permanent export rbd with iscsi in /etc/tgt/targets.conf: 

 
driver iscsi 
bs-type rbd 
backing-store iscsi/volume512 
initiator-address 10.166.18.87 
, wrote: 



http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
 

On Wednesday, May 7, 2014 at 15:06, Andrei Mikhailovsky wrote: 






Vlad, is there a howto somewhere describing the steps on how to setup iscsi 
multipathing over ceph? It looks like a good alternative to nfs 

Thanks 



From: "Vlad Gorbunov" < vadi...@gmail.com > 
To: "Andrei Mikhailovsky" < and...@arhont.com > 
Cc: ceph-users@lists.ceph.com 
Sent: Wednesday, 7 May, 2014 12:02:09 PM 
Subject: Re: [ceph-users] NFS over CEPH - best practice 

For XenServer or VMware is better to use iscsi client to tgtd with ceph 
support. You can install tgtd on osd or monitor server and use multipath for 
failover. 



On Wed, May 7, 2014 at 9:47 PM, Andrei Mikhailovsky < and...@arhont.com > 
wrote: 



Hello guys, 

I would like to offer NFS service to the XenServer and VMWare hypervisors for 
storing vm images. I am currently running ceph rbd with kvm, which is working 
reasonably well. 

What would be the best way of running NFS services over CEPH, so that the 
XenServer and VMWare's vm disk images are stored in ceph storage over NFS? 

Many thanks 

Andrei 






___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Vlad Gorbunov
It's easy to install tgtd with ceph support. ubuntu 12.04 for example:


Connect ceph-extras repo:
echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main | 
sudo tee /etc/apt/sources.list.d/ceph-extras.list


Install tgtd with rbd support:
apt-get update
apt-get install tgt

It's important to disable the rbd cache on tgtd host. Set in 
/etc/ceph/ceph.conf:
[client]
rbd_cache = false

Define permanent export rbd with iscsi in /etc/tgt/targets.conf:


    driver iscsi
    bs-type rbd
    backing-store iscsi/volume512
    initiator-address 10.166.18.87
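
For completeness, a hedged sketch of a full targets.conf stanza (the IQN is made up;
only the directives above are from this thread):

<target iqn.2014-05.com.example:rbd.volume512>
    driver iscsi
    bs-type rbd
    backing-store iscsi/volume512
    initiator-address 10.166.18.87
</target>

# then reload the exports, for example:
tgt-admin --update ALL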
, wrote:



http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices




 
On Wednesday, May 7, 2014 at 15:06, Andrei Mikhailovsky wrote:






Vlad, is there a howto somewhere describing the steps on how to setup iscsi 
multipathing over ceph? It looks like a good alternative to nfs

Thanks




From: "Vlad Gorbunov" 
To: "Andrei Mikhailovsky" 
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, 7 May, 2014 12:02:09 PM
Subject: Re: [ceph-users] NFS over CEPH - best practice

For XenServer or VMware is better to use iscsi client to tgtd with ceph 
support. You can install tgtd on osd or monitor server and use multipath for 
failover.


On Wed, May 7, 2014 at 9:47 PM, Andrei Mikhailovsky  wrote:


Hello guys,

I would like to offer NFS service to the XenServer and VMWare hypervisors for 
storing vm images. I am currently running ceph rbd with kvm, which is working 
reasonably well.

What would be the best way of running NFS services over CEPH, so that the 
XenServer and VMWare's vm disk images are stored in ceph storage over NFS?

Many thanks

Andrei 












___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80 Firefly released

2014-05-07 Thread Alexandre DERUMIER
>>Do we need a journal when using this back-end? 

No, there is no journal with the key/value store backend.

- Mail original - 

De: "Kenneth Waegeman"  
À: "Sage Weil"  
Cc: ceph-us...@ceph.com 
Envoyé: Mercredi 7 Mai 2014 15:06:50 
Objet: Re: [ceph-users] v0.80 Firefly released 


- Message from Sage Weil  - 
Date: Tue, 6 May 2014 18:05:19 -0700 (PDT) 
From: Sage Weil  
Subject: [ceph-users] v0.80 Firefly released 
To: ceph-de...@vger.kernel.org, ceph-us...@ceph.com 


> We did it! Firefly v0.80 is built and pushed out to the ceph.com 
> repositories. 
> 
> This release will form the basis for our long-term supported release 
> Firefly, v0.80.x. The big new features are support for erasure coding 
> and cache tiering, although a broad range of other features, fixes, 
> and improvements have been made across the code base. Highlights include: 
> 
> * *Erasure coding*: support for a broad range of erasure codes for lower 
> storage overhead and better data durability. 
> * *Cache tiering*: support for creating 'cache pools' that store hot, 
> recently accessed objects with automatic demotion of colder data to 
> a base tier. Typically the cache pool is backed by faster storage 
> devices like SSDs. 
> * *Primary affinity*: Ceph now has the ability to skew selection of 
> OSDs as the "primary" copy, which allows the read workload to be 
> cheaply skewed away from parts of the cluster without migrating any 
> data. 
> * *Key/value OSD backend* (experimental): An alternative storage backend 
> for Ceph OSD processes that puts all data in a key/value database like 
> leveldb. This provides better performance for workloads dominated by 
> key/value operations (like radosgw bucket indices). 

Nice! 
Is there already some documentation about this Key/value OSD back-end 
topic, like how to use, restrictions, ..? 
A question (referring to 
http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/):
 Do we need a journal when using this 
back-end? 
And is this compatible for use with CephFS? 

Thanks! 


> * *Standalone radosgw* (experimental): The radosgw process can now run 
> in a standalone mode without an apache (or similar) web server or 
> fastcgi. This simplifies deployment and can improve performance. 
> 
> We expect to maintain a series of stable releases based on v0.80 
> Firefly for as much as a year. In the meantime, development of Ceph 
> continues with the next release, Giant, which will feature work on the 
> CephFS distributed file system, more alternative storage backends 
> (like RocksDB and f2fs), RDMA support, support for pyramid erasure 
> codes, and additional functionality in the block device (RBD) like 
> copy-on-read and multisite mirroring. 
> 
> This release is the culmination of a huge collective effort by about 100 
> different contributors. Thank you everyone who has helped to make this 
> possible! 
> 
> Upgrade Sequencing 
> -- 
> 
> * If your existing cluster is running a version older than v0.67 
> Dumpling, please first upgrade to the latest Dumpling release before 
> upgrading to v0.80 Firefly. Please refer to the :ref:`Dumpling upgrade` 
> documentation. 
> 
> * Upgrade daemons in the following order: 
> 
> 1. Monitors 
> 2. OSDs 
> 3. MDSs and/or radosgw 
> 
> If the ceph-mds daemon is restarted first, it will wait until all 
> OSDs have been upgraded before finishing its startup sequence. If 
> the ceph-mon daemons are not restarted prior to the ceph-osd 
> daemons, they will not correctly register their new capabilities 
> with the cluster and new features may not be usable until they are 
> restarted a second time. 
> 
> * Upgrade radosgw daemons together. There is a subtle change in behavior 
> for multipart uploads that prevents a multipart request that was initiated 
> with a new radosgw from being completed by an old radosgw. 
> 
> Notable changes since v0.79 
> --- 
> 
> * ceph-fuse, libcephfs: fix several caching bugs (Yan, Zheng) 
> * ceph-fuse: trim inodes in response to mds memory pressure (Yan, Zheng) 
> * librados: fix inconsistencies in API error values (David Zafman) 
> * librados: fix watch operations with cache pools (Sage Weil) 
> * librados: new snap rollback operation (David Zafman) 
> * mds: fix respawn (John Spray) 
> * mds: misc bugs (Yan, Zheng) 
> * mds: misc multi-mds fixes (Yan, Zheng) 
> * mds: use shared_ptr for requests (Greg Farnum) 
> * mon: fix peer feature checks (Sage Weil) 
> * mon: require 'x' mon caps for auth operations (Joao Luis) 
> * mon: shutdown when removed from mon cluster (Joao Luis) 
> * msgr: fix locking bug in authentication (Josh Durgin) 
> * osd: fix bug in journal replay/restart (Sage Weil) 
> * osd: many many many bug fixes with cache tiering (Samuel Just) 
> * osd: track omap and hit_set objects in pg stats (Samuel Just) 
> * osd: warn if agent cannot enable due to invalid (post-split) stats 
> (Sage Weil) 
> * rados bench: track metadata for multiple runs separately (Guang Yang) 

Re: [ceph-users] v0.80 Firefly released

2014-05-07 Thread Kenneth Waegeman


- Message from Sage Weil  -
   Date: Tue, 6 May 2014 18:05:19 -0700 (PDT)
   From: Sage Weil 
Subject: [ceph-users] v0.80 Firefly released
 To: ceph-de...@vger.kernel.org, ceph-us...@ceph.com



We did it!  Firefly v0.80 is built and pushed out to the ceph.com
repositories.

This release will form the basis for our long-term supported release
Firefly, v0.80.x.  The big new features are support for erasure coding
and cache tiering, although a broad range of other features, fixes,
and improvements have been made across the code base.  Highlights include:

* *Erasure coding*: support for a broad range of erasure codes for lower
  storage overhead and better data durability.
* *Cache tiering*: support for creating 'cache pools' that store hot,
  recently accessed objects with automatic demotion of colder data to
  a base tier.  Typically the cache pool is backed by faster storage
  devices like SSDs.
* *Primary affinity*: Ceph now has the ability to skew selection of
  OSDs as the "primary" copy, which allows the read workload to be
  cheaply skewed away from parts of the cluster without migrating any
  data.
* *Key/value OSD backend* (experimental): An alternative storage backend
  for Ceph OSD processes that puts all data in a key/value database like
  leveldb.  This provides better performance for workloads dominated by
  key/value operations (like radosgw bucket indices).


Nice!
Is there already some documentation about this Key/value OSD back-end  
topic, like how to use, restrictions, ..?
A question (referring to  
http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/): Do we need a journal when using this  
back-end?

And is this compatible for use with CephFS?

Thanks!



* *Standalone radosgw* (experimental): The radosgw process can now run
  in a standalone mode without an apache (or similar) web server or
  fastcgi.  This simplifies deployment and can improve performance.

We expect to maintain a series of stable releases based on v0.80
Firefly for as much as a year.  In the meantime, development of Ceph
continues with the next release, Giant, which will feature work on the
CephFS distributed file system, more alternative storage backends
(like RocksDB and f2fs), RDMA support, support for pyramid erasure
codes, and additional functionality in the block device (RBD) like
copy-on-read and multisite mirroring.

This release is the culmination of a huge collective effort by about 100
different contributors.  Thank you everyone who has helped to make this
possible!

Upgrade Sequencing
--

* If your existing cluster is running a version older than v0.67
  Dumpling, please first upgrade to the latest Dumpling release before
  upgrading to v0.80 Firefly.  Please refer to the :ref:`Dumpling upgrade`
  documentation.

* Upgrade daemons in the following order:

1. Monitors
2. OSDs
3. MDSs and/or radosgw

  If the ceph-mds daemon is restarted first, it will wait until all
  OSDs have been upgraded before finishing its startup sequence.  If
  the ceph-mon daemons are not restarted prior to the ceph-osd
  daemons, they will not correctly register their new capabilities
  with the cluster and new features may not be usable until they are
  restarted a second time.

* Upgrade radosgw daemons together.  There is a subtle change in behavior
  for multipart uploads that prevents a multipart request that was initiated
  with a new radosgw from being completed by an old radosgw.

Notable changes since v0.79
---

* ceph-fuse, libcephfs: fix several caching bugs (Yan, Zheng)
* ceph-fuse: trim inodes in response to mds memory pressure (Yan, Zheng)
* librados: fix inconsistencies in API error values (David Zafman)
* librados: fix watch operations with cache pools (Sage Weil)
* librados: new snap rollback operation (David Zafman)
* mds: fix respawn (John Spray)
* mds: misc bugs (Yan, Zheng)
* mds: misc multi-mds fixes (Yan, Zheng)
* mds: use shared_ptr for requests (Greg Farnum)
* mon: fix peer feature checks (Sage Weil)
* mon: require 'x' mon caps for auth operations (Joao Luis)
* mon: shutdown when removed from mon cluster (Joao Luis)
* msgr: fix locking bug in authentication (Josh Durgin)
* osd: fix bug in journal replay/restart (Sage Weil)
* osd: many many many bug fixes with cache tiering (Samuel Just)
* osd: track omap and hit_set objects in pg stats (Samuel Just)
* osd: warn if agent cannot enable due to invalid (post-split) stats  
(Sage Weil)

* rados bench: track metadata for multiple runs separately (Guang Yang)
* rgw: fixed subuser modify (Yehuda Sadeh)
* rpm: fix redhat-lsb dependency (Sage Weil, Alfredo Deza)

For the complete release notes, please see:

   http://ceph.com/docs/master/release-notes/#v0-80-firefly


Getting Ceph


* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.80.tar.gz
* For packages, see http://ceph.com/docs/master/install/get-packages
* F

Re: [ceph-users] Cache tiering

2014-05-07 Thread Wido den Hollander

On 05/07/2014 02:05 PM, Gandalf Corvotempesta wrote:

Very simple question: what happens if the server bound to the cache pool goes down?
For example, a read-only cache could be achieved by using a single
server with no redundancy.
Is ceph smart enough to detect that the cache is unavailable and
transparently redirect all requests to the main pool as usual?

This would allow the usage of one very big server as cache-only. No need for
redundancy.


Not sure about that :)



Second question: how can I set the cache pool to be on a defined list of OSDs
and not distributed across all OSDs? Should I change the crushmap? I
would like to set a single server (and all of its OSDs) for the cache
pool.


Create a ruleset where only that host/OSD is in and change the 
crush_ruleset setting for the cache pool to use that specific ruleset.
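
A rough sketch of that workflow (file names, the rule edit and the ruleset id 4 are
assumptions):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: add a rule whose "step take" points at the cache host/root
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool set cachepool crush_ruleset 4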



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Sergey Malinin
http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
 


On Wednesday, May 7, 2014 at 15:06, Andrei Mikhailovsky wrote:

> 
> Vlad, is there a howto somewhere describing the steps on how to setup iscsi 
> multipathing over ceph? It looks like a good alternative to nfs
> 
> Thanks
> 
> From: "Vlad Gorbunov" mailto:vadi...@gmail.com)>
> To: "Andrei Mikhailovsky" mailto:and...@arhont.com)>
> Cc: ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> Sent: Wednesday, 7 May, 2014 12:02:09 PM
> Subject: Re: [ceph-users] NFS over CEPH - best practice
> 
> For XenServer or VMware is better to use iscsi client to tgtd with ceph 
> support. You can install tgtd on osd or monitor server and use multipath for 
> failover.
> 
> On Wed, May 7, 2014 at 9:47 PM, Andrei Mikhailovsky  (mailto:and...@arhont.com)> wrote:
> > Hello guys,
> > 
> > I would like to offer NFS service to the XenServer and VMWare hypervisors 
> > for storing vm images. I am currently running ceph rbd with kvm, which is 
> > working reasonably well.
> > 
> > What would be the best way of running NFS services over CEPH, so that the 
> > XenServer and VMWare's vm disk images are stored in ceph storage over NFS?
> > 
> > Many thanks
> > 
> > Andrei 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Andrei Mikhailovsky


Vlad, is there a howto somewhere describing the steps on how to setup iscsi 
multipathing over ceph? It looks like a good alternative to nfs 

Thanks 

- Original Message -

From: "Vlad Gorbunov"  
To: "Andrei Mikhailovsky"  
Cc: ceph-users@lists.ceph.com 
Sent: Wednesday, 7 May, 2014 12:02:09 PM 
Subject: Re: [ceph-users] NFS over CEPH - best practice 

For XenServer or VMware is better to use iscsi client to tgtd with ceph 
support. You can install tgtd on osd or monitor server and use multipath for 
failover. 



On Wed, May 7, 2014 at 9:47 PM, Andrei Mikhailovsky < and...@arhont.com > 
wrote: 



Hello guys, 

I would like to offer NFS service to the XenServer and VMWare hypervisors for 
storing vm images. I am currently running ceph rbd with kvm, which is working 
reasonably well. 

What would be the best way of running NFS services over CEPH, so that the 
XenServer and VMWare's vm disk images are stored in ceph storage over NFS? 

Many thanks 

Andrei 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cache tiering

2014-05-07 Thread Gandalf Corvotempesta
Very simple question: what happens if the server bound to the cache pool goes down?
For example, a read-only cache could be achieved by using a single
server with no redundancy.
Is ceph smart enough to detect that the cache is unavailable and
transparently redirect all requests to the main pool as usual?

This would allow the usage of one very big server as cache-only. No need for
redundancy.

Second question: how can I set the cache pool to be on a defined list of OSDs
and not distributed across all OSDs? Should I change the crushmap? I
would like to set a single server (and all of its OSDs) for the cache
pool.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Cedric Lemarchand
I am surprised that CephFS isn't proposed as an option, given that it
removes the non-negligible block storage layer from the picture. I
always feel uncomfortable stacking storage technologies or file systems
(here NFS over XFS over iSCSI over RBD over RADOS) and try to stay as
close as possible to the "KISS" way.

Is it because CephFS is still considered unstable, or because MDS
doesn't yet support HA? (planned for Giant if I remember correctly).

Le 07/05/2014 12:15, Wido den Hollander a écrit :
> On 05/07/2014 11:46 AM, Andrei Mikhailovsky wrote:
>> Hello guys,
>>
>> I would like to offer NFS service to the XenServer and VMWare
>> hypervisors for storing vm images. I am currently running ceph rbd with
>> kvm, which is working reasonably well.
>>
>> What would be the best way of running NFS services over CEPH, so that
>> the XenServer and VMWare's vm disk images are stored in ceph storage
>> over NFS?
>>
>
> Use kernel RBD, put XFS on it an re-export that with NFS? Would that
> be something that works?
>
> I'd however suggest that you use a recent kernel so that you have a
> new version of krbd. For example Ubuntu 14.04 LTS.
>
>> Many thanks
>>
>> Andrei
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>

-- 
Cédric

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Vlad Gorbunov
For XenServer or VMware it is better to use an iSCSI client connecting to tgtd
with ceph support. You can install tgtd on an osd or monitor server and use
multipath for failover.

On Wed, May 7, 2014 at 9:47 PM, Andrei Mikhailovsky 
wrote:

> Hello guys, 
> I would like to offer NFS service to the XenServer and VMWare hypervisors for 
> storing vm images. I am currently running ceph rbd with kvm, which is working 
> reasonably well. 
> What would be the best way of running NFS services over CEPH, so that the 
> XenServer and VMWare's vm disk images are stored in ceph storage over NFS? 
> Many thanks 
> Andrei ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Andrija Panic
Mapping an RBD image to 2 or more servers is the same as having a shared storage
device (SAN) - so from there on, you could do any clustering you want,
based on what Wido said...



On 7 May 2014 12:43, Andrei Mikhailovsky  wrote:

>
> Wido, would this work if I were to run nfs over two or more servers with
> virtual IP?
>
> I can see what you've suggested working in a one server setup. What about
> if you want to have two nfs servers in an active/backup or active/active
> setup?
>
> Thanks
>
> Andrei
>
>
> --
> *From: *"Wido den Hollander" 
> *To: *ceph-users@lists.ceph.com
> *Sent: *Wednesday, 7 May, 2014 11:15:39 AM
> *Subject: *Re: [ceph-users] NFS over CEPH - best practice
>
> On 05/07/2014 11:46 AM, Andrei Mikhailovsky wrote:
> > Hello guys,
> >
> > I would like to offer NFS service to the XenServer and VMWare
> > hypervisors for storing vm images. I am currently running ceph rbd with
> > kvm, which is working reasonably well.
> >
> > What would be the best way of running NFS services over CEPH, so that
> > the XenServer and VMWare's vm disk images are stored in ceph storage
> > over NFS?
> >
>
> Use kernel RBD, put XFS on it an re-export that with NFS? Would that be
> something that works?
>
> I'd however suggest that you use a recent kernel so that you have a new
> version of krbd. For example Ubuntu 14.04 LTS.
>
> > Many thanks
> >
> > Andrei
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Andrija Panić
--
  http://admintweets.com
--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Andrei Mikhailovsky


Wido, would this work if I were to run nfs over two or more servers with 
virtual IP? 

I can see what you've suggested working in a one server setup. What about if 
you want to have two nfs servers in an active/backup or active/active setup? 

Thanks 

Andrei 


- Original Message -

From: "Wido den Hollander"  
To: ceph-users@lists.ceph.com 
Sent: Wednesday, 7 May, 2014 11:15:39 AM 
Subject: Re: [ceph-users] NFS over CEPH - best practice 

On 05/07/2014 11:46 AM, Andrei Mikhailovsky wrote: 
> Hello guys, 
> 
> I would like to offer NFS service to the XenServer and VMWare 
> hypervisors for storing vm images. I am currently running ceph rbd with 
> kvm, which is working reasonably well. 
> 
> What would be the best way of running NFS services over CEPH, so that 
> the XenServer and VMWare's vm disk images are stored in ceph storage 
> over NFS? 
> 

Use kernel RBD, put XFS on it an re-export that with NFS? Would that be 
something that works? 

I'd however suggest that you use a recent kernel so that you have a new 
version of krbd. For example Ubuntu 14.04 LTS. 

> Many thanks 
> 
> Andrei 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 


-- 
Wido den Hollander 
42on B.V. 
Ceph trainer and consultant 

Phone: +31 (0)20 700 9902 
Skype: contact42on 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] advice with hardware configuration

2014-05-07 Thread Christian Balzer
On Wed, 07 May 2014 11:01:33 +0200 Xabier Elkano wrote:

> El 06/05/14 18:40, Christian Balzer escribió:
> > Hello,
> >
> > On Tue, 06 May 2014 17:07:33 +0200 Xabier Elkano wrote:
> >
> >> Hi,
> >>
> >> I'm designing a new ceph pool with new hardware and I would like to
> >> receive some suggestion.
> >> I want to use a replica count of 3 in the pool and the idea is to buy
> >> 3 new servers with a 10-drive 2,5" chassis each and 2 10Gbps nics. I
> >> have in mind two configurations:
> >>
> > As Wido said, more nodes are usually better, unless you're quite aware
> > of what you're doing and why.
> Yes, I know that, but what is the minimum number of nodes to start with?
> Is starting with three nodes not a feasible option?

I've started a cluster with 2 nodes and feel that in my case it is a very
feasible option, as the OSDs are really RAIDs and thus will basically never
fail and the IO load will still be manageable by one surviving storage
node.

You need to fully understand what happens when you lose one node (and
when it comes back) and whether the consequences are acceptable to you.

That same cluster I've built with 2 high-density nodes would have been 7
lower-density nodes if done the "Ceph" way.

> >  
> >> 1- With journal in SSDs
> >>  
> >> OS: 2xSSD intel SC3500 100G Raid 1
> >> Journal: 2xSSD intel SC3700 100G, 3 journal for each SSD
> > As I wrote just a moment ago, use at least the 200GB ones if
> > performance is such an issue for you.
> > If you can afford it, use 4 3700s and share OS and journal, the OS IOPS
> > will not be that significant, especially if you're using a writeback
> > cache controller. 
> the journal can be shared with the OS, but I like the RAID 1 for the OS.
> I think that the only drawback with it is that I am using two dedicated
> disk slots for OS.

Use software RAID 1 (or 10) on part of the 4 SSDs and put 1-2 journals on
each SSD. Or use 3 SSDs with 7 HDDs and use 2-3 journals on each SSD.
Either way you will have better performance and less impact if an SSD
should fail than with your original design.
A case with 12 drive bays would result in a perfectly equal load
distribution (4 SSDs, 8 HDDs).
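
A hypothetical layout for that suggestion, assuming 4x DC S3700 visible as sda..sdd,
with a small first partition on each for a software RAID10 OS volume and the rest
left as raw journal partitions:

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
# /dev/sd[a-d]2, /dev/sd[a-d]3, ... stay available for 1-2 journals per SSD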

As an aside, 3500's are pretty much overkill for OS only, 530s should do
fine.

> >
> >> OSD: 6 SAS10K 900G (SAS2 6Gbps), each running an OSD process. Total
> >> size for OSDs: 5,4TB
> >>
> >> 2- With journal in a partition in the spinners.
> >>
> >> OS: 2xSSD intel SC3500 100G Raid 1
> >> OSD+journal: 8 SAS15K 600G (SAS3 12Gbps), each runing an OSD process
> >> and its journal. Total size for OSDs: 3,6TB
> >>
> > I have no idea why anybody would spend money on 12Gb/s HDDs when even
> > most SSDs have trouble saturating a 6Gb/s link.
> > Given the double write penalty in IOPS, I think you're going to find
> > this more expensive (per byte) and slower than a well rounded option 1.
> But these disks are 2,5" 15K, not only for the link. Other SAS 2,5"
> (SAS2) disks I found are only 10K. The 15K disks should be better for
> random IOPS.
Interesting, I would have thought 15K drives would be available, but all
my spinners are basically consumer stuff. ^o^
Either way, you are wasting the link speed and controller price for a 1/3
increase in IOPS while the double write impact will make the resulting
IOPS per HDD lower than your option 1.

> >
> >> The budget in both configuration is similar, but the total capacity
> >> not. What would be the best configuration from the point of view of
> >> performance? In the second configuration I know the controller write
> >> back cache could be very critical, the servers has a LSI 3108
> >> controller with 2GB Cache. I have to plan this storage as a KVM image
> >> backend and the goal is the performance over the capacity.
> >>
> > Writeback cache can be very helpful, however it is not a miracle cure.
> > Not knowing your actual load and I/O patterns it might very well be
> > enough, though.
> The IO patterns are a bit unknown, I should assume 40% read and 60%
> write, but the IO size is unknown, because the storage is for KVM images
> and the VMs are for many customers and different purposes.

Ah, general purpose KVM. So you might get lucky or totally insane
customers.
Definitely optimize for speed (as in IOPS), monitor things constantly.
Be ready to upgrade your cluster at a moment's notice, because once you
reach a threshold it is all downhill from there.


Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Wido den Hollander

On 05/07/2014 11:46 AM, Andrei Mikhailovsky wrote:

Hello guys,

I would like to offer NFS service to the XenServer and VMWare
hypervisors for storing vm images. I am currently running ceph rbd with
kvm, which is working reasonably well.

What would be the best way of running NFS services over CEPH, so that
the XenServer and VMWare's vm disk images are stored in ceph storage
over NFS?



Use kernel RBD, put XFS on it and re-export that with NFS? Would that be 
something that works?


I'd however suggest that you use a recent kernel so that you have a new 
version of krbd. For example Ubuntu 14.04 LTS.
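
A minimal sketch of that idea (image, pool, mount point and export options are
assumptions):

rbd create rbd/nfs01 --size 1048576        # 1 TB image; --size is in MB here
rbd map rbd/nfs01
mkfs.xfs /dev/rbd/rbd/nfs01                # path may be /dev/rbd0 depending on udev rules
mkdir -p /srv/nfs01 && mount /dev/rbd/rbd/nfs01 /srv/nfs01
echo '/srv/nfs01 *(rw,no_subtree_check)' >> /etc/exports
exportfs -ra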



Many thanks

Andrei


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Andrei Mikhailovsky
Hello guys, 

I would like to offer NFS service to the XenServer and VMWare hypervisors for 
storing vm images. I am currently running ceph rbd with kvm, which is working 
reasonably well. 

What would be the best way of running NFS services over CEPH, so that the 
XenServer and VMWare's vm disk images are stored in ceph storage over NFS? 

Many thanks 

Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bulk storage use case

2014-05-07 Thread Cedric Lemarchand
Hello,

This build is only intended for archiving purposes; what matters here is
lowering the $/TB/W ratio.
Access to the storage would be via radosgw, installed on each node. I
need each node to sustain an average 1Gb/s write rate, which I think
should not be a problem. Erasure coding will be used with
something like k=12 m=3.
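
For reference, a firefly-era sketch of such a profile and pool (profile/pool names
and PG count are assumptions; note that k=12 m=3 with a host failure domain needs at
least 15 hosts):

ceph osd erasure-code-profile set archive k=12 m=3 ruleset-failure-domain=host
ceph osd pool create archivepool 4096 4096 erasure archive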

A typical node would be:

- Supermicro 36-bay chassis
- 2x Xeon E5-2630Lv2
- 96GB RAM (the recommended 1GB/TB per OSD ratio is lowered a bit ... )
- LSI HBA adapters, JBOD mode, could be 2x 9207-8i
- 36x 4TB HDD with default journal config
- dedicated bonded 2Gb links for public/private networks (backfilling
will take ages if a full node is lost ...)


I think in an *optimal* state (ceph healthy) it could handle the job;
waiting for your comments.

What bothers me more are OSD maintenance operations like backfilling
and cluster rebalancing, where nodes will be put under very high
IO/memory/CPU load for hours or days. Will the latency *just* grow, or
will everything fly away? (OOM killer spawning, OSDs committing suicide
because of latency, nodes pushed out of the cluster, etc ...)

As you understand, I am trying to design the cluster with a sweet spot
in mind like "things become slow, latency grows, but the nodes stay
stable/usable and aren't pushed out of the cluster".

This is my first jump into Ceph, so any input will be greatly
appreciated ;-)

Cheers,

--
Cédric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Change size journal's blocks from 4k to another.

2014-05-07 Thread Mike
Hello.

In my Ceph installation I use an SSD drive for the journal with direct
access to a block device.

When an OSD starts, I see the following string in the log file:
...
1 journal _open /dev/sda1 fd 22: 19327352832 bytes, block size 4096
bytes, directio = 1, aio = 1
...

How can I change the block size from 4k to 512k? My SSD shows better
performance with large blocks:
* With 4K (sdr6 - source, sda8 - target)

dd if=/mnt/from/random of=/mnt/sda8/random bs=4k oflag=direct,dsync

iostat show me the statistic:
iostat -cdm 1 /dev/sda /dev/sdr

Device:tpsMB_read/sMB_wrtn/sMB_readMB_wrtn
sda   16198.00 0.00   126.54  0126
sdr 126.0015.75 0.00 15  0


* With 512K (sdr6 - source, sda8 - target)
Sync: sync
Clear cache: echo 1 > /proc/sys/vm/drop_caches
Clear cache LSI controller: megacli -AdpCacheFlush -a0

dd if=/mnt/from/random of=/mnt/sda8/random bs=512k oflag=direct,dsync

iostat show me the statistic:
iostat -cdm 1 /dev/sda /dev/sdr

Device:tpsMB_read/sMB_wrtn/sMB_readMB_wrtn
sda3021.00 0.01   318.00  0318
sdr2410.00   301.25 0.00301  0

I think my cluster has a bottleneck in the journal block size. How can I
increase the block size used for the journal?
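
Not an answer to the block-size question, but a measurement sketch that mirrors the dd
test with fio, writing a file on the same target filesystem (file path, size and
runtime are assumptions):

fio --name=journal-4k   --filename=/mnt/sda8/fio-test --size=4g --direct=1 --sync=1 --rw=write --bs=4k   --runtime=30 --time_based
fio --name=journal-512k --filename=/mnt/sda8/fio-test --size=4g --direct=1 --sync=1 --rw=write --bs=512k --runtime=30 --time_based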

--
Best regards, Mike.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] advice with hardware configuration

2014-05-07 Thread Xabier Elkano
El 06/05/14 18:40, Christian Balzer escribió:
> Hello,
>
> On Tue, 06 May 2014 17:07:33 +0200 Xabier Elkano wrote:
>
>> Hi,
>>
>> I'm designing a new ceph pool with new hardware and I would like to
>> receive some suggestion.
>> I want to use a replica count of 3 in the pool and the idea is to buy 3
>> new servers with a 10-drive 2,5" chassis each and 2 10Gbps nics. I have
>> in mind two configurations:
>>
> As Wido said, more nodes are usually better, unless you're quite aware of
> what you're doing and why.
Yes, I know that, but what is the minimum number of nodes to start with?
Is starting with three nodes not a feasible option?
>  
>> 1- With journal in SSDs
>>  
>> OS: 2xSSD intel SC3500 100G Raid 1
>> Journal: 2xSSD intel SC3700 100G, 3 journal for each SSD
> As I wrote just a moment ago, use at least the 200GB ones if performance
> is such an issue for you.
> If you can afford it, use 4 3700s and share OS and journal, the OS IOPS
> will not be that significant, especially if you're using a writeback cache
> controller. 
the journal can be shared with the OS, but I like the RAID 1 for the OS.
I think that the only drawback with it is that I am using two dedicated
disk slots for OS.
>
>> OSD: 6 SAS10K 900G (SAS2 6Gbps), each running an OSD process. Total size
>> for OSDs: 5,4TB
>>
>> 2- With journal in a partition in the spinners.
>>
>> OS: 2xSSD intel SC3500 100G Raid 1
>> OSD+journal: 8 SAS15K 600G (SAS3 12Gbps), each runing an OSD process and
>> its journal. Total size for OSDs: 3,6TB
>>
> I have no idea why anybody would spend money on 12Gb/s HDDs when even
> most SSDs have trouble saturating a 6Gb/s link.
> Given the double write penalty in IOPS, I think you're going to find
> this more expensive (per byte) and slower than a well rounded option 1.
But these disks are 2,5" 15K, not only for the link. Other SAS 2,5"
(SAS2) disks I found are only 10K. The 15K disks should be better for
random IOPS.
>
>> The budget in both configuration is similar, but the total capacity not.
>> What would be the best configuration from the point of view of
>> performance? In the second configuration I know the controller write
>> back cache could be very critical, the servers has a LSI 3108 controller
>> with 2GB Cache. I have to plan this storage as a KVM image backend and
>> the goal is the performance over the capacity.
>>
> Writeback cache can be very helpful, however it is not a miracle cure.
> Not knowing your actual load and I/O patterns it might very well be
> enough, though.
The IO patterns are a bit unknown, I should assume 40% read and 60%
write, but the IO size is unknown, because the storage is for KVM images
and the VMs are for many customers and different purposes.
>
> Regards,
>
> Christian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] advice with hardware configuration

2014-05-07 Thread Xabier Elkano
El 06/05/14 19:38, Sergey Malinin escribió:
> If you plan to scale up in the future you could consider the following config 
> to start with:
>
> Pool size=2
> 3 x servers with OS+journal on 1 ssd, 3 journal ssds, 4 x 900 gb data disks.
> It will get you 5+ TB capacity and you will be able to increase pool size to 
> 3 at some point in time.
Thanks for your response. Do you mean 1 SSD for OS and 3 journal SSDs + 4 SAS
900G + 5 free slots? I had in mind the OS in RAID 1, but with 2 cheap
Intel 3500 SSDs. The OS disks are SSDs, but not for gaining
performance; they are only 100G and they are cheap. I thought that an OS
failure could be worse than a journal or a single OSD failure, because
the recovery time to restore the OS could be higher.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Explicit F2FS support (was: v0.80 Firefly released)

2014-05-07 Thread Andrey Korolyov
Hello,

first of all, congratulations to Inktank and thank you for your awesome work!

Although exploiting native f2fs abilities, as with btrfs, sounds
awesome performance-wise, I wonder: if the kv backend can in practice
give users of 'legacy' file systems CoW operations as fast as on a
log-based fs, with little or no performance impact, what is the primary
idea behind introducing an interface bound to a specific filesystem at
the same time? Of course I believe that f2fs will outperform almost
every competitor in its field - non-rotating media operations - but I
would be grateful if someone could shed light on this development choice.

On Wed, May 7, 2014 at 5:05 AM, Sage Weil  wrote:
> We did it!  Firefly v0.80 is built and pushed out to the ceph.com
> repositories.
>
> This release will form the basis for our long-term supported release
> Firefly, v0.80.x.  The big new features are support for erasure coding
> and cache tiering, although a broad range of other features, fixes,
> and improvements have been made across the code base.  Highlights include:
>
> * *Erasure coding*: support for a broad range of erasure codes for lower
>   storage overhead and better data durability.
> * *Cache tiering*: support for creating 'cache pools' that store hot,
>   recently accessed objects with automatic demotion of colder data to
>   a base tier.  Typically the cache pool is backed by faster storage
>   devices like SSDs.
> * *Primary affinity*: Ceph now has the ability to skew selection of
>   OSDs as the "primary" copy, which allows the read workload to be
>   cheaply skewed away from parts of the cluster without migrating any
>   data.
> * *Key/value OSD backend* (experimental): An alternative storage backend
>   for Ceph OSD processes that puts all data in a key/value database like
>   leveldb.  This provides better performance for workloads dominated by
>   key/value operations (like radosgw bucket indices).
> * *Standalone radosgw* (experimental): The radosgw process can now run
>   in a standalone mode without an apache (or similar) web server or
>   fastcgi.  This simplifies deployment and can improve performance.
>
> We expect to maintain a series of stable releases based on v0.80
> Firefly for as much as a year.  In the meantime, development of Ceph
> continues with the next release, Giant, which will feature work on the
> CephFS distributed file system, more alternative storage backends
> (like RocksDB and f2fs), RDMA support, support for pyramid erasure
> codes, and additional functionality in the block device (RBD) like
> copy-on-read and multisite mirroring.
>
> This release is the culmination of a huge collective effort by about 100
> different contributors.  Thank you everyone who has helped to make this
> possible!
>
> Upgrade Sequencing
> --
>
> * If your existing cluster is running a version older than v0.67
>   Dumpling, please first upgrade to the latest Dumpling release before
>   upgrading to v0.80 Firefly.  Please refer to the :ref:`Dumpling upgrade`
>   documentation.
>
> * Upgrade daemons in the following order:
>
> 1. Monitors
> 2. OSDs
> 3. MDSs and/or radosgw
>
>   If the ceph-mds daemon is restarted first, it will wait until all
>   OSDs have been upgraded before finishing its startup sequence.  If
>   the ceph-mon daemons are not restarted prior to the ceph-osd
>   daemons, they will not correctly register their new capabilities
>   with the cluster and new features may not be usable until they are
>   restarted a second time.
>
> * Upgrade radosgw daemons together.  There is a subtle change in behavior
>   for multipart uploads that prevents a multipart request that was initiated
>   with a new radosgw from being completed by an old radosgw.
>
> Notable changes since v0.79
> ---
>
> * ceph-fuse, libcephfs: fix several caching bugs (Yan, Zheng)
> * ceph-fuse: trim inodes in response to mds memory pressure (Yan, Zheng)
> * librados: fix inconsistencies in API error values (David Zafman)
> * librados: fix watch operations with cache pools (Sage Weil)
> * librados: new snap rollback operation (David Zafman)
> * mds: fix respawn (John Spray)
> * mds: misc bugs (Yan, Zheng)
> * mds: misc multi-mds fixes (Yan, Zheng)
> * mds: use shared_ptr for requests (Greg Farnum)
> * mon: fix peer feature checks (Sage Weil)
> * mon: require 'x' mon caps for auth operations (Joao Luis)
> * mon: shutdown when removed from mon cluster (Joao Luis)
> * msgr: fix locking bug in authentication (Josh Durgin)
> * osd: fix bug in journal replay/restart (Sage Weil)
> * osd: many many many bug fixes with cache tiering (Samuel Just)
> * osd: track omap and hit_set objects in pg stats (Samuel Just)
> * osd: warn if agent cannot enable due to invalid (post-split) stats (Sage 
> Weil)
> * rados bench: track metadata for multiple runs separately (Guang Yang)
> * rgw: fixed subuser modify (Yehuda Sadeh)
> * rpm: fix redhat-lsb dependency (Sage Weil, Alfredo Deza)

Re: [ceph-users] advice with hardware configuration

2014-05-07 Thread Xabier Elkano
El 06/05/14 19:31, Cedric Lemarchand escribió:
> Le 06/05/2014 17:07, Xabier Elkano a écrit :
>> the goal is the performance over the capacity.
> I am sure you already consider the "full SSD" option, did you ?
>
Yes, I considered the full-SSD option, but it is very expensive. Using the
Intel 520 series, each disk costs double what a SAS equivalent does.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to install CEPH on CentOS 6.3

2014-05-07 Thread Ease Lu
Hi All,
 Following the CEPH online documentation, I tried to install CEPH on
CentOS 6.3:

 The step: ADD CEPH
  I cannot find a CentOS distro, so I used el6. When I reached the "INSTALL
VIRTUALIZATION FOR BLOCK DEVICE" step, I got:

Error: Package: 2:qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
   Requires: libusbredirparser.so.1()(64bit)
Error: Package: 2:qemu-img-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
   Requires: libusbredirparser.so.1()(64bit)
Error: Package: 2:qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
   Requires: libspice-server.so.1(SPICE_SERVER_0.11.2)(64bit)
Error: Package: 2:qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
   Requires: libspice-server.so.1(SPICE_SERVER_0.12.4)(64bit)
Error: Package: 2:qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)
   Requires: seabios >= 0.6.1.2-20.el6
   Available: seabios-0.6.1.2-19.el6.x86_64 (qa_os_centos6.3_64)
   seabios = 0.6.1.2-19.el6
Error: Package: 2:qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 (ceph-extras)


 Would you please tell me how to resolve the issue?
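
A diagnostic sketch that may help find where the missing libraries are supposed to
come from (assuming yum's file-provides lookup works against your configured repos;
these packages typically come from a newer CentOS base release or EPEL):

yum provides '*/libusbredirparser.so.1'
yum provides '*/libspice-server.so.1'
yum provides seabios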

Best Regards,
Jack
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete pool .rgw.bucket and objects within it

2014-05-07 Thread Irek Fasikhov
Yes, it will delete all the objects stored in the pool.
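
A quick before/after sanity check that the objects and space really go away (nothing
assumed beyond the pool name already in the thread):

rados df
ceph df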


2014-05-07 6:58 GMT+04:00 Thanh Tran :

> Hi,
>
> If i use command "ceph osd pool delete .rgw.bucket .rgw.bucket
> --yes-i-really-really-mean-it" to delete the pool .rgw.bucket, will this
> delete the pool, its objects and clean the data on osds?
>
> Best regards,
> Thanh Tran
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Best regards, Irek Fasikhov
Mob.: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com