Re: [ceph-users] rbd cache + libvirt

2015-06-09 Thread Andrey Korolyov
On Tue, Jun 9, 2015 at 7:59 AM, Alexandre DERUMIER aderum...@odiso.com wrote:
 host conf : rbd_cache=true   : guest cache=none  : result : cache (wrong)


Thanks Alexandre, so you are confirming that this exact case misbehaves?


Re: [ceph-users] rbd cache + libvirt

2015-06-09 Thread Alexandre DERUMIER
Thanks Alexandre, so you are confirming that this exact case misbehaves?

The rbd_cache value from ceph.conf always overrides the cache value from qemu.

My personal opinion is that this is wrong: the qemu value should override the
ceph.conf value.

I also don't know what happens during a live migration, for example, if rbd_cache
in ceph.conf differs between the source and target host.
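
For illustration, the conflicting combination being discussed is a host-side
ceph.conf such as

  [client]
  rbd cache = true

together with a libvirt disk definition that asks for no caching, e.g.

  <disk type='network' device='disk'>
    <driver name='qemu' type='raw' cache='none'/>
    <source protocol='rbd' name='rbd/vm-disk-1'/>
    <target dev='vda' bus='virtio'/>
  </disk>

(The pool, image and device names above are placeholders, not taken from the
tests in this thread; monitor addresses and cephx auth elements are omitted.)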


- Mail original -
De: Andrey Korolyov and...@xdel.ru
À: aderumier aderum...@odiso.com
Cc: Jason Dillaman dilla...@redhat.com, ceph-users 
ceph-users@lists.ceph.com
Envoyé: Mardi 9 Juin 2015 10:36:09
Objet: Re: [ceph-users] rbd cache + libvirt

On Tue, Jun 9, 2015 at 7:59 AM, Alexandre DERUMIER aderum...@odiso.com wrote: 
 host conf : rbd_cache=true : guest cache=none : result : cache (wrong) 


Thanks Alexandre, so you are confirming that this exact case misbehaves? 


Re: [ceph-users] rbd cache + libvirt

2015-06-09 Thread Daniel Swarbrick
I presume that since QEMU 1.2+ sets the default cache mode to writeback
if not otherwise specified, and since giant sets rbd_cache to true if
not otherwise specified, then the result should be to cache?

We have a fair number of VMs running on hosts where we don't specify
either explicitly, and I've always had the feeling that it's _not_
caching... saving a small text file in the VM (e.g. with vim) always
seems to take much longer than it should - but I wonder if that's
because vim is doing an fsync().

If I understand the QEMU docs correctly, cache=unsafe would immediately
ack the guest's fsync() - at the risk of data loss if the QEMU process
crashes.
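
One way to check what librbd actually ends up using - assuming an admin socket
is enabled for the qemu client in ceph.conf, for example

  [client]
  admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

- is to query the running client and look at the effective value:

  ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config show | grep rbd_cache

(The socket path here is only an example; the real one depends on the admin
socket setting and the qemu process.)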

On 09/06/15 06:59, Alexandre DERUMIER wrote:
 oops, sorry, my bad, I had wrong settings when testing.
 
 you are right, remove rbd_cache from ceph.conf is enough to remove overloading
 
 
 
 host conf : no value : guest cache=writeback : result : cache 
 host conf : rbd_cache=false  : guest cache=writeback : result : nocache 
 (wrong) 
 host conf : rbd_cache=true   : guest cache=writeback : result : cache 
 host conf : no value : guest cache=none  : result : nocache
 host conf : rbd_cache=false  : guest cache=none  : result : no cache 
 host conf : rbd_cache=true   : guest cache=none  : result : cache (wrong) 
 
 
 - Mail original -
 De: aderumier aderum...@odiso.com
 À: Jason Dillaman dilla...@redhat.com
 Cc: ceph-users ceph-users@lists.ceph.com
 Envoyé: Mardi 9 Juin 2015 06:33:49
 Objet: Re: [ceph-users] rbd cache + libvirt
 
 previous matrix was with ceph < giant


 with ceph >= giant, rbd_cache=true by default, so cache=none is not working if a
 ceph.conf exists.
 
 
 host conf : no value : guest cache=writeback : result : cache 
 host conf : rbd_cache=false : guest cache=writeback : result : nocache 
 (wrong) 
 host conf : rbd_cache=true : guest cache=writeback : result : cache 
 host conf : no value : guest cache=none : result : cache (wrong) 
 host conf : rbd_cache=false : guest cache=none : result : no cache 
 host conf : rbd_cache=true : guest cache=none : result : cache (wrong) 
 
 
 - Mail original - 
 De: aderumier aderum...@odiso.com 
 À: Jason Dillaman dilla...@redhat.com 
 Cc: ceph-users ceph-users@lists.ceph.com 
 Envoyé: Mardi 9 Juin 2015 06:23:06 
 Objet: Re: [ceph-users] rbd cache + libvirt 
 
 In the short-term, you can remove the rbd cache setting from your 
 ceph.conf 
 
 That's not true, you need to remove the ceph.conf file. 
 Removing rbd_cache is not enough or default rbd_cache=false will apply. 
 
 
 I have done tests, here the result matrix 
 
 
 host ceph.conf : no rbd_cache : guest cache=writeback : result : nocache 
 (wrong) 
 host ceph.conf : rbd_cache=false : guest cache=writeback : result : nocache 
 (wrong) 
 host ceph.conf : rbd_cache=true : guest cache=writeback : result : cache 
 host ceph.conf : no rbd_cache : guest cache=none : result : nocache 
 host ceph.conf : rbd_cache=false : guest cache=none : result : no cache 
 host ceph.conf : rbd_cache=true : guest cache=none : result : cache (wrong) 
 
 
 
 - Mail original - 
 De: Jason Dillaman dilla...@redhat.com 
 À: Andrey Korolyov and...@xdel.ru 
 Cc: Josh Durgin jdur...@redhat.com, aderumier aderum...@odiso.com, 
 ceph-users ceph-users@lists.ceph.com 
 Envoyé: Lundi 8 Juin 2015 22:29:10 
 Objet: Re: [ceph-users] rbd cache + libvirt 
 
 On Mon, Jun 8, 2015 at 10:43 PM, Josh Durgin jdur...@redhat.com wrote: 
 On 06/08/2015 11:19 AM, Alexandre DERUMIER wrote: 

 Hi, 

 looking at the latest version of QEMU, 


 It's seem that it's was already this behaviour since the add of rbd_cache 
 parsing in rbd.c by josh in 2012 


 http://git.qemu.org/?p=qemu.git;a=blobdiff;f=block/rbd.c;h=eebc3344620058322bb53ba8376af4a82388d277;hp=1280d66d3ca73e552642d7a60743a0e2ce05f664;hb=b11f38fcdf837c6ba1d4287b1c685eb3ae5351a8;hpb=166acf546f476d3594a1c1746dc265f1984c5c85
  


 I'll do tests on my side tomorrow to be sure. 


 It seems like we should switch the order so ceph.conf is overridden by 
 qemu's cache settings. I don't remember a good reason to have it the 
 other way around. 

 Josh 


 Erm, doesn`t this code *already* represent the right priorities? 
 Cache=none setting should set a BDRV_O_NOCACHE which is effectively 
 disabling cache in a mentioned snippet. 

 
 Yes, the override is applied (correctly) based upon your QEMU cache settings. 
 However, it then reads your configuration file and re-applies the rbd_cache 
 setting based upon what is in the file (if it exists). So in the case where a 
 configuration file has rbd cache = true, the override of rbd cache = 
 false derived from your QEMU cache setting would get wiped out. The long 
 term solution would be to, as Josh noted, switch the order (so long as there 
 wasn't a use-case for applying values in this order). In the short-term, you 
 can remove the rbd cache setting from your ceph.conf so that QEMU controls 
 it (i.e. it cannot get overridden when reading the 

Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread Alexandre DERUMIER
It seems that the limit mainly shows up at high queue depths (roughly > 16).

Here are the results in iops with 1 client - 4k randread - 3 osd - with different
queue depth sizes.
rbd_cache is almost the same as without cache up to queue depth 16.


cache
-
qd1: 1651
qd2: 3482
qd4: 7958
qd8: 17912
qd16: 36020
qd32: 42765
qd64: 46169

no cache

qd1: 1748
qd2: 3570
qd4: 8356
qd8: 17732
qd16: 41396
qd32: 78633
qd64: 79063
qd128: 79550


- Mail original -
De: aderumier aderum...@odiso.com
À: pushpesh sharma pushpesh@gmail.com
Cc: ceph-devel ceph-de...@vger.kernel.org, ceph-users 
ceph-users@lists.ceph.com
Envoyé: Mardi 9 Juin 2015 09:28:21
Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi, 

 We tried adding more RBDs to single VM, but no luck. 

If you want to scale with more disks in a single qemu vm, you need to use 
iothread feature from qemu and assign 1 iothread by disk (works with 
virtio-blk). 
It's working for me, I can scale with adding more disks. 


My bench here are done with fio-rbd on host. 
I can scale up to 400k iops with 10clients-rbd_cache=off on a single host and 
around 250kiops 10clients-rbdcache=on. 


I just wonder why I don't have performance decrease around 30k iops with 1osd. 

I'm going to see if this tracker 
http://tracker.ceph.com/issues/11056 

could be the cause. 

(My master build was done some week ago) 



- Mail original - 
De: pushpesh sharma pushpesh@gmail.com 
À: aderumier aderum...@odiso.com 
Cc: ceph-devel ceph-de...@vger.kernel.org, ceph-users 
ceph-users@lists.ceph.com 
Envoyé: Mardi 9 Juin 2015 09:21:04 
Objet: Re: rbd_cache, limiting read on high iops around 40k 

Hi Alexandre, 

We have also seen something very similar on Hammer(0.94-1). We were doing some 
benchmarking for VMs hosted on hypervisor (QEMU-KVM, openstack-juno). Each 
Ubuntu-VM has a RBD as root disk, and 1 RBD as additional storage. For some 
strange reason it was not able to scale 4K- RR iops on each VM beyond 35-40k. 
We tried adding more RBDs to single VM, but no luck. However increasing number 
of VMs to 4 on a single hypervisor did scale to some extent. After this there 
was no much benefit we got from adding more VMs. 

Here is the trend we have seen, x-axis is number of hypervisor, each hypervisor 
has 4 VM, each VM has 1 RBD:- 




VDbench is used as benchmarking tool. We were not saturating network and CPUs 
at OSD nodes. We were not able to saturate CPUs at hypervisors, and that is 
where we were suspecting of some throttling effect. However we haven't setted 
any such limits from nova or kvm end. We tried some CPU pinning and other KVM 
related tuning as well, but no luck. 

We tried the same experiment on a bare metal. It was 4K RR IOPs were scaling 
from 40K(1 RBD) to 180K(4 RBDs). But after that rather than scaling beyond that 
point the numbers were actually degrading. (Single pipe more congestion effect) 

We never suspected that rbd cache enable could be detrimental to performance. 
It would nice to route cause the problem if that is the case. 

On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER  aderum...@odiso.com  
wrote: 


Hi, 

I'm doing benchmark (ceph master branch), with randread 4k qdepth=32, 
and rbd_cache=true seem to limit the iops around 40k 


no cache 
 
1 client - rbd_cache=false - 1osd : 38300 iops 
1 client - rbd_cache=false - 2osd : 69073 iops 
1 client - rbd_cache=false - 3osd : 78292 iops 


cache 
- 
1 client - rbd_cache=true - 1osd : 38100 iops 
1 client - rbd_cache=true - 2osd : 42457 iops 
1 client - rbd_cache=true - 3osd : 45823 iops 



Is it expected ? 



fio result rbd_cache=false 3 osd 
 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, 
iodepth=32 
fio-2.1.11 
Starting 1 process 
rbd engine: RBD version: 0.1.9 
Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 
00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 
2015 
read : io=1MB, bw=313169KB/s, iops=78292, runt= 32698msec 
slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
clat percentiles (usec): 
| 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
| 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
| 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
| 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
| 99.99th=[ 1176] 
bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
lat (msec) : 2=0.03%, 4=0.01% 
cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, =64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% 
complete : 0=0.0%, 4=91.6%, 8=3.4%, 

Re: [ceph-users] rbd format v2 support

2015-06-09 Thread Ilya Dryomov
On Tue, Jun 9, 2015 at 5:52 AM, David Z david.z1...@yahoo.com wrote:
 Hi Ilya,

 Thanks for the reply. I knew that v2 image can be mapped if using default
 striping parameters without --stripe-unit or --stripe-count.

 It is just the rbd performance (IOPS & bandwidth) we tested hasn't met our
 goal. We found at this point OSDs seemed not to be the bottleneck, so we
 want to try fancy striping.

 Do you know if there is an approximate ETA for this feature? Or it would be
 great that you could share some info on tuning rbd performance. Anything
 will be appreciated.

Your config, workload and some numbers would be helpful in getting
others to chime in.  If the bottleneck seems to be the kernel client
try perf to see where the time gets spent.  If it's a ceph-msgr kworker
then increasing the number of OSDs could help.  I doubt lowering stripe
size to 8k would have helped here.
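
For the perf suggestion, something along these lines is usually enough to see
whether the time is going to a ceph-msgr kworker (exact flags may vary with the
distro kernel):

  perf record -a -g -- sleep 30
  perf report --sort comm,dso

or just run `perf top` while the test is in progress.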

Thanks,

Ilya


Re: [ceph-users] rbd cache + libvirt

2015-06-09 Thread Andrey Korolyov
On Tue, Jun 9, 2015 at 11:51 AM, Alexandre DERUMIER aderum...@odiso.com wrote:
Thanks Alexandre, so you are confirming that this exact case misbehaves?

 The rbd_cache value from ceph.conf always override the cache value from qemu.

 My personnal opinion is this is wrong. qemu value should overrive the 
 ceph.conf value.

 I don't known what happen in a live migration for example, if rbd_cache in 
 ceph.conf is different on source and target host ?




Yes, you are right. The destination process in a live migration
behaves as an independently launched copy; it does not inherit those
kinds of parameters from the source emulator.


Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread Alexandre DERUMIER
Frankly, I'm a little impressed that without RBD cache we can hit 80K 
IOPS from 1 VM!

Note that these results are not from a vm (fio-rbd on the host), so in a vm we'll
have extra overhead.
(I'm planning to send qemu results soon.)

How fast are the SSDs in those 3 OSDs? 

These results are with data in the buffer memory of the osd nodes.

When reading fully from ssd (intel s3500),

for 1 client,

I'm around 33k iops without cache and 32k iops with cache, with 1 osd.
I'm around 55k iops without cache and 38k iops with cache, with 3 osd.

With multiple client jobs, I can reach around 70k iops per osd, and 250k iops
per osd when data is in the buffer.

(server/client cpus are 2x 10-core 3.1GHz e5 xeon)



small tip : 
I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20%

LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...

as a lot of time is spent in malloc/free 


(qemu also supports tcmalloc since a few months ago; I'll bench it too:
  https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html)



I'll try to send full bench results soon, from 1 to 18 ssd osd.




- Mail original -
De: Mark Nelson mnel...@redhat.com
À: aderumier aderum...@odiso.com, pushpesh sharma pushpesh@gmail.com
Cc: ceph-devel ceph-de...@vger.kernel.org, ceph-users 
ceph-users@lists.ceph.com
Envoyé: Mardi 9 Juin 2015 13:36:31
Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi All, 

In the past we've hit some performance issues with RBD cache that we've 
fixed, but we've never really tried pushing a single VM beyond 40+K read 
IOPS in testing (or at least I never have). I suspect there's a couple 
of possibilities as to why it might be slower, but perhaps joshd can 
chime in as he's more familiar with what that code looks like. 

Frankly, I'm a little impressed that without RBD cache we can hit 80K 
IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 

Mark 

On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
 It's seem that the limit is mainly going in high queue depth (+-  16) 
 
 Here the result in iops with 1client- 4krandread- 3osd - with differents 
 queue depth size. 
 rbd_cache is almost the same than without cache with queue depth 16 
 
 
 cache 
 - 
 qd1: 1651 
 qd2: 3482 
 qd4: 7958 
 qd8: 17912 
 qd16: 36020 
 qd32: 42765 
 qd64: 46169 
 
 no cache 
  
 qd1: 1748 
 qd2: 3570 
 qd4: 8356 
 qd8: 17732 
 qd16: 41396 
 qd32: 78633 
 qd64: 79063 
 qd128: 79550 
 
 
 - Mail original - 
 De: aderumier aderum...@odiso.com 
 À: pushpesh sharma pushpesh@gmail.com 
 Cc: ceph-devel ceph-de...@vger.kernel.org, ceph-users 
 ceph-users@lists.ceph.com 
 Envoyé: Mardi 9 Juin 2015 09:28:21 
 Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
 
 Hi, 
 
 We tried adding more RBDs to single VM, but no luck. 
 
 If you want to scale with more disks in a single qemu vm, you need to use 
 iothread feature from qemu and assign 1 iothread by disk (works with 
 virtio-blk). 
 It's working for me, I can scale with adding more disks. 
 
 
 My bench here are done with fio-rbd on host. 
 I can scale up to 400k iops with 10clients-rbd_cache=off on a single host and 
 around 250kiops 10clients-rbdcache=on. 
 
 
 I just wonder why I don't have performance decrease around 30k iops with 
 1osd. 
 
 I'm going to see if this tracker 
 http://tracker.ceph.com/issues/11056 
 
 could be the cause. 
 
 (My master build was done some week ago) 
 
 
 
 - Mail original - 
 De: pushpesh sharma pushpesh@gmail.com 
 À: aderumier aderum...@odiso.com 
 Cc: ceph-devel ceph-de...@vger.kernel.org, ceph-users 
 ceph-users@lists.ceph.com 
 Envoyé: Mardi 9 Juin 2015 09:21:04 
 Objet: Re: rbd_cache, limiting read on high iops around 40k 
 
 Hi Alexandre, 
 
 We have also seen something very similar on Hammer(0.94-1). We were doing 
 some benchmarking for VMs hosted on hypervisor (QEMU-KVM, openstack-juno). 
 Each Ubuntu-VM has a RBD as root disk, and 1 RBD as additional storage. For 
 some strange reason it was not able to scale 4K- RR iops on each VM beyond 
 35-40k. We tried adding more RBDs to single VM, but no luck. However 
 increasing number of VMs to 4 on a single hypervisor did scale to some 
 extent. After this there was no much benefit we got from adding more VMs. 
 
 Here is the trend we have seen, x-axis is number of hypervisor, each 
 hypervisor has 4 VM, each VM has 1 RBD:- 
 
 
 
 
 VDbench is used as benchmarking tool. We were not saturating network and CPUs 
 at OSD nodes. We were not able to saturate CPUs at hypervisors, and that is 
 where we were suspecting of some throttling effect. However we haven't setted 
 any such limits from nova or kvm end. We tried some CPU pinning and other KVM 
 related tuning as well, but no luck. 
 
 We tried the same experiment on a bare metal. It was 4K RR IOPs were scaling 
 from 40K(1 RBD) to 180K(4 RBDs). But after that rather than scaling 

Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread pushpesh sharma
Hi Alexandre,

We have also seen something very similar on Hammer (0.94-1). We were doing
some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno).
Each Ubuntu VM has an RBD as root disk, and 1 RBD as additional storage. For
some strange reason we were not able to scale 4K-RR iops on each VM beyond
35-40k. We tried adding more RBDs to a single VM, but no luck. However,
increasing the number of VMs to 4 on a single hypervisor did scale to some
extent. After this there was not much benefit from adding more VMs.

Here is the trend we have seen; the x-axis is the number of hypervisors, each
hypervisor has 4 VMs, each VM has 1 RBD:

[inline chart not preserved in the archive]

VDbench is used as the benchmarking tool. We were not saturating the network or
the CPUs at the OSD nodes. We were not able to saturate the CPUs at the
hypervisors either, and that is where we were suspecting some throttling
effect. However, we haven't set any such limits from the nova or kvm end. We
tried some CPU pinning and other KVM-related tuning as well, but no luck.

We tried the same experiment on bare metal. There, the 4K RR IOPs were scaling
from 40K (1 RBD) to 180K (4 RBDs). But after that, rather than scaling beyond
that point, the numbers were actually degrading. (Single pipe, more congestion
effect.)

We never suspected that enabling rbd cache could be detrimental to
performance. It would be nice to root-cause the problem if that is the case.


On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER aderum...@odiso.com
wrote:

 Hi,

 I'm doing benchmark (ceph master branch), with randread 4k qdepth=32,
 and rbd_cache=true seem to limit the iops around 40k


 no cache
 
 1 client - rbd_cache=false - 1osd : 38300 iops
 1 client - rbd_cache=false - 2osd : 69073 iops
 1 client - rbd_cache=false - 3osd : 78292 iops


 cache
 -
 1 client - rbd_cache=true - 1osd : 38100 iops
 1 client - rbd_cache=true - 2osd : 42457 iops
 1 client - rbd_cache=true - 3osd : 45823 iops



 Is it expected ?



 fio result rbd_cache=false 3 osd
 
 rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
 ioengine=rbd, iodepth=32
 fio-2.1.11
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops]
 [eta 00m:00s]
 rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun  9
 07:48:42 2015
   read : io=1MB, bw=313169KB/s, iops=78292, runt= 32698msec
 slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
 clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
  lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
 clat percentiles (usec):
  |  1.00th=[  173],  5.00th=[  209], 10.00th=[  231], 20.00th=[  262],
  | 30.00th=[  282], 40.00th=[  302], 50.00th=[  322], 60.00th=[  346],
  | 70.00th=[  370], 80.00th=[  402], 90.00th=[  454], 95.00th=[  506],
  | 99.00th=[  628], 99.50th=[  692], 99.90th=[  860], 99.95th=[  948],
  | 99.99th=[ 1176]
 bw (KB  /s): min=238856, max=360448, per=100.00%, avg=313402.34,
 stdev=25196.21
 lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
 lat (msec) : 2=0.03%, 4=0.01%
   cpu  : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%,
 =64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 =64=0.0%
  complete  : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%,
 =64=0.0%
  issued: total=r=256/w=0/d=0, short=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=32

 Run status group 0 (all jobs):
READ: io=1MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s,
 mint=32698msec, maxt=32698msec

 Disk stats (read/write):
 dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
 aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
   sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%




 fio result rbd_cache=true 3osd
 --

 rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
 ioengine=rbd, iodepth=32
 fio-2.1.11
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops]
 [eta 00m:00s]
 rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun  9
 07:47:30 2015
   read : io=1MB, bw=183296KB/s, iops=45823, runt= 55866msec
 slat (usec): min=7, max=805, avg=21.26, stdev=15.84
 clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
  lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
 clat percentiles (usec):
  |  1.00th=[  227],  5.00th=[  274], 10.00th=[  306], 20.00th=[  350],
  | 30.00th=[  390], 40.00th=[  430], 50.00th=[  470], 60.00th=[  506],
  | 70.00th=[  548], 80.00th=[  596], 90.00th=[  660], 95.00th=[  724],
  | 99.00th=[  844], 99.50th=[  908], 99.90th=[ 1112], 99.95th=[ 1288],
  | 99.99th=[ 2192]
 bw (KB  /s): min=115280, max=204416, 

Re: [ceph-users] Blueprint Submission Open for CDS Jewel

2015-06-09 Thread Shishir Gowda
Hi Patrick,

I am facing a 403 error while trying to upload the blueprint.

With regards,
Shishir

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Haomai 
Wang
Sent: Monday, June 08, 2015 10:16 PM
To: Patrick McGarry
Cc: Ceph Devel; Ceph-User; ceph-annou...@ceph.com
Subject: Re: [ceph-users] Blueprint Submission Open for CDS Jewel

Hi Patrick,

It looks confusing to use this. Is it necessary to upload a txt file to
describe the blueprint instead of editing it directly online?

On Wed, May 27, 2015 at 5:05 AM, Patrick McGarry pmcga...@redhat.com wrote:
 It's that time again, time to gird up our loins and submit blueprints
 for all work slated for the Jewel release of Ceph.

 http://ceph.com/uncategorized/ceph-developer-summit-jewel/

 The one notable change for this CDS is that we'll be using the new
 wiki (on tracker.ceph.com) that is still undergoing migration from the
 old wiki. I have outlined the procedure in the announcement above, but
 please feel free to hit me with any questions or issues you may have.
 Thanks.


 --

 Best Regards,

 Patrick McGarry
 Director Ceph Community || Red Hat
 http://ceph.com  ||  http://community.redhat.com @scuttlemonkey ||
 @ceph



--
Best Regards,

Wheat






Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread Alexandre DERUMIER
Hi,

 We tried adding more RBDs to single VM, but no luck.

If you want to scale with more disks in a single qemu vm, you need to use the
iothread feature from qemu and assign 1 iothread per disk (works with
virtio-blk).
It's working for me; I can scale by adding more disks.
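
A rough libvirt sketch of 1 iothread per virtio-blk disk (the iothread count
and disk entries are placeholder examples, not an actual config):

  <domain>
    ...
    <iothreads>2</iothreads>
    <devices>
      <disk type='network' device='disk'>
        <driver name='qemu' type='raw' cache='none' iothread='1'/>
        ...
      </disk>
      <disk type='network' device='disk'>
        <driver name='qemu' type='raw' cache='none' iothread='2'/>
        ...
      </disk>
    </devices>
  </domain>

This needs a reasonably recent libvirt/qemu (roughly libvirt >= 1.2.8 and
qemu >= 2.1) for the per-disk iothread attribute.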


My benchmarks here are done with fio-rbd on the host.
I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host,
and around 250k iops with 10 clients and rbd_cache=on.


I just wonder why I don't see a performance decrease around 30k iops with 1 osd.

I'm going to see if this tracker
http://tracker.ceph.com/issues/11056

could be the cause.

(My master build was done some weeks ago.)



- Mail original -
De: pushpesh sharma pushpesh@gmail.com
À: aderumier aderum...@odiso.com
Cc: ceph-devel ceph-de...@vger.kernel.org, ceph-users 
ceph-users@lists.ceph.com
Envoyé: Mardi 9 Juin 2015 09:21:04
Objet: Re: rbd_cache, limiting read on high iops around 40k

Hi Alexandre, 

We have also seen something very similar on Hammer(0.94-1). We were doing some 
benchmarking for VMs hosted on hypervisor (QEMU-KVM, openstack-juno). Each 
Ubuntu-VM has a RBD as root disk, and 1 RBD as additional storage. For some 
strange reason it was not able to scale 4K- RR iops on each VM beyond 35-40k. 
We tried adding more RBDs to single VM, but no luck. However increasing number 
of VMs to 4 on a single hypervisor did scale to some extent. After this there 
was no much benefit we got from adding more VMs. 

Here is the trend we have seen, x-axis is number of hypervisor, each hypervisor 
has 4 VM, each VM has 1 RBD:- 



 
VDbench is used as benchmarking tool. We were not saturating network and CPUs 
at OSD nodes. We were not able to saturate CPUs at hypervisors, and that is 
where we were suspecting of some throttling effect. However we haven't setted 
any such limits from nova or kvm end. We tried some CPU pinning and other KVM 
related tuning as well, but no luck. 

We tried the same experiment on a bare metal. It was 4K RR IOPs were scaling 
from 40K(1 RBD) to 180K(4 RBDs). But after that rather than scaling beyond that 
point the numbers were actually degrading. (Single pipe more congestion effect) 

We never suspected that rbd cache enable could be detrimental to performance. 
It would nice to route cause the problem if that is the case. 

On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER  aderum...@odiso.com  
wrote: 


Hi, 

I'm doing benchmark (ceph master branch), with randread 4k qdepth=32, 
and rbd_cache=true seem to limit the iops around 40k 


no cache 
 
1 client - rbd_cache=false - 1osd : 38300 iops 
1 client - rbd_cache=false - 2osd : 69073 iops 
1 client - rbd_cache=false - 3osd : 78292 iops 


cache 
- 
1 client - rbd_cache=true - 1osd : 38100 iops 
1 client - rbd_cache=true - 2osd : 42457 iops 
1 client - rbd_cache=true - 3osd : 45823 iops 



Is it expected ? 



fio result rbd_cache=false 3 osd 
 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, 
iodepth=32 
fio-2.1.11 
Starting 1 process 
rbd engine: RBD version: 0.1.9 
Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 
00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 
2015 
read : io=1MB, bw=313169KB/s, iops=78292, runt= 32698msec 
slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
clat percentiles (usec): 
| 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
| 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
| 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
| 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
| 99.99th=[ 1176] 
bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
lat (msec) : 2=0.03%, 4=0.01% 
cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, =64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% 
complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, =64=0.0% 
issued : total=r=256/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=1MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, 
mint=32698msec, maxt=32698msec 

Disk stats (read/write): 
dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, 
aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 




fio result rbd_cache=true 3osd 
-- 

rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, 
iodepth=32 
fio-2.1.11 
Starting 1 process 
rbd engine: RBD version: 0.1.9 
Jobs: 

Re: [ceph-users] Multiple journals and an OSD on one SSD doable?

2015-06-09 Thread koukou73gr
On 06/08/2015 11:54 AM, Jan Schermer wrote:
 
 This should indicate the real wear:   100 Gigabytes_Erased
 0x0032   000   000   000Old_age   Always   -   62936
 Bytes written after compression:  233 SandForce_Internal  
 0x   000   000   000Old_age   Offline  -   40464
 Written bytes from OS perspective: 241 Lifetime_Writes_GiB 0x0032   
 000   000   000Old_age   Always   -   53826
 
 I wonder if it’s “write-mostly” for everyone… :)
 242 Lifetime_Reads_GiB  0x0032   000   000   000Old_age   Always  
  -   13085

LOL...

241 Lifetime_Writes_GiB -O--CK   000   000   000-10782
242 Lifetime_Reads_GiB  -O--CK   000   000   000-50

The SSD contains 2x 10GB journal partitions for 2x 4TB OSDs + 1x 20GB for the OS.

-K.



Re: [ceph-users] Beginners ceph journal question

2015-06-09 Thread Vickey Singh
Thanks Michael for your response.

Could you also please help me understand:

#1  On my ceph cluster, how can I confirm whether the journal is on a block
device partition or in a file?

#2  Is it true that by default ceph-deploy creates the journal on a dedicated
partition and the data on another partition if I use the command  ceph-deploy
osd create ceph-node1:/dev/sdb ?

I want to understand the concept of journal creation in Ceph. Hope you
can help me.

- vicky

On Tue, Jun 9, 2015 at 5:28 PM, Michael Kuriger mk7...@yp.com wrote:

   You could mount /dev/sdb to a filesystem, such as /ceph-disk, and then
 do this:
 ceph-deploy osd create ceph-node1:/ceph-disk

  Your journal would be a file doing it this way.






 Michael Kuriger

 Sr. Unix Systems Engineer

 * mk7...@yp.com |( 818-649-7235

   From: Vickey Singh vickey.singh22...@gmail.com
 Date: Tuesday, June 9, 2015 at 12:21 AM
 To: ceph-users@lists.ceph.com ceph-users@lists.ceph.com
 Subject: [ceph-users] Beginners ceph journal question

   Hello Cephers

  Beginners question on Ceph Journals creation. Need answers from experts.

  - Is it true that by default ceph-deploy creates journal on dedicated
 partition and data on another partition. It does not creates journal on
 file ??

  ceph-deploy osd create ceph-node1:/dev/sdb

  This commands is creating
 data partition : /dev/sdb2
 Journal Partition : /dev/sdb1

  In ceph-deploy command i have not specified journal partition but still
 it creates a journal on sdb1 ?

  - How can i confirm if journal is on block device partition or on file ?

  - How can i create journal on a file ? command would be helpful ?

  Regards
 Vicky



Re: [ceph-users] Beginners ceph journal question

2015-06-09 Thread Michael Kuriger
You could put a filesystem on /dev/sdb, mount it somewhere such as /ceph-disk, and then do this:
ceph-deploy osd create ceph-node1:/ceph-disk

Your journal would be a file when doing it this way.
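
A rough sketch of the full sequence (filesystem type and mount point are just
examples):

  mkfs.xfs /dev/sdb
  mkdir -p /ceph-disk
  mount /dev/sdb /ceph-disk      # add it to /etc/fstab so it persists
  ceph-deploy osd create ceph-node1:/ceph-disk

With a directory-based OSD like this, the journal ends up as a regular file
inside the OSD data directory, sized by the osd journal size setting in
ceph.conf.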





Michael Kuriger
Sr. Unix Systems Engineer
mk7...@yp.com | 818-649-7235


From: Vickey Singh vickey.singh22...@gmail.com
Date: Tuesday, June 9, 2015 at 12:21 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Beginners ceph journal question

Hello Cephers

Beginners question on Ceph Journals creation. Need answers from experts.

- Is it true that by default ceph-deploy creates journal on dedicated partition 
and data on another partition. It does not creates journal on file ??

ceph-deploy osd create ceph-node1:/dev/sdb

This commands is creating
data partition : /dev/sdb2
Journal Partition : /dev/sdb1

In ceph-deploy command i have not specified journal partition but still it 
creates a journal on sdb1 ?

- How can i confirm if journal is on block device partition or on file ?

- How can i create journal on a file ? command would be helpful ?

Regards
Vicky


[ceph-users] calculating maximum number of disk and node failure that can be handled by cluster with out data loss

2015-06-09 Thread kevin parrikar
I have a 4-node cluster, each node with 5 disks (4 OSD disks and 1 operating
system disk; the cluster also runs 3 monitor processes), with the default
replica count of 3.

Total OSD disks : 16
Total Nodes : 4

How can I calculate the

   - Maximum number of disk failures my cluster can handle without any
   impact on current data and new writes.
   - Maximum number of node failures my cluster can handle without any
   impact on current data and new writes.

Thanks for any help


Re: [ceph-users] Beginners ceph journal question

2015-06-09 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

#1 `readlink /var/lib/ceph/osd/<cluster>-<id>/journal` - if it returns nothing
then the journal is a file, if it returns something it is a partition.
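
For example (the OSD id and device are hypothetical):

  $ readlink /var/lib/ceph/osd/ceph-0/journal
  /dev/sdb1                      <- symlink to a device: journal is a partition

  $ readlink /var/lib/ceph/osd/ceph-1/journal
                                 <- no output: journal is a plain file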

#2 It appears that the default of ceph-deploy is to create and use a
partition (I don't use ceph-deploy, so I can't be authoritative). This
makes sense, as there is a lot of overhead in going through a file system
and a file; by using a raw partition, it cuts all that out and gives better
performance.

All writes land at the journal first, then are later flushed to the
XFS/EXT4 file system. This helps turn random I/O into 'less'-random
I/O. You can see big performance gains by locating the journal on a
partition of a quality SSD (there are plenty of threads about
quality SSD).
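
If you do want the journal on a separate device with ceph-deploy, the data
disk and journal can be given together, e.g. (device names are examples only):

  ceph-deploy osd create ceph-node1:/dev/sdb:/dev/sdc1

where /dev/sdb holds the data and /dev/sdc1 is the journal partition on the
SSD.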
-BEGIN PGP SIGNATURE-
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVdw89CRDmVDuy+mK58QAAuWkP/1FqhX1DhZXKOIfMf8Of
6H0Y3Qi+eCWmpcteOZIK+fF3ZXXHsAp2hU+ON8TWjLZXEmPQj49tuJfkQfdj
DYO4s1k2Bc5D3PP5UsIM9t76rJuJfBXk7i8816GIvwKgiDDiVnNG4UZut5xW
TzpOwpq8vPB3E3cWI/Zg7W4R61D7SbtIzrdu6LVaCatLmHMIM2Aj4Zgdjlqz
1+1t2oy+8NmTr8OqDeZCV+9HdsPWtVkhmcba9CJvT1MYHrXvhMVBKZcso0mb
bzS9xxw0FfyX7cDZQUsduP/kR2u6yEj6vyXw2sRGBDAj6Cx6Zc2YcfiDi/cJ
hOmN+8nRo4D35O97qFtVGPrcJJjB4titoA/yfTslqzo89omkOCx1ZycYlwzq
xvxuxfY+XznwvpEu40AayPY5e1RqIeY9ntRWt3rTScOY/J3xj7BgawA+Y8rp
QVoFyzp+/sQBkTuiVivknUYUDpVXpwng2YLVl9hrjhZfAW/orLiIm/ztq2ro
HnrCeyMpGICBdptUYk5XBcEfwMStxaUrWO3cU8QkzYnIToh0aOOX7XkpXuMa
s/3d8gtMtC1btATHFBzZ/WIOwhOif7kRSXS9fOi/SYEh0164HunDuAb1KjEM
tI0dly7JNGPagz3VAsBXaAoxoKXB1wfkdo0I59OmFTDGwoYOLQi5jhQd7csl
A7Am
=1XtM
-END PGP SIGNATURE-



Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Jun 9, 2015 at 8:35 AM, Vickey Singh vickey.singh22...@gmail.com
wrote:

 Thanks Michael for your response.

 Could you also please help in understanding

 #1  On my ceph cluster , how can i confirm if journal is on block device
 partition or on file ?

 #2  Is it true that by default ceph-deploy creates journal on dedicated
 partition and data on another partition if i use the command  ceph-deploy
 osd create ceph-node1:/dev/sdb

 I want to understand the concept of journals creation in Ceph. Hope you
 will help me.

 - vicky

 On Tue, Jun 9, 2015 at 5:28 PM, Michael Kuriger mk7...@yp.com wrote:

   You could mount /dev/sdb to a filesystem, such as /ceph-disk, and then
 do this:
 ceph-deploy osd create ceph-node1:/ceph-disk

  Your journal would be a file doing it this way.



 [image: yp]



 Michael Kuriger

 Sr. Unix Systems Engineer

 * mk7...@yp.com |( 818-649-7235

   From: Vickey Singh vickey.singh22...@gmail.com
 Date: Tuesday, June 9, 2015 at 12:21 AM
 To: ceph-users@lists.ceph.com ceph-users@lists.ceph.com
 Subject: [ceph-users] Beginners ceph journal question

   Hello Cephers

  Beginners question on Ceph Journals creation. Need answers from experts.

  - Is it true that by default ceph-deploy creates journal on dedicated
 partition and data on another partition. It does not creates journal on
 file ??

  ceph-deploy osd create ceph-node1:/dev/sdb

  This commands is creating
 data partition : /dev/sdb2
 Journal Partition : /dev/sdb1

  In ceph-deploy command i have not specified journal partition but still
 it creates a journal on sdb1 ?

  - How can i confirm if journal is on block device partition or on file ?

  - How can i create journal on a file ? command would be helpful ?

  Regards
 Vicky





Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread Alexandre DERUMIER
At high queue-depths and high IOPS, I would suspect that the bottleneck is 
the single, coarse-grained mutex protecting the cache data structures. It's 
been a back burner item to refactor the current cache mutex into 
finer-grained locks. 

Jason 

Thanks for the explanation, Jason.

Anyway, inside qemu I'm around 35-40k iops with or without rbd_cache, so it
doesn't make much difference currently.
(Maybe some other qemu bottleneck.)
 

- Mail original -
De: Jason Dillaman dilla...@redhat.com
À: Mark Nelson mnel...@redhat.com
Cc: aderumier aderum...@odiso.com, pushpesh sharma 
pushpesh@gmail.com, ceph-devel ceph-de...@vger.kernel.org, 
ceph-users ceph-users@lists.ceph.com
Envoyé: Mardi 9 Juin 2015 15:39:50
Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

 In the past we've hit some performance issues with RBD cache that we've 
 fixed, but we've never really tried pushing a single VM beyond 40+K read 
 IOPS in testing (or at least I never have). I suspect there's a couple 
 of possibilities as to why it might be slower, but perhaps joshd can 
 chime in as he's more familiar with what that code looks like. 
 

At high queue-depths and high IOPS, I would suspect that the bottleneck is the 
single, coarse-grained mutex protecting the cache data structures. It's been a 
back burner item to refactor the current cache mutex into finer-grained locks. 

Jason 



Re: [ceph-users] calculating maximum number of disk and node failure that can be handled by cluster with out data loss

2015-06-09 Thread Nick Fisk
Hi Kevin,

 

Ceph by default will make sure no copies of the data are on the same host. So 
with a replica count of 3, you could lose 2 hosts without losing any data or 
operational ability. If by some luck all disk failures were constrained to 2 
hosts, you could in theory have up to 8 disks fail. Otherwise if the disk 
failures are spread amongst the hosts, you could withstand 2 disk failures.
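
The relevant settings can be checked per pool, e.g. for a pool named rbd (the
pool name is just an example):

  ceph osd pool get rbd size
  ceph osd pool get rbd min_size
  ceph osd tree

size is the replica count, min_size is how many copies must be available for
I/O to continue, and ceph osd tree shows how the OSDs map onto hosts, i.e. the
failure domain.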

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of kevin 
parrikar
Sent: 09 June 2015 16:54
To: ceph-users@lists.ceph.com
Subject: [ceph-users] calculating maximum number of disk and node failure that 
can be handled by cluster with out data loss

 

I have 4 node cluster each with 5 disks (4 OSD and 1 Operating system also 
hosting 3 monitoring process) with default replica 3.

 

Total OSD disks : 16 

Total Nodes : 4

 

How can i calculate the 

*   Maximum number of disk failures my cluster can handle with out  any 
impact on current data and new writes.
*   Maximum number of node failures  my cluster can handle with out any 
impact on current data and new writes.

Thanks for any help






Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I also saw a similar performance increase by using alternative memory
allocators. What I found was that Ceph OSDs performed well with either
tcmalloc or jemalloc (except when RocksDB was built with jemalloc
instead of tcmalloc, I'm still working to dig into why that might be
the case).

However, I found that tcmalloc with QEMU/KVM was very detrimental to
small I/O, but provided huge gains for I/O >= 1MB. Jemalloc was much
better for QEMU/KVM in the tests that we ran. [1]

I'm currently looking into I/O bottlenecks around the 16KB range and
I'm seeing a lot of time in thread creation and destruction, the
memory allocators are quite a bit down the list (both fio with
ioengine rbd and on the OSDs). I wonder what the difference can be.
I've tried using the async messenger but there wasn't a huge
difference. [2]

Further down the rabbit hole

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html
[2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html
-BEGIN PGP SIGNATURE-
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8
unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU
YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87
afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2
S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3
vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51
9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO
qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3
Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b
6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13
R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ
1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4
oSJX
=k281
-END PGP SIGNATURE-

Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER aderum...@odiso.com wrote:
Frankly, I'm a little impressed that without RBD cache we can hit 80K
IOPS from 1 VM!

 Note that theses result are not in a vm (fio-rbd on host), so in a vm we'll 
 have overhead.
 (I'm planning to send results in qemu soon)

How fast are the SSDs in those 3 OSDs?

 Theses results are with datas in buffer memory of osd nodes.

 When reading fulling on ssd (intel s3500),

 For 1 client,

 I'm around 33k iops without cache and 32k iops with cache, with 1 osd.
 I'm around 55k iops without cache and 38k iops with cache, with 3 osd.

 with multiple clients jobs, I can reach around 70kiops by osd , and 250k iops 
 by osd when datas are in buffer.

 (cpus servers/clients are 2x 10 cores 3,1ghz e5 xeon)



 small tip :
 I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 
 20%

 LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
 LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...

 as a lot of time is spent in malloc/free


 (qemu support also tcmalloc since some months , I'll bench it too
   https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html)



 I'll try to send full bench results soon, from 1 to 18 ssd osd.




 - Mail original -
 De: Mark Nelson mnel...@redhat.com
 À: aderumier aderum...@odiso.com, pushpesh sharma 
 pushpesh@gmail.com
 Cc: ceph-devel ceph-de...@vger.kernel.org, ceph-users 
 ceph-users@lists.ceph.com
 Envoyé: Mardi 9 Juin 2015 13:36:31
 Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

 Hi All,

 In the past we've hit some performance issues with RBD cache that we've
 fixed, but we've never really tried pushing a single VM beyond 40+K read
 IOPS in testing (or at least I never have). I suspect there's a couple
 of possibilities as to why it might be slower, but perhaps joshd can
 chime in as he's more familiar with what that code looks like.

 Frankly, I'm a little impressed that without RBD cache we can hit 80K
 IOPS from 1 VM! How fast are the SSDs in those 3 OSDs?

 Mark

 On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
 It's seem that the limit is mainly going in high queue depth (+-  16)

 Here the result in iops with 1client- 4krandread- 3osd - with differents 
 queue depth size.
 rbd_cache is almost the same than without cache with queue depth 16


 cache
 -
 qd1: 1651
 qd2: 3482
 qd4: 7958
 qd8: 17912
 qd16: 36020
 qd32: 42765
 qd64: 46169

 no cache
 
 qd1: 1748
 qd2: 3570
 qd4: 8356
 qd8: 17732
 qd16: 41396
 qd32: 78633
 qd64: 79063
 qd128: 79550


 - Mail original -
 De: aderumier aderum...@odiso.com
 À: pushpesh sharma pushpesh@gmail.com
 Cc: ceph-devel ceph-de...@vger.kernel.org, ceph-users 
 ceph-users@lists.ceph.com
 Envoyé: Mardi 9 Juin 2015 09:28:21
 Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

 Hi,

 We tried adding more RBDs to single VM, but no luck.

 If you want to scale with 

Re: [ceph-users] calculating maximum number of disk and node failure that can be handled by cluster with out data loss

2015-06-09 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

If you are using the default rule set (which I think has min_size 2),
you can sustain 1-4 disk failures or one host failure.

The reason the disk failure count varies so wildly is that you can lose
all the disks in one host.

You can lose up to another 4 disks (in the same host) or 1 host
without data loss, but I/O will block until Ceph can replicate at
least one more copy (assuming the min_size 2 stated above).
- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Jun 9, 2015 at 9:53 AM, kevin parrikar  wrote:
 I have 4 node cluster each with 5 disks (4 OSD and 1 Operating system also
 hosting 3 monitoring process) with default replica 3.

 Total OSD disks : 16
 Total Nodes : 4

 How can i calculate the

 Maximum number of disk failures my cluster can handle with out  any impact
 on current data and new writes.
 Maximum number of node failures  my cluster can handle with out any impact
 on current data and new writes.

 Thanks for any help



-BEGIN PGP SIGNATURE-
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVdxBACRDmVDuy+mK58QAAfIoQAK0ozApTIkk1dzAdONyX
vJ7r6q4LRQF9OAzA2qRKZdLL9R7bk2i5+VRJ8+Xst/jV9F4jEp+Owy+bZ5JL
F6C3/5tH8fco1enVsYJlivFhOUZij2RpUupFViWe5rDmq0EPwZC3cmFYlA2n
UYtzDqAvOWeNQTUYlE7Ya4+prZexLFofz+N3+k8XylEI0w4++6iR8znxGSfE
jtyXW/zzlZiLO1LZ4vbDviWRk7SRmE5dJV6Tc5HUEmkAB7lgkVJriBpHHY3V
vIs5J5xXB+VH09Y+Ka4E/okyKt/tVd36NMvWz2v9xluOXFb1iLK9yQMyHeqr
JbynllpM5E8JdBTvQq8eW2khZ2q2NaIugoBvhGWGQluoQz0WN82EdevY137a
qR4j2xpHaG0oMwuWgMtUzpg0HcSccs+UQKVzkFCLXBlNnW4m/W63EfZMmh2B
nusQ0LGVoB4EjFTGE5wHabOqUOdkaPCM/pSh9UTw6COXc8ytTbK4FCS3msiO
BvmSYWoQFINfz6bOR2mpud1fB1k+nvEheECC3wZzbEo1w5bMx6lOdLt0kIe4
hJzR7o4TcfNoR/N3CGlfN6d+pk8yxoxVvcIiGTf3uRZZep+t8w6kyrA5XxlR
orvhDwdVOkGVQL8jYGzelWk+Er9ILvHUsL4Semx4PEv8xAR9Dx//UHzyrviQ
YJsn
=B31a
-END PGP SIGNATURE-


Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread Alexandre DERUMIER
Hi Robert,

What I found was that Ceph OSDs performed well with either 
tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
instead of tcmalloc, I'm still working to dig into why that might be 
the case). 
Yes, from my tests, for the osd tcmalloc is a little faster (but only very
little) than jemalloc.



However, I found that tcmalloc with QEMU/KVM was very detrimental to 
small I/O, but provided huge gains in I/O =1MB. Jemalloc was much 
better for QEMU/KVM in the tests that we ran. [1]


I have just done a qemu test (4k randread - rbd_cache=off); I don't see a
speed regression with tcmalloc.
With the qemu iothread, tcmalloc gives a speed increase over glibc;
with the qemu iothread, jemalloc gives a speed decrease.

Without an iothread, jemalloc gives a big speed increase.

This is with:
- qemu 2.3
- tcmalloc 2.2.1
- jemalloc 3.6
- libc6 2.19


qemu : no-iothread : glibc    : iops=33395
qemu : no-iothread : tcmalloc : iops=34516 (+3%)
qemu : no-iothread : jemalloc : iops=42226 (+26%)

qemu : iothread : glibc    : iops=34516
qemu : iothread : tcmalloc : iops=38676 (+12%)
qemu : iothread : jemalloc : iops=28023 (-19%)


(The benefit of iothreads is that we can scale with more disks in 1vm)
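
(For reference, one way to compare allocators without rebuilding qemu is to
preload them, the same as the fio/rados tip earlier in the thread; the library
paths below are distribution-dependent examples:

  LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 qemu-system-x86_64 ...
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 qemu-system-x86_64 ...
)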


fio results:


qemu : iothread : tcmalloc : iops=38676
-
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 
00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun  9 18:16:53 
2015
  read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
 lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
clat percentiles (usec):
 |  1.00th=[  402],  5.00th=[  466], 10.00th=[  510], 20.00th=[  572],
 | 30.00th=[  636], 40.00th=[  716], 50.00th=[  780], 60.00th=[  852],
 | 70.00th=[  932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
 | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
 | 99.99th=[ 3888]
bw (KB  /s): min=123888, max=198584, per=100.00%, avg=154824.40, 
stdev=16978.03
lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
  cpu  : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0%
 issued: total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, 
mint=33889msec, maxt=33889msec

Disk stats (read/write):
  vdb: ios=1302739/0, merge=0/0, ticks=93/0, in_queue=934096, util=99.77%



qemu : no-iothread : tcmalloc : iops=34516
-
Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 
00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun  9 18:19:08 
2015
  read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
 lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
clat percentiles (usec):
 |  1.00th=[  434],  5.00th=[  510], 10.00th=[  564], 20.00th=[  652],
 | 30.00th=[  732], 40.00th=[  812], 50.00th=[  876], 60.00th=[  940],
 | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
 | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
 | 99.99th=[ 4320]
bw (KB  /s): min=77312, max=185576, per=99.74%, avg=137709.88, 
stdev=16883.77
lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
  cpu  : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0%
 issued: total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, 
mint=37974msec, maxt=37974msec

Disk stats (read/write):
  vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86%



qemu : iothread : glibc : iops=34516
-

rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=32
fio-2.1.11
Starting 1 process

[ceph-users] .New Ceph cluster - cannot add additional monitor

2015-06-09 Thread Mike Carlson
We have a new ceph cluster, and when I follow the guide
(http://ceph.com/docs/master/start/quick-ceph-deploy/), during the section
where you can add additional monitors, it fails, and it almost seems like
it's using an improper IP address.


We have 4 nodes:

   - lts-mon
   - lts-osd1
   - lts-osd2
   - lts-osd3

Using, ceph-deploy, we have created a new cluster with lts-mon as the
initial monitor:


ceph-deploy new lts-mon
ceph-deploy install lts-mon lts-osd1 lts-osd2 lts-osd3
ceph-deploy mon create-initial

ceph-deploy osd prepare 


ceph-deploy mds lts-mon


The only modifications I made to ceph.conf were to include the public and
cluster network settings, and set the osd pool default size:

[global]
fsid = 5ca0e0f5-d367-48b8-97b4-48e8b12fd517
mon_initial_members = lts-mon
mon_host = 10.5.68.236
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 3
public_network = 10.5.68.0/22
cluster_network = 10.1.1.0/24


This all seemed fine, and after adding in all of our osd's, ceph -s reports:

# ceph -s
cluster f4adbd94-bf49-42f2-bd57-ebc7db9aa863
 health HEALTH_WARN
too few PGs per OSD (1 < min 30)
 monmap e1: 1 mons at {lts-mon=10.5.68.236:6789/0}
election epoch 1, quorum 0 lts-mon
 osdmap e471: 102 osds: 102 up, 102 in
  pgmap v973: 64 pgs, 1 pools, 0 bytes data, 0 objects
515 GB used, 370 TB / 370 TB avail
  64 active+clean


We have not defined the default pg_num, so the warning seems okay for now.

The problem we have is when adding a new monitor:

ceph-deploy mon create lts-osd1

[ceph_deploy.conf][DEBUG ] found configuration file at:
/root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.25): /usr/local/bin/ceph-deploy mon
create lts-osd1
[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts lts-osd1
[ceph_deploy.mon][DEBUG ] detecting platform for host lts-osd1 ...
[lts-osd1][DEBUG ] connection detected need for sudo
[lts-osd1][DEBUG ] connected to host: lts-osd1
[lts-osd1][DEBUG ] detect platform information from remote host
[lts-osd1][DEBUG ] detect machine type
[ceph_deploy.mon][INFO  ] distro info: Ubuntu 14.04 trusty
[lts-osd1][DEBUG ] determining if provided host has same hostname in remote
[lts-osd1][DEBUG ] get remote short hostname
[lts-osd1][DEBUG ] deploying mon to lts-osd1
[lts-osd1][DEBUG ] get remote short hostname
[lts-osd1][DEBUG ] remote hostname: lts-osd1
[lts-osd1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[lts-osd1][DEBUG ] create the mon path if it does not exist
[lts-osd1][DEBUG ] checking for done path:
/var/lib/ceph/mon/ceph-lts-osd1/done
[lts-osd1][DEBUG ] create a done file to avoid re-doing the mon deployment
[lts-osd1][DEBUG ] create the init path if it does not exist
[lts-osd1][DEBUG ] locating the `service` executable...
[lts-osd1][INFO  ] Running command: sudo initctl emit ceph-mon cluster=ceph
id=lts-osd1
[lts-osd1][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon
/var/run/ceph/ceph-mon.lts-osd1.asok mon_status
[lts-osd1][DEBUG ]

[lts-osd1][DEBUG ] status for monitor: mon.lts-osd1
[lts-osd1][DEBUG ] {
[lts-osd1][DEBUG ]   election_epoch: 0,
[lts-osd1][DEBUG ]   extra_probe_peers: [
[lts-osd1][DEBUG ] 10.5.68.236:6789/0
[lts-osd1][DEBUG ]   ],
[lts-osd1][DEBUG ]   monmap: {
[lts-osd1][DEBUG ] created: 0.00,
[lts-osd1][DEBUG ] epoch: 0,
[lts-osd1][DEBUG ] fsid: 5ca0e0f5-d367-48b8-97b4-48e8b12fd517,
[lts-osd1][DEBUG ] modified: 0.00,
[lts-osd1][DEBUG ] mons: [
[lts-osd1][DEBUG ]   {
[lts-osd1][DEBUG ] addr: 0.0.0.0:0/1,
[lts-osd1][DEBUG ] name: lts-mon,
[lts-osd1][DEBUG ] rank: 0
[lts-osd1][DEBUG ]   }
[lts-osd1][DEBUG ] ]
[lts-osd1][DEBUG ]   },
[lts-osd1][DEBUG ]   name: lts-osd1,
[lts-osd1][DEBUG ]   outside_quorum: [],
[lts-osd1][DEBUG ]   quorum: [],
[lts-osd1][DEBUG ]   rank: -1,
[lts-osd1][DEBUG ]   state: probing,
[lts-osd1][DEBUG ]   sync_provider: []
[lts-osd1][DEBUG ] }
[lts-osd1][DEBUG ]

[lts-osd1][INFO  ] monitor: mon.lts-osd1 is currently at the state of
probing
[lts-osd1][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon
/var/run/ceph/ceph-mon.lts-osd1.asok mon_status
[lts-osd1][WARNIN] lts-osd1 is not defined in `mon initial members`
[lts-osd1][WARNIN] monitor lts-osd1 does not exist in monmap


The log of the monitor I was trying to add shows:

2015-06-09 11:33:24.661466 7fef2a806700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2015-06-09 11:33:24.661478 7fef2a806700  0 -- 10.5.68.229:6789/0 >> 10.5.68.236:6789/0 pipe(0x3571000 sd=13 :40912 s=1 pgs=0 cs=0 l=0 c=0x34083c0).failed verifying authorize reply
2015-06-09 11:33:24.763579 7fef2eb83700  0 log_channel(audit) log [DBG]
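
For completeness, a few generic checks for this kind of cephx failure when
adding a monitor -- only a sketch, the actual cause is not confirmed here;
hostnames are the ones above and the commands are plain ssh/ceph-deploy:

# cephx is sensitive to clock skew, so make sure the clocks agree
for h in lts-mon lts-osd1 lts-osd2 lts-osd3; do ssh $h date; done

# push one consistent ceph.conf everywhere and re-gather the monitor keys
ceph-deploy --overwrite-conf config push lts-mon lts-osd1 lts-osd2 lts-osd3
ceph-deploy gatherkeys lts-mon

# then retry adding the monitor
ceph-deploy --overwrite-conf mon create lts-osd1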

[ceph-users] RGW blocked threads/timeouts

2015-06-09 Thread Daniel Maraio

Hello Cephers,

  I have a question about something we experience in our cluster. When
we add new capacity or suffer failures, we often get blocked requests
during the rebuilding. This leads to RGW threads blocking and eventually
no longer serving new requests. I suspect that if we set the RGW thread
timeouts low enough, this could alleviate the problem. We don't
necessarily care if a certain portion of requests gets ignored during
that period, as long as RGW can respond to some of them.


  So my question is: has anyone else experienced this, and what have you
done to solve it? The two timeout settings I am looking at are listed
below, and I'm not certain what the distinction is between them; perhaps
someone could fill me in. Thank you, and I appreciate the assistance!


  The documentation is not very clear about the differences, and after
some brief searching I didn't find any discussions of these values.


rgw_op_thread_timeout
rgw_op_thread_suicide_timeout
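
For what it's worth, a minimal sketch of how these would be set -- the section
name and values below are only placeholders, and my (unverified) understanding
is that the plain timeout merely flags the op thread as timed out in the
heartbeat map, while the suicide variant makes the radosgw process abort if a
thread stays stuck for that long:

[client.radosgw.gateway]
    # both values are in seconds; 0 leaves the suicide behaviour disabled
    rgw op thread timeout = 120
    rgw op thread suicide timeout = 0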

- Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)

2015-06-09 Thread Gregory Farnum
On Mon, Jun 8, 2015 at 5:20 PM, Francois Lafont flafdiv...@free.fr wrote:
 Hi,

 On 27/05/2015 22:34, Gregory Farnum wrote:

 Sorry for the delay; I've been traveling.

 No problem, me too, I'm not really fast to answer. ;)

 Ok, I see. According to the online documentation, the way to close
 a cephfs client session is:

 ceph daemon mds.$id session ls   # to get the $session_id and the $address
 ceph osd blacklist add $address
 ceph osd dump  # to get the $epoch
 ceph daemon mds.$id osdmap barrier $epoch
 ceph daemon mds.$id session evict $session_id

 Is it correct?

 With the commands above, could I reproduce the client freeze in my testing
 cluster?

 Yes, I believe so.

 In fact, after some tests, the commands above evict the client correctly
 (`ceph daemon mds.1 session ls` returns an empty array), but on the client
 side a new connection is automatically established as soon as the cephfs
 mountpoint is requested.

What do you mean, as soon as it's requested? The session evict is a
polite close, yes, and there's nothing blocking future mounts if you
try and do it again or if you don't have any caps...but if you have
open files I'd expect things to get stuck. Maybe I'm overlooking
something.

 In fact, I haven't succeeded in reproducing the
 freeze. ;) I tried to stop the network on the client side (ifdown -a)
 and after a few minutes (more than 60 seconds though), I saw "closing
 stale session client ..." in the mds log. But after an `ifup -a`, I got
 back a cephfs connection and a mountpoint in good health.

Was this while you were doing writes to the filesystem, or was it
idle? I don't remember all the mechanisms in great detail but if the
mount is totally idle I'd expect it to behave much differently from
one where you have files open and being written to.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephx error - renew key

2015-06-09 Thread tombo
  

Hello guys, 

today we had one storage node (19 OSDs) down for 4 hours, and now we are
observing various problems. When I tried to restart one OSD, I got an
error related to cephx:

2015-06-09 21:09:49.983522 7fded00c7700 0 auth: could not find secret_id=6238
2015-06-09 21:09:49.983585 7fded00c7700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=6238
2015-06-09 21:09:49.983595 7fded00c7700 0 -- X.X.X.32:6808/728850 >> X.X.X.32:6852/1474277 pipe(0x7fdf47291200 sd=90 :6808 s=0 pgs=0 cs=0 l=0 c=0x7fdf33340940).accept: got bad authorizer

The configuration is:

auth cluster required = cephx
auth service required = none
auth client required = none

So, as I understand it, it is not possible to disable auth entirely on the
fly... Is it possible to renew the key for an OSD, to see if that helps?
If yes, how? Remove the old one with

ceph auth del osd.{osd-num}

and generate a new one with

ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring

? And I don't want to lose that OSD's data (as usual, nobody wants to :)).
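
For reference, a sketch of what that renewal could look like end to end --
osd id 12 is just an example, and this is only my reading of the standard
commands, not a tested recipe:

# stop the OSD, then drop its old key from the monitors
ceph auth del osd.12

# generate a fresh key into the OSD's keyring file
ceph-authtool --create-keyring /var/lib/ceph/osd/ceph-12/keyring --gen-key -n osd.12

# register the new key with the usual caps and restart the daemon;
# the data on disk is untouched by a key change
ceph auth add osd.12 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-12/keyring
service ceph start osd.12   # or the equivalent for your init system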

Thanks for help.
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] apply/commit latency

2015-06-09 Thread Gregory Farnum
On Thu, Jun 4, 2015 at 3:57 AM, Межов Игорь Александрович
me...@yuterra.ru wrote:
 Hi!

 My deployments have seen many different versions of ceph. Pre 0.80.7, I've
 seen those numbers being pretty high. After upgrading to 0.80.7, all of a
 sudden, commit latency of all OSDs dropped to 0-1 ms, and apply latency
 remains pretty low most of the time.

 We now use Ceph 0.80.7-1~bpo70+1 on Debian Wheezy + a 3.16.4 kernel
 backported from Jessie, and I can't see a commit latency in the perf dumps,
 only commitcycle_latency. Are these the right perf parameters you are
 discussing here? Our values are too high - on the nodes with RAID0-per-disk
 it is between 20 and 120 ms, and on the nodes with straight HBA passthrough
 it is even worse: 200-600 ms.

This particular value is basically how long the syncfs call takes to
complete. It's not the direct time for a particular operation to
commit.
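
For anyone who wants to pull the same numbers, a couple of read-only
commands (osd.0 is just an example id; the grep is only a convenience):

# cluster-wide per-OSD commit/apply latency, in ms
ceph osd perf

# per-daemon counters from the admin socket, including the filestore
# journal and commitcycle latencies discussed above
ceph daemon osd.0 perf dump | python -m json.tool | grep -i latency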


 But apply latency is between 3 and 19ms with avg=7.2 ms, journal latencies
 are
 also good = 0.49-1.84 ms.

Yeah, those seem reasonable.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread Irek Fasikhov
Hi, Alexandre.

Very good work!
Do you have a rpm-file?
Thanks.

2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER aderum...@odiso.com:

 Hi,

 I have tested qemu with last tcmalloc 2.4, and the improvement is huge
 with iothread: 50k iops (+45%) !



 qemu : no-iothread : glibc : iops=33395
 qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
 qemu : no-iothread : jemalloc : iops=42226 (+26%)
 qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)


 qemu : iothread : glibc : iops=34516
 qemu : iothread : tcmalloc : iops=38676 (+12%)
 qemu : iothread : jemalloc : iops=28023 (-19%)
 qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)





 qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
 --
 rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
 ioengine=libaio, iodepth=32
 fio-2.1.11
 Starting 1 process
 Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops]
 [eta 00m:00s]
 rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10
 05:54:24 2015
   read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
 slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
 clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
  lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
 clat percentiles (usec):
  |  1.00th=[  318],  5.00th=[  378], 10.00th=[  418], 20.00th=[  474],
  | 30.00th=[  516], 40.00th=[  564], 50.00th=[  612], 60.00th=[  652],
  | 70.00th=[  700], 80.00th=[  756], 90.00th=[  860], 95.00th=[  980],
  | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
  | 99.99th=[ 3760]
 bw (KB  /s): min=145608, max=249688, per=100.00%, avg=201108.00,
 stdev=21718.87
 lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
 lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
   cpu  : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
  issued: total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=32

 Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s,
 mint=26070msec, maxt=26070msec

 Disk stats (read/write):
   vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840,
 util=99.73%






 rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
 ioengine=libaio, iodepth=32
 fio-2.1.11
 Starting 1 process
 Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops]
 [eta 00m:00s]
 rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10
 06:05:06 2015
   read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
 slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
 clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
  lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
 clat percentiles (usec):
  |  1.00th=[  462],  5.00th=[  516], 10.00th=[  548], 20.00th=[  596],
  | 30.00th=[  652], 40.00th=[  764], 50.00th=[  868], 60.00th=[  940],
  | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
  | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
  | 99.99th=[ 3632]
 bw (KB  /s): min=98352, max=177328, per=99.91%, avg=143772.11,
 stdev=21782.39
 lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
 lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
   cpu  : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
  issued: total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=32

 Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s,
 mint=36435msec, maxt=36435msec

 Disk stats (read/write):
   vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716,
 util=99.85%


 - Mail original -
 De: aderumier aderum...@odiso.com
 À: Robert LeBlanc rob...@leblancnet.us
 Cc: Mark Nelson mnel...@redhat.com, ceph-devel 
 ceph-de...@vger.kernel.org, pushpesh sharma pushpesh@gmail.com,
 ceph-users ceph-users@lists.ceph.com
 Envoyé: Mardi 9 Juin 2015 18:47:27
 Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

 Hi Robert,

 What I found was that Ceph OSDs performed well with either
 tcmalloc or jemalloc (except when RocksDB was built with jemalloc
 instead of tcmalloc, I'm still working to dig into why that might be
 the case).
 yes,from my test, for osd tcmalloc is a little faster (but very 

Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread Alexandre DERUMIER
Very good work! 
Do you have a rpm-file? 
Thanks. 
No, sorry, I compiled it manually (and I'm using Debian Jessie as the client).
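
In case it helps, roughly what a manual build plus preload can look like --
the paths and the LD_PRELOAD approach are my assumptions, not necessarily how
it was wired into qemu here:

# build gperftools 2.4 from the upstream source tarball
tar xzf gperftools-2.4.tar.gz
cd gperftools-2.4
./configure && make && sudo make install   # installs under /usr/local/lib by default

# quick test: preload the new tcmalloc into a hand-started qemu process
LD_PRELOAD=/usr/local/lib/libtcmalloc.so.4 qemu-system-x86_64 ...(usual options)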



- Mail original -
De: Irek Fasikhov malm...@gmail.com
À: aderumier aderum...@odiso.com
Cc: Robert LeBlanc rob...@leblancnet.us, ceph-devel 
ceph-de...@vger.kernel.org, pushpesh sharma pushpesh@gmail.com, 
ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 10 Juin 2015 07:21:42
Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi, Alexandre. 

Very good work! 
Do you have a rpm-file? 
Thanks. 

2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER  aderum...@odiso.com  : 


Hi, 

I have tested qemu with last tcmalloc 2.4, and the improvement is huge with 
iothread: 50k iops (+45%) ! 



qemu : no-iothread : glibc : iops=33395
qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
qemu : no-iothread : jemalloc : iops=42226 (+26%)
qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)


qemu : iothread : glibc : iops=34516
qemu : iothread : tcmalloc : iops=38676 (+12%)
qemu : iothread : jemalloc : iops=28023 (-19%)
qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)





qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
-- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 
00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 
2015 
read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec 
slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58 
clat (usec): min=128, max=6262, avg=631.41, stdev=197.71 
lat (usec): min=149, max=6265, avg=635.27, stdev=197.40 
clat percentiles (usec): 
| 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474], 
| 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652], 
| 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980], 
| 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896], 
| 99.99th=[ 3760] 
bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87 
lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63% 
lat (msec) : 2=4.46%, 4=0.03%, 10=0.01% 
cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, 
mint=26070msec, maxt=26070msec 

Disk stats (read/write): 
vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 






rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 
00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 
2015 
read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec 
slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35 
clat (usec): min=191, max=4740, avg=884.66, stdev=315.65 
lat (usec): min=289, max=4743, avg=888.31, stdev=315.51 
clat percentiles (usec): 
| 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596], 
| 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940], 
| 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416], 
| 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640], 
| 99.99th=[ 3632] 
bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39 
lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01% 
lat (msec) : 2=29.74%, 4=1.07%, 10=0.01% 
cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, 
mint=36435msec, maxt=36435msec 

Disk stats (read/write): 
vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 


- Mail original - 
De: aderumier  aderum...@odiso.com  
À: Robert LeBlanc  rob...@leblancnet.us  
Cc: Mark Nelson  mnel...@redhat.com , ceph-devel  
ceph-de...@vger.kernel.org , pushpesh sharma  pushpesh@gmail.com , 
ceph-users  ceph-users@lists.ceph.com  
Envoyé: Mardi 9 Juin 2015 18:47:27 
Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 

Hi Robert, 

What I 

Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread Jason Dillaman
 In the past we've hit some performance issues with RBD cache that we've
 fixed, but we've never really tried pushing a single VM beyond 40+K read
 IOPS in testing (or at least I never have).  I suspect there's a couple
 of possibilities as to why it might be slower, but perhaps joshd can
 chime in as he's more familiar with what that code looks like.
 

At high queue-depths and high IOPS, I would suspect that the bottleneck is the 
single, coarse-grained mutex protecting the cache data structures.  It's been a 
back burner item to refactor the current cache mutex into finer-grained locks.
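
A cheap way to sanity-check that the cache path is what tops out around 40k
would be to rerun the same fio job with the client-side cache disabled and
compare -- just a sketch; where this is read depends on how the client picks
up ceph.conf:

# client/hypervisor ceph.conf, for the comparison run only
[client]
    rbd cache = false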

Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Nginx access ceph

2015-06-09 Thread Ram Chander
Hi,

I am trying to set up nginx to access HTML files in Ceph buckets.
I have set up https://github.com/anomalizer/ngx_aws_auth

Below is the nginx config. When I try to access

http://hostname:8080/test/b.html - it shows a signature mismatch.
http://hostname:8080/b.html - it shows a signature mismatch.

I can see the request being passed from nginx to Ceph in the Ceph logs.



server {
    listen       8080;
    server_name  localhost;

    location / {

        proxy_pass http://10.84.182.80:8080;

        aws_access_key GMO31LL1LECV1RH4T71K;
        aws_secret_key aXEf9e1Aq85VTz7Q5tkXeq4qZaEtnYP04vSTIFBB;
        s3_bucket test;
        set $url_full '$1';
        chop_prefix /test;

        proxy_set_header Authorization $s3_auth_token;
        proxy_set_header x-amz-date $aws_date;
    }
}

I have set the Ceph bucket as public (not private).
Any help would be kindly appreciated.


http://pastebin.com/Lhyhk7xk
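
If it helps, one way to see why the signatures differ is to turn up RGW debug
logging for a single test request and compare what RGW signs with what the
nginx module signs -- a sketch, assuming the usual client.radosgw.* section
name and default log location:

# on the RGW host, add temporarily to ceph.conf and restart radosgw
[client.radosgw.gateway]
    debug rgw = 20
    debug ms = 1

# repeat the request and watch for the auth/signature lines
tail -f /var/log/ceph/radosgw.log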



Regards,
Ram
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-09 Thread Alexandre DERUMIER
Hi,

I have tested qemu with the latest tcmalloc (2.4), and the improvement is huge
with iothread: 50k iops (+45%)!



qemu : no-iothread : glibc : iops=33395
qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
qemu : no-iothread : jemalloc : iops=42226 (+26%)
qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)


qemu : iothread : glibc : iops=34516
qemu : iothread : tcmalloc : iops=38676 (+12%)
qemu : iothread : jemalloc : iops=28023 (-19%)
qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)





qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
--
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 
00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 
2015
  read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
 lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
clat percentiles (usec):
 |  1.00th=[  318],  5.00th=[  378], 10.00th=[  418], 20.00th=[  474],
 | 30.00th=[  516], 40.00th=[  564], 50.00th=[  612], 60.00th=[  652],
 | 70.00th=[  700], 80.00th=[  756], 90.00th=[  860], 95.00th=[  980],
 | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
 | 99.99th=[ 3760]
bw (KB  /s): min=145608, max=249688, per=100.00%, avg=201108.00, 
stdev=21718.87
lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
  cpu  : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
 issued: total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, 
mint=26070msec, maxt=26070msec

Disk stats (read/write):
  vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73%






rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 
00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 
2015
  read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
 lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
clat percentiles (usec):
 |  1.00th=[  462],  5.00th=[  516], 10.00th=[  548], 20.00th=[  596],
 | 30.00th=[  652], 40.00th=[  764], 50.00th=[  868], 60.00th=[  940],
 | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
 | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
 | 99.99th=[ 3632]
bw (KB  /s): min=98352, max=177328, per=99.91%, avg=143772.11, 
stdev=21782.39
lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
  cpu  : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
 issued: total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, 
mint=36435msec, maxt=36435msec

Disk stats (read/write):
  vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85%
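
For reference, the job file behind runs like these is presumably something
along these lines -- reconstructed from the output above (4k random reads,
libaio, iodepth 32 against /dev/vdb); the exact options, direct=1 in
particular, are a guess:

[rbd_iodepth32-test]
ioengine=libaio
direct=1
filename=/dev/vdb
rw=randread
bs=4k
iodepth=32
size=5120m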


- Mail original -
De: aderumier aderum...@odiso.com
À: Robert LeBlanc rob...@leblancnet.us
Cc: Mark Nelson mnel...@redhat.com, ceph-devel 
ceph-de...@vger.kernel.org, pushpesh sharma pushpesh@gmail.com, 
ceph-users ceph-users@lists.ceph.com
Envoyé: Mardi 9 Juin 2015 18:47:27
Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi Robert, 

What I found was that Ceph OSDs performed well with either 
tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
instead of tcmalloc, I'm still working to dig into why that might be 
the case). 
yes, from my test, for the osd tcmalloc is a little faster (but very little)
than jemalloc.



However, I found that tcmalloc with QEMU/KVM was very detrimental to 
small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
better for QEMU/KVM in the tests that we ran.