Re: [ceph-users] ceph same rbd on multiple client

2015-10-22 Thread gjprabu
Hi Frederic,

 

   Could you please suggest a solution? We are spending a lot of time trying to
resolve this issue.



Regards

Prabu









  On Thu, 15 Oct 2015 17:14:13 +0530 Tyler Bishop 
 wrote 




I don't know enough about OCFS2 to help. It sounds like your writes are not being
coordinated between the clients, though.

Sent from TypeMail


On Oct 15, 2015, at 1:53 AM, gjprabu  wrote:

Hi Tyler,



   Could you please send me the next steps to be taken on this issue?



Regards

Prabu






 On Wed, 14 Oct 2015 13:43:29 +0530 gjprabu  
wrote 




Hi Tyler,



 Thanks for your reply. We have disabled rbd_cache, but the issue still
persists. Please find our configuration file below.



# cat /etc/ceph/ceph.conf

[global]

fsid = 944fa0af-b7be-45a9-93ff-b9907cfaee3f

mon_initial_members = integ-hm5, integ-hm6, integ-hm7

mon_host = 192.168.112.192,192.168.112.193,192.168.112.194

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

filestore_xattr_use_omap = true

osd_pool_default_size = 2



[mon]

mon_clock_drift_allowed = .500



[client]

rbd_cache = false



--
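One way to double-check that a running librbd client actually picked up rbd_cache = false is to query its admin socket. This is only a hedged sketch: the socket path below is illustrative, and an admin socket must have been enabled for the client (for example, admin socket = /var/run/ceph/$cluster-$type.$id.$pid.asok in the [client] section). Note that the kernel rbd client used by "rbd map" does not use librbd's rbd_cache at all, so this check only applies to librbd consumers such as QEMU.

# run on the client machine; the socket name will vary
ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config show | grep rbd_cache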



 cluster 944fa0af-b7be-45a9-93ff-b9907cfaee3f

 health HEALTH_OK

 monmap e2: 3 mons at 
{integ-hm5=192.168.112.192:6789/0,integ-hm6=192.168.112.193:6789/0,integ-hm7=192.168.112.194:6789/0}

election epoch 480, quorum 0,1,2 integ-hm5,integ-hm6,integ-hm7

 osdmap e49780: 2 osds: 2 up, 2 in

  pgmap v2256565: 190 pgs, 2 pools, 1364 GB data, 410 kobjects

2559 GB used, 21106 GB / 24921 GB avail

 190 active+clean

  client io 373 kB/s rd, 13910 B/s wr, 103 op/s





Regards

Prabu



  On Tue, 13 Oct 2015 19:59:38 +0530 Tyler Bishop 
 wrote 




You need to disable RBD caching.







 Tyler Bishop
Chief Technical Officer
 513-299-7108 x10
 
tyler.bis...@beyondhosting.net

 
 

 









From: "gjprabu" 

To: "Frédéric Nass" 

Cc: "" , 
"Siva Sokkumuthu" , "Kamal Kannan 
Subramani(kamalakannan)" 

Sent: Tuesday, October 13, 2015 9:11:30 AM

Subject: Re: [ceph-users] ceph same rbd on multiple client




Hi ,




 We have servers with an OCFS2 filesystem mounted on top of a Ceph RBD image.
While moving a folder on one node, the other nodes simultaneously hit I/O errors
on the replicated data, as shown below (copying does not cause any problem). As a
workaround, remounting the partition resolves the issue, but after some time the
problem reoccurs. Please help with this issue.



Note: We have 5 nodes in total; two nodes work fine, while the other nodes show
input/output errors like the ones below on the moved data (a short diagnostic
sketch follows the listing).



ls -althr 

ls: cannot access LITE_3_0_M4_1_TEST: Input/output error 

ls: cannot access LITE_3_0_M4_1_OLD: Input/output error 

total 0 

d? ? ? ? ? ? LITE_3_0_M4_1_TEST 

d? ? ? ? ? ? LITE_3_0_M4_1_OLD 
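When OCFS2 nodes disagree like this, the first things worth checking are usually the O2CB cluster stack and the kernel log on an affected node. A hedged sketch (assumes the stock o2cb init script and ocfs2-tools are installed; exact commands vary by distribution):

/etc/init.d/o2cb status          # is the cluster stack online and heartbeating?
mounted.ocfs2 -f                 # which nodes currently have the volume mounted?
dmesg | grep -i ocfs2            # look for heartbeat, fencing, or EIO messages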



Regards

Prabu






 On Fri, 22 May 2015 17:33:04 +0530 Frédéric Nass 
 wrote 




Hi,



While waiting for CephFS, you can use a clustered filesystem such as OCFS2 or GFS2
on top of the RBD mappings, so that each host can access the same device through a
cluster-aware filesystem.



Regards,



Frédéric.



Le 21/05/2015 16:10, gjprabu a écrit :





-- Frédéric Nass, Infrastructure Sub-directorate, Direction du Numérique (Digital
Services), Université de Lorraine. Tel: 03.83.68.53.83
___ 

ceph-users mailing list 

ceph-users@lists.ceph.com 

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


Hi All,



We are using RBD and map the same RBD image to an RBD device on two different
clients, but data written from one client is not visible on the other until we
umount and mount the partition again (mount -a). Kindly share a solution for this
issue.



Example

create rbd image named foo

map foo to /dev/rbd0 on server A,   mount /dev/rbd0 to /mnt

map foo to /dev/rbd0 on server B,   mount /dev/rbd0 to /mnt
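That behaviour is expected with a regular (non-cluster) filesystem such as ext4 or XFS: each client caches metadata and data independently, and nothing coordinates the two mounts, so writes from one server are invisible (or corrupting) on the other. As Frédéric suggests above, a cluster filesystem is needed on top of the shared image. A hedged sketch of the OCFS2 variant (assumes an o2cb cluster is already configured on both servers; sizes and names are illustrative):

rbd create foo --size 102400          # 100 GB image
rbd map foo                           # run on server A and server B -> /dev/rbd0
mkfs.ocfs2 -L shared /dev/rbd0        # run once, on one node only
mount -t ocfs2 /dev/rbd0 /mnt         # then mount on both nodes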



Regards

Prabu








___ ceph-users mailing list 
ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



















___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-22 Thread hzwuli...@gmail.com
Yeah, you are right. Testing the RBD volume from the host is fine.

Now, at least we can confirm it's a QEMU or KVM problem, not Ceph.



hzwuli...@gmail.com
 
From: Alexandre DERUMIER
Date: 2015-10-23 12:51
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
>>Anyway, i could try to collect somthing, maybe there are some clues. 
 
And you don't have problem to read/write to this rbd from host with fio-rbd ? 
(try a read full the rbd volume for example)
 
- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Vendredi 23 Octobre 2015 06:42:41
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
 
Oh, no, from the phenomenon. IO in VM is wait for the host to completion. 
The CPU wait in VM is very high. 
Anyway, i could try to collect somthing, maybe there are some clues. 
 
 
hzwuli...@gmail.com 
 
 
 
From: Alexandre DERUMIER 
Date: 2015-10-23 12:39 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Do you have tried to use perf inside the faulty guest too ? 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Vendredi 23 Octobre 2015 06:15:07 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
btw, we use perf to track the process qemu-system-x86(15801), there is an 
abnormal function: 
Samples: 1M of event 'cycles', Event count (approx.): 1057109744252 
- 75.23% qemu-system-x86 [kernel.kallsyms] [k] do_raw_spin_lock 
- do_raw_spin_lock 
+ 54.44% 0x7fc79fc769d9 + 45.31% 0x7fc79fc769ab 
So, maybe it's the kvm problem? 
hzwuli...@gmail.com 
From: hzwuli...@gmail.com 
Date: 2015-10-23 11:54 
To: Alexandre DERUMIER 
CC: ceph-users 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu 
librbd 
Hi, list 
We still stuck on this problem, when this problem comes, the CPU usage of 
qemu-system-x86 if very high(1420): 
15801 libvirt- 20 0 33.7g 1.4g 11m R 1420 0.6 1322:26 qemu-system-x86 
quem-system-x86 process 15801 is responsible for the VM. 
Anyone has ever run into this problem also. 
hzwuli...@gmail.com 
From: hzwuli...@gmail.com 
Date: 2015-10-22 10:15 
To: Alexandre DERUMIER 
CC: ceph-users 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu 
librbd 
Hi, 
Sure, all those could help, but not so much -:) 
Now, we find it's the VM problem. CPU on the host is very high. 
We create a new VM could solve this problem, but don't know why until now. 
Here is the detail version info: 
Compiled against library: libvirt 1.2.9 
Using library: libvirt 1.2.9 
Using API: QEMU 1.2.9 
Running hypervisor: QEMU 2.1.2 
Are there any already know bugs about those version? 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-21 18:38 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
here a libvirt sample to enable iothreads: 
 
2 
 
 
 
 
 
 
 
 
 
 
 
With this, you can scale with multiple disks. (but it should help a little bit 
with 1 disk too) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 10:31:56 
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 
qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM 
images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 
hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 
vm config: 
 
 
 
 
 
 
 
 
 
 
 
*** 
 
 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your host os, guest os , qemu version and vm config ? 
As an extra boost, you could enable iothread on virtio disk. 
(It's available on libvirt but not on openstack yet). 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor 
https://www.proxmox.com 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 06:11:20 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
Thanks for you reply. 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[globa

Re: [ceph-users] ceph-fuse and its memory usage

2015-10-22 Thread Goncalo Borges

Thank you! It seems there is an explanation for the behavior, which is good!

On 10/23/2015 12:44 AM, Gregory Farnum wrote:

On Thu, Oct 22, 2015 at 1:59 AM, Yan, Zheng  wrote:

Direct IO only bypasses the kernel page cache; data can still be cached in
ceph-fuse. If I'm correct, the test repeatedly writes data to 8M
files, and the cache coalesces the multiple writes into a single OSD
write.

Ugh, of course. I don't see a tracker ticket for that, so I made one:
http://tracker.ceph.com/issues/13569
-Greg
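A hedged sketch of the kind of test being discussed (mount path and sizes are illustrative): even with O_DIRECT the write only bypasses the kernel page cache, not the ceph-fuse object cache, so repeated small writes to the same file can still reach the OSD as a single write.

# repeatedly rewrite an 8M file with direct IO; ceph-fuse may still coalesce these
for i in $(seq 10); do dd if=/dev/zero of=/mnt/cephfs/testfile bs=8M count=1 oflag=direct conv=notrunc 2>/dev/null; done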


--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-22 Thread Alexandre DERUMIER
>>Anyway, i could try to collect somthing, maybe there are some clues. 

And you don't have any problem reading/writing to this RBD from the host with fio-rbd?
(Try a full read of the RBD volume, for example.)
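For reference, a hedged fio job for the full-read check suggested here; it reuses the pool and volume names from the job file quoted later in this thread and is only a sketch:

[full-read-check]
ioengine=rbd
clientname=admin
pool=vol_ssd
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f
rw=read
bs=4M
iodepth=32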

- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Vendredi 23 Octobre 2015 06:42:41
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Oh, no, from the phenomenon. IO in VM is wait for the host to completion. 
The CPU wait in VM is very high. 
Anyway, i could try to collect somthing, maybe there are some clues. 


hzwuli...@gmail.com 



From: Alexandre DERUMIER 
Date: 2015-10-23 12:39 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Do you have tried to use perf inside the faulty guest too ? 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Vendredi 23 Octobre 2015 06:15:07 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
btw, we use perf to track the process qemu-system-x86(15801), there is an 
abnormal function: 
Samples: 1M of event 'cycles', Event count (approx.): 1057109744252 
- 75.23% qemu-system-x86 [kernel.kallsyms] [k] do_raw_spin_lock 
- do_raw_spin_lock 
+ 54.44% 0x7fc79fc769d9 + 45.31% 0x7fc79fc769ab 
So, maybe it's the kvm problem? 
hzwuli...@gmail.com 
From: hzwuli...@gmail.com 
Date: 2015-10-23 11:54 
To: Alexandre DERUMIER 
CC: ceph-users 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu 
librbd 
Hi, list 
We still stuck on this problem, when this problem comes, the CPU usage of 
qemu-system-x86 if very high(1420): 
15801 libvirt- 20 0 33.7g 1.4g 11m R 1420 0.6 1322:26 qemu-system-x86 
quem-system-x86 process 15801 is responsible for the VM. 
Anyone has ever run into this problem also. 
hzwuli...@gmail.com 
From: hzwuli...@gmail.com 
Date: 2015-10-22 10:15 
To: Alexandre DERUMIER 
CC: ceph-users 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu 
librbd 
Hi, 
Sure, all those could help, but not so much -:) 
Now, we find it's the VM problem. CPU on the host is very high. 
We create a new VM could solve this problem, but don't know why until now. 
Here is the detail version info: 
Compiled against library: libvirt 1.2.9 
Using library: libvirt 1.2.9 
Using API: QEMU 1.2.9 
Running hypervisor: QEMU 2.1.2 
Are there any already know bugs about those version? 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-21 18:38 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
here a libvirt sample to enable iothreads: 
 
2 
 
 
 
 
 
 
 
 
 
 
 
With this, you can scale with multiple disks. (but it should help a little bit 
with 1 disk too) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 10:31:56 
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 
qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM 
images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 
hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 
vm config: 
 
 
 
 
 
 
 
 
 
 
 
*** 
 
 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your host os, guest os , qemu version and vm config ? 
As an extra boost, you could enable iothread on virtio disk. 
(It's available on libvirt but not on openstack yet). 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor 
https://www.proxmox.com 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 06:11:20 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
Thanks for you reply. 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 
[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 
could achive about 18k iops. 
2. test the same volume in VM, a

Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-22 Thread hzwuli...@gmail.com
Oh no. Judging from the symptoms, IO in the VM is waiting on the host to complete,
and the CPU iowait in the VM is very high.
Anyway, I can try to collect something; maybe there are some clues.



hzwuli...@gmail.com
 
From: Alexandre DERUMIER
Date: 2015-10-23 12:39
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
Do you have tried to use perf inside the faulty guest too ?
 
 
- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Vendredi 23 Octobre 2015 06:15:07
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
 
btw, we use perf to track the process qemu-system-x86(15801), there is an 
abnormal function: 
 
Samples: 1M of event 'cycles', Event count (approx.): 1057109744252 
- 75.23% qemu-system-x86 [kernel.kallsyms] [k] do_raw_spin_lock 
- do_raw_spin_lock 
+ 54.44% 0x7fc79fc769d9 + 45.31% 0x7fc79fc769ab 
 
So, maybe it's the kvm problem? 
 
 
hzwuli...@gmail.com 
 
 
 
From: hzwuli...@gmail.com 
Date: 2015-10-23 11:54 
To: Alexandre DERUMIER 
CC: ceph-users 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu 
librbd 
Hi, list 
 
We still stuck on this problem, when this problem comes, the CPU usage of 
qemu-system-x86 if very high(1420): 
 
15801 libvirt- 20 0 33.7g 1.4g 11m R 1420 0.6 1322:26 qemu-system-x86 
 
quem-system-x86 process 15801 is responsible for the VM. 
 
Anyone has ever run into this problem also. 
 
hzwuli...@gmail.com 
 
 
From: hzwuli...@gmail.com 
Date: 2015-10-22 10:15 
To: Alexandre DERUMIER 
CC: ceph-users 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu 
librbd 
Hi, 
 
Sure, all those could help, but not so much -:) 
Now, we find it's the VM problem. CPU on the host is very high. 
 
We create a new VM could solve this problem, but don't know why until now. 
 
Here is the detail version info: 
 
Compiled against library: libvirt 1.2.9 
Using library: libvirt 1.2.9 
Using API: QEMU 1.2.9 
Running hypervisor: QEMU 2.1.2 
 
Are there any already know bugs about those version? 
 
Thanks! 
 
 
hzwuli...@gmail.com 
 
 
From: Alexandre DERUMIER 
Date: 2015-10-21 18:38 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
here a libvirt sample to enable iothreads: 
 
2 
 
 
 
 
 
 
 
 
 
 
 
With this, you can scale with multiple disks. (but it should help a little bit 
with 1 disk too) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 10:31:56 
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 
qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM 
images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 
hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 
vm config: 
 
 
 
 
 
 
 
 
 
 
 
*** 
 
 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your host os, guest os , qemu version and vm config ? 
As an extra boost, you could enable iothread on virtio disk. 
(It's available on libvirt but not on openstack yet). 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor 
https://www.proxmox.com 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 06:11:20 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
Thanks for you reply. 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 
[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 
could achive about 18k iops. 
2. test the same volume in VM, achive about 4.3k iops 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 
[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
iodepth=32 
[task4] 
iodepth=32 
Using cep osd perf to check the osd latency, all less than 1 ms. 
Using iostat to check the osd %util, about 10 in case 2 tes

Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-22 Thread Alexandre DERUMIER
Have you tried using perf inside the faulty guest too?


- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Vendredi 23 Octobre 2015 06:15:07
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

btw, we use perf to track the process qemu-system-x86(15801), there is an 
abnormal function: 

Samples: 1M of event 'cycles', Event count (approx.): 1057109744252 
- 75.23% qemu-system-x86 [kernel.kallsyms] [k] do_raw_spin_lock 
- do_raw_spin_lock 
+ 54.44% 0x7fc79fc769d9 + 45.31% 0x7fc79fc769ab 

So, maybe it's the kvm problem? 


hzwuli...@gmail.com 



From: hzwuli...@gmail.com 
Date: 2015-10-23 11:54 
To: Alexandre DERUMIER 
CC: ceph-users 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu 
librbd 
Hi, list 

We still stuck on this problem, when this problem comes, the CPU usage of 
qemu-system-x86 if very high(1420): 

15801 libvirt- 20 0 33.7g 1.4g 11m R 1420 0.6 1322:26 qemu-system-x86 

quem-system-x86 process 15801 is responsible for the VM. 

Anyone has ever run into this problem also. 

hzwuli...@gmail.com 


From: hzwuli...@gmail.com 
Date: 2015-10-22 10:15 
To: Alexandre DERUMIER 
CC: ceph-users 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu 
librbd 
Hi, 

Sure, all those could help, but not so much -:) 
Now, we find it's the VM problem. CPU on the host is very high. 

We create a new VM could solve this problem, but don't know why until now. 

Here is the detail version info: 

Compiled against library: libvirt 1.2.9 
Using library: libvirt 1.2.9 
Using API: QEMU 1.2.9 
Running hypervisor: QEMU 2.1.2 

Are there any already know bugs about those version? 

Thanks! 


hzwuli...@gmail.com 


From: Alexandre DERUMIER 
Date: 2015-10-21 18:38 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
here a libvirt sample to enable iothreads: 
 
2 
 
 
 
 
 
 
 
 
 
 
 
With this, you can scale with multiple disks. (but it should help a little bit 
with 1 disk too) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 10:31:56 
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 
qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM 
images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 
hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 
vm config: 
 
 
 
 
 
 
 
 
 
 
 
*** 
 
 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your host os, guest os , qemu version and vm config ? 
As an extra boost, you could enable iothread on virtio disk. 
(It's available on libvirt but not on openstack yet). 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor 
https://www.proxmox.com 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 06:11:20 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
Thanks for you reply. 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 
[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 
could achive about 18k iops. 
2. test the same volume in VM, achive about 4.3k iops 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 
[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
iodepth=32 
[task4] 
iodepth=32 
Using cep osd perf to check the osd latency, all less than 1 ms. 
Using iostat to check the osd %util, about 10 in case 2 test. 
Using dstat to check VM status: 
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system-- 
usr sys idl wai hiq siq| read writ| recv send| in out | int csw 
2 4 51 43 0 0| 0 17M| 997B 3733B| 0 0 |3476 6997 
2 5 51 43 0 0| 0 18M| 714B 4335B| 0 0 |3439 6915 
2 5 50 43 0 0| 0 17M| 594B 3150B| 0 0 |3294 6617 
1 3 52 44 0 0| 0 18M| 648B 3726B| 0 0 |3447 6991 
1 5 51 4

Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-22 Thread hzwuli...@gmail.com
BTW, we used perf to profile the qemu-system-x86 process (15801), and there is one
abnormally hot function:

Samples: 1M of event 'cycles', Event count (approx.): 1057109744252
-  75.23%  qemu-system-x86  [kernel.kallsyms]  [k] do_raw_spin_lock
   - do_raw_spin_lock
      + 54.44% 0x7fc79fc769d9
      + 45.31% 0x7fc79fc769ab

So maybe it's a KVM problem?
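To get past the raw 0x7fc7... addresses, the qemu binary's debug symbols are needed; a hedged sketch of the usual next steps on the host (package names for debug symbols vary by distribution):

perf record -g -p 15801 -- sleep 30    # capture call graphs for the busy qemu process
perf report                            # shows which code paths hammer do_raw_spin_lock
perf top -p 15801                      # live view of the same thing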



hzwuli...@gmail.com
 
From: hzwuli...@gmail.com
Date: 2015-10-23 11:54
To: Alexandre DERUMIER
CC: ceph-users
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
Hi, list

We still stuck on this problem, when this problem comes, the CPU usage of 
qemu-system-x86 if very high(1420):

15801 libvirt-  20   0 33.7g 1.4g  11m R  1420  0.6   1322:26 qemu-system-x86   
 

quem-system-x86 process 15801 is responsible for the VM.

Anyone has ever run into this problem also.


hzwuli...@gmail.com
 
From: hzwuli...@gmail.com
Date: 2015-10-22 10:15
To: Alexandre DERUMIER
CC: ceph-users
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
Hi, 

Sure, all those could help, but not so much -:)
Now, we find it's the VM problem. CPU on the host is very high.

We create a new VM could solve this problem, but don't know why until now.

Here is the detail version info:

Compiled against library: libvirt 1.2.9
Using library: libvirt 1.2.9
Using API: QEMU 1.2.9
Running hypervisor: QEMU 2.1.2

Are there any already know bugs about those version?

Thanks!



hzwuli...@gmail.com
 
From: Alexandre DERUMIER
Date: 2015-10-21 18:38
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
here a libvirt sample to enable iothreads:
 

[The libvirt XML sample, an <iothreads>2</iothreads> element plus per-disk driver
settings, was stripped by the list archive; a hedged reconstruction follows below.]
With this, you can scale with multiple disks. (but it should help a little bit 
with 1 disk too)
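Since the XML itself was stripped by the archive, here is a hedged reconstruction based on standard libvirt syntax for iothreads (libvirt >= 1.2.8 / QEMU >= 2.1); the pool/volume and monitor host names are placeholders, and cephx auth elements are omitted:

<domain>
  ...
  <iothreads>2</iothreads>
  <devices>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none' iothread='1'/>
      <source protocol='rbd' name='pool/volume'>
        <host name='mon-host' port='6789'/>
      </source>
      <target dev='vda' bus='virtio'/>
    </disk>
    <!-- additional disks can be pinned to iothread='2', etc. -->
  </devices>
</domain>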
 
 
- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Mercredi 21 Octobre 2015 10:31:56
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
 
Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 
 
qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM 
images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 
hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 
 
vm config: 
 
 
 
 
 
 
 
 
 
 
 
*** 
 
 
 
 
Thanks! 
 
hzwuli...@gmail.com 
 
 
 
From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your host os, guest os , qemu version and vm config ? 
As an extra boost, you could enable iothread on virtio disk. 
(It's available on libvirt but not on openstack yet). 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor 
https://www.proxmox.com 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 06:11:20 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
Thanks for you reply. 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 
[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 
could achive about 18k iops. 
2. test the same volume in VM, achive about 4.3k iops 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 
[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
io

Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread Ryan Tokarek

> On Oct 22, 2015, at 10:19 PM, John-Paul Robinson  wrote:
> 
> A few clarifications on our experience:
> 
> * We have 200+ rbd images mounted on our RBD-NFS gateway.  (There's
> nothing easier for a user to understand than "your disk is full".)

Same here, and agreed. It sounds like our situations are similar except for my 
blocking on an apparently healthy cluster issue. 

> * I'd expect more contention potential with a single shared RBD back
> end, but with many distinct and presumably isolated backend RBD images,
> I've always been surprised that *all* the nfsd task hang.  This leads me
> to think  it's an nfsd issue rather than and rbd issue.  (I realize this
> is an rbd list, looking for shared experience. ;) )

It's definitely possible; I've experienced exactly the behavior you're seeing.
My guess is that when an nfsd thread blocks and goes dark, affected clients
(even if it's only one) retransmit their requests thinking there's a network
issue, causing more nfsds to go dark until all the server threads are stuck
(that could be hogwash, but it fits the behavior). Or perhaps there are enough
individual clients writing to the affected NFS volume that they consume all the
available nfsd threads (I'm not sure about your client-to-FS and nfsd thread
ratio, but that is plausible in my situation). I think some testing with
xfs_freeze on a non-critical NFS server/client pair is called for.
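A hedged sketch of such a test against a non-critical export (assumes XFS; the mount point is illustrative):

xfs_freeze -f /export/test              # freeze: writes to this FS now block
ps -eo stat,pid,comm | grep '^D'        # watch how many nfsd threads drop into D state
xfs_freeze -u /export/test              # thaw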

I don't think this part is related to ceph except that it happens to be 
providing the underlying storage. I'm fairly certain that my problems with an 
apparently healthy cluster blocking writes is a ceph problem, but I haven't 
figured out what the source of that is. 

> * I haven't seen any difference between reads and writes.  Any access to
> any backing RBD store from the NFS client hangs.

All NFS clients are hung, but in my situation, it's usually only 1-3 local file 
systems that stop accepting writes. NFS is completely unresponsive, but local 
and remote-samba operations on the unaffected file systems are totally happy. 

I don't have a solution to the NFS issue, but I've seen it all too often. I wonder
whether setting a huge number of threads and/or playing with client retransmit
times would help, but I suspect this problem is just intrinsic to Linux NFS
servers.

Ryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Network performance

2015-10-22 Thread Jonas Björklund


On Thu, 22 Oct 2015, Udo Lembke wrote:


Hi Jonas,
you can create a bond over multiple NICs (which modes are possible depends on your
switch) to use one IP address but more than one NIC.


Yes, but if an OSD could listen on all IP addresses on the server and use
all NICs, we could get more performance and high availability without bonding.
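Today an OSD binds within a single public network (and optionally a separate cluster network) rather than on every local IP, so short of bonding, the usual way to put two NICs to work is to split client and replication traffic. A hedged ceph.conf sketch (subnets are illustrative):

[global]
public network  = 192.168.1.0/24    # client/monitor traffic on NIC 1
cluster network = 192.168.2.0/24    # OSD replication/backfill traffic on NIC 2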


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Core dump when running OSD service

2015-10-22 Thread James O'Neill

Hi David,

Thank you for your suggestion. Unfortunately, I did not understand what
was involved, and in the process of trying to figure it out I think I
made it worse. Thankfully it's just a test environment, so I rebuilt
all the Ceph servers involved and now it's working.


Regards,
James

On Fri, Oct 23, 2015 at 11:18 AM, David Zafman  
wrote:


I was focused on fixing the OSD, but you need to determine if some 
misconfiguration or hardware issue caused a filesystem corruption.


David

On 10/22/15 3:08 PM, David Zafman wrote:


There is a corruption of the osdmaps on this particular OSD.  You 
need determine which maps are bad probably by bumping the osd debug 
level to 20.  Then transfer them from a working OSD.  The newest 
ceph-objectstore-tool has features to write the maps, but you'll 
need to build a version based on a v0.94.4 source tree.  I don't 
know if you can just copy files with names like 
"current/meta/osdmap.8__0_FD6E4D61__none" (map for epoch 8) between 
OSDs.


David

On 10/21/15 8:54 PM, James O'Neill wrote:
I have an OSD that didn't come up after a reboot. I was getting the 
error show below. it was running 0.94.3 so I reinstalled all 
packages. I then upgraded everything to 0.94.4 hoping that would 
fix it but it hasn't. There are three OSDs, this is the only one 
having problems (it also contains the inconsistent pgs). Can anyone 
tell me what the problem might be?



root@dbp-ceph03:/srv/data# ceph status
   cluster 4f6fb784-bd17-4105-a689-e8d1b4bc5643
health HEALTH_ERR
   53 pgs inconsistent
   542 pgs stale
   542 pgs stuck stale
   5 requests are blocked > 32 sec
   85 scrub errors
   too many PGs per OSD (544 > max 300)
   noout flag(s) set
monmap e3: 3 mons at 
{dbp-ceph01=172.17.241.161:6789/0,dbp-ceph02=172.17.241.162:6789/0,dbp-ceph03=172.17.241.163:6789/0}
   election epoch 52, quorum 0,1,2 
dbp-ceph01,dbp-ceph02,dbp-ceph03

osdmap e107: 2 osds: 2 up, 2 in
   flags noout
 pgmap v65678: 1088 pgs, 9 pools, 55199 kB data, 173 objects
   2265 MB used, 16580 MB / 19901 MB avail
546 active+clean
489 stale+active+clean
 53 stale+active+clean+inconsistent


root@dbp-ceph02:~# /usr/bin/ceph-osd --cluster=ceph -i 1 -d
2015-10-22 14:15:48.312507 7f4edabec900 0 ceph version 0.94.4 
(95292699291242794510b39ffde3f4df67898d3a), process ceph-osd, pid 
31215
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 
/var/lib/ceph/osd/ceph-1/journal
2015-10-22 14:15:48.352013 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) backend generic (magic 0xef53)
2015-10-22 14:15:48.355621 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is supported and appears to work
2015-10-22 14:15:48.355655 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-10-22 14:15:48.362016 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
syncfs(2) syscall fully supported (by glibc and kernel)
2015-10-22 14:15:48.372819 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) limited size xattrs
2015-10-22 14:15:48.387002 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD 
journal mode: checkpoint is not enabled
2015-10-22 14:15:48.394002 7f4edabec900 -1 journal 
FileJournal::_open: disabling aio for non-block journal. Use 
journal_force_aio to force use of aio anyway
2015-10-22 14:15:48.397803 7f4edabec900 0  
cls/hello/cls_hello.cc:271: loading cls_hello
terminate called after throwing an instance of 
'ceph::buffer::end_of_buffer'

 what(): buffer::end_of_buffer
*** Caught signal (Aborted) **
in thread 7f4edabec900
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb8560f]
12: (OSDService::try_get_map(unsigned int)+0x530) [0x6ac2c0]
13: (OSDService::get_map(unsigned int)+0xe) [0x70ad2e]
14: (OSD::init()+0x6ad) [0x6c5e0d]
15: (main()+0x2860) [0x6527e0]
16: (__libc_start_main()+0xf5) [0x7f4ed7d2aec5]
17: /usr/bin/ceph-osd() [0x66b887]
2015-10-22 14:15:48.412520 7f4edabec900 -1 *** Caught signal 
(Aborted) **

in thread 7f4edabec900

ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_termin

Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-22 Thread hzwuli...@gmail.com
Hi, list

We are still stuck on this problem. When it occurs, the CPU usage of
qemu-system-x86 is very high (1420%):

15801 libvirt-  20   0 33.7g 1.4g  11m R  1420  0.6   1322:26 qemu-system-x86

The qemu-system-x86 process 15801 is the one backing this VM.

Has anyone else run into this problem?
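When it is in that state, it may help to see whether the 1420% is spread across many vCPU threads or concentrated in one; a hedged sketch (pidstat comes from the sysstat package):

top -H -p 15801          # per-thread CPU view of the qemu process
pidstat -t -p 15801 1    # same idea, sampled once per second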


hzwuli...@gmail.com
 
From: hzwuli...@gmail.com
Date: 2015-10-22 10:15
To: Alexandre DERUMIER
CC: ceph-users
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
Hi, 

Sure, all of those could help, but not by much. :-)
Now we find it's a problem with the VM itself; CPU usage on the host is very high.

Creating a new VM works around the problem, but we don't know why yet.

Here is the detail version info:

Compiled against library: libvirt 1.2.9
Using library: libvirt 1.2.9
Using API: QEMU 1.2.9
Running hypervisor: QEMU 2.1.2

Are there any known bugs in those versions?

Thanks!



hzwuli...@gmail.com
 
From: Alexandre DERUMIER
Date: 2015-10-21 18:38
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
here a libvirt sample to enable iothreads:
 

   2

  
  
  

 
  
  
  
 

 

 
 
With this, you can scale with multiple disks. (but it should help a little bit 
with 1 disk too)
 
 
- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Mercredi 21 Octobre 2015 10:31:56
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
 
Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 
 
qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM 
images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 
hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 
 
vm config: 
 
 
 
 
 
 
 
 
 
 
 
*** 
 
 
 
 
Thanks! 
 
hzwuli...@gmail.com 
 
 
 
From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your host os, guest os , qemu version and vm config ? 
As an extra boost, you could enable iothread on virtio disk. 
(It's available on libvirt but not on openstack yet). 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor 
https://www.proxmox.com 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 06:11:20 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
Thanks for you reply. 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 
[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 
could achive about 18k iops. 
2. test the same volume in VM, achive about 4.3k iops 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 
[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
iodepth=32 
[task4] 
iodepth=32 
Using cep osd perf to check the osd latency, all less than 1 ms. 
Using iostat to check the osd %util, about 10 in case 2 test. 
Using dstat to check VM status: 
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system-- 
usr sys idl wai hiq siq| read writ| recv send| in out | int csw 
2 4 51 43 0 0| 0 17M| 997B 3733B| 0 0 |3476 6997 
2 5 51 43 0 0| 0 18M| 714B 4335B| 0 0 |3439 6915 
2 5 50 43 0 0| 0 17M| 594B 3150B| 0 0 |3294 6617 
1 3 52 44 0 0| 0 18M| 648B 3726B| 0 0 |3447 6991 
1 5 51 43 0 0| 0 18M| 582B 3208B| 0 0 |3467 7061 
Finally, using iptraf to check the package size in the VM, almost packages's 
size are around 1 to 70 and 71 to 140 bytes. That's different from real 
machine. 
But maybe iptraf on the VM can't prove anything, i check the real machine which 
the VM located on. 
It seems no abnormal. 
BTW, my VM is located on the ceph storage node. 
Anyone can give me more sugestions? 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-20 19:36 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
I'm able to reach around same performance with qemu-librbd vs qemu-krbd, 
when I compile qemu with jemalloc 
(http://git.qemu.org/?p=qemu.git;a=commit;h=7b01c

Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread John-Paul Robinson
A few clarifications on our experience:

* We have 200+ rbd images mounted on our RBD-NFS gateway.  (There's
nothing easier for a user to understand than "your disk is full".)

* I'd expect more contention potential with a single shared RBD back
end, but with many distinct and presumably isolated backend RBD images,
I've always been surprised that *all* the nfsd tasks hang.  This leads me
to think it's an nfsd issue rather than an rbd issue.  (I realize this
is an rbd list; just looking for shared experience. ;) )
 
* I haven't seen any difference between reads and writes.  Any access to
any backing RBD store from the NFS client hangs.

~jpr

On 10/22/2015 06:42 PM, Ryan Tokarek wrote:
>> On Oct 22, 2015, at 3:57 PM, John-Paul Robinson  wrote:
>>
>> Hi,
>>
>> Has anyone else experienced a problem with RBD-to-NFS gateways blocking
>> nfsd server requests when their ceph cluster has a placement group that
>> is not servicing I/O for some reason, eg. too few replicas or an osd
>> with slow request warnings?
> We have experienced exactly that kind of problem except that it sometimes 
> happens even when ceph health reports "HEALTH_OK". This has been incredibly 
> vexing for us. 
>
>
> If the cluster is unhealthy for some reason, then I'd expect your/our 
> symptoms as writes can't be completed. 
>
> I'm guessing that you have file systems with barriers turned on. Whichever 
> file system that has a barrier write stuck on the problem pg, will cause any 
> other process trying to write anywhere in that FS also to block. This likely 
> means a cascade of nfsd processes will block as they each try to service 
> various client writes to that FS. Even though, theoretically, the rest of the 
> "disk" (rbd) and other file systems might still be writable, the NFS 
> processes will still be in uninterruptible sleep just because of that stuck 
> write request (or such is my understanding). 
>
> Disabling barriers on the gateway machine might postpone the problem (never 
> tried it and don't want to) until you hit your vm.dirty_bytes or 
> vm.dirty_ratio thresholds, but it is dangerous as you could much more easily 
> lose data. You'd be better off solving the underlying issues when they happen 
> (too few replicas available or overloaded osds). 
>
>
> For us, even when the cluster reports itself as healthy, we sometimes have 
> this problem. All nfsd processes block. sync blocks. echo 3 > 
> /proc/sys/vm/drop_caches blocks. There is a persistent 4-8MB "Dirty" in 
> /proc/meminfo. None of the osds log slow requests. Everything seems fine on 
> the osds and mons. Neither CPU nor I/O load is extraordinary on the ceph 
> nodes, but at least one file system on the gateway machine will stop 
> accepting writes. 
>
> If we just wait, the situation resolves itself in 10 to 30 minutes. A forced 
> reboot of the NFS gateway "solves" the performance problem, but is annoying 
> and dangerous (we unmount all of the file systems that are still unmountable, 
> but the stuck ones lead us to a sysrq-b). 
>
> This is on Scientific Linux 6.7 systems with elrepo 4.1.10 Kernels running 
> Ceph Firefly (0.8.10) and XFS file systems exported over NFS and samba. 
>
> Ryan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
Found that issue; reverted the database to the non-backlog-plugin state,
created a test bug.  Retry?

On 10/22/2015 06:54 PM, Dan Mick wrote:
> I see that too.  I suspect this is because of leftover database columns
> from the backlogs plugin, which is removed.  Looking into it.
> 
> On 10/22/2015 06:43 PM, Kyle Bader wrote:
>> I tried to open a new issue and got this error:
>>
>> Internal error
>>
>> An error occurred on the page you were trying to access.
>> If you continue to experience problems please contact your Redmine
>> administrator for assistance.
>>
>> If you are the Redmine administrator, check your log files for details
>> about the error.
>>
>>
>> On Thu, Oct 22, 2015 at 6:15 PM, Dan Mick  wrote:
>>> Fixed a configuration problem preventing updating issues, and switched
>>> the mailer to use ipv4; if you updated and failed, or missed an email
>>> notification, that may have been why.
>>>
>>> On 10/22/2015 04:51 PM, Dan Mick wrote:
 It's back.  New DNS info is propagating its way around.  If you
 absolutely must get to it, newtracker.ceph.com is the new address, but
 please don't bookmark that, as it will be going away after the transition.

 Please let me know of any problems you have.
>>>
>>> ---
>>> Note: This list is intended for discussions relating to Red Hat Storage 
>>> products, customers and/or support. Discussions on GlusterFS and Ceph 
>>> architecture, design and engineering should go to relevant upstream mailing 
>>> lists.
>>
>>
>>
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
I see that too.  I suspect this is because of leftover database columns
from the backlogs plugin, which is removed.  Looking into it.

On 10/22/2015 06:43 PM, Kyle Bader wrote:
> I tried to open a new issue and got this error:
> 
> Internal error
> 
> An error occurred on the page you were trying to access.
> If you continue to experience problems please contact your Redmine
> administrator for assistance.
> 
> If you are the Redmine administrator, check your log files for details
> about the error.
> 
> 
> On Thu, Oct 22, 2015 at 6:15 PM, Dan Mick  wrote:
>> Fixed a configuration problem preventing updating issues, and switched
>> the mailer to use ipv4; if you updated and failed, or missed an email
>> notification, that may have been why.
>>
>> On 10/22/2015 04:51 PM, Dan Mick wrote:
>>> It's back.  New DNS info is propagating its way around.  If you
>>> absolutely must get to it, newtracker.ceph.com is the new address, but
>>> please don't bookmark that, as it will be going away after the transition.
>>>
>>> Please let me know of any problems you have.
>>
>> ---
>> Note: This list is intended for discussions relating to Red Hat Storage 
>> products, customers and/or support. Discussions on GlusterFS and Ceph 
>> architecture, design and engineering should go to relevant upstream mailing 
>> lists.
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
Fixed a configuration problem preventing updating issues, and switched
the mailer to use ipv4; if you updated and failed, or missed an email
notification, that may have been why.

On 10/22/2015 04:51 PM, Dan Mick wrote:
> It's back.  New DNS info is propagating its way around.  If you
> absolutely must get to it, newtracker.ceph.com is the new address, but
> please don't bookmark that, as it will be going away after the transition.
> 
> Please let me know of any problems you have.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
It's back.  New DNS info is propagating its way around.  If you
absolutely must get to it, newtracker.ceph.com is the new address, but
please don't bookmark that, as it will be going away after the transition.

Please let me know of any problems you have.

On 10/22/2015 04:09 PM, Dan Mick wrote:
> tracker.ceph.com down now
> 
> On 10/22/2015 03:20 PM, Dan Mick wrote:
>> tracker.ceph.com will be brought down today for upgrade and move to a
>> new host.  I plan to do this at about 4PM PST (40 minutes from now).
>> Expect a downtime of about 15-20 minutes.  More notification to follow.
>>
> 
> 


-- 
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread Ryan Tokarek

> On Oct 22, 2015, at 3:57 PM, John-Paul Robinson  wrote:
> 
> Hi,
> 
> Has anyone else experienced a problem with RBD-to-NFS gateways blocking
> nfsd server requests when their ceph cluster has a placement group that
> is not servicing I/O for some reason, eg. too few replicas or an osd
> with slow request warnings?

We have experienced exactly that kind of problem except that it sometimes 
happens even when ceph health reports "HEALTH_OK". This has been incredibly 
vexing for us. 


If the cluster is unhealthy for some reason, then I'd expect your/our symptoms 
as writes can't be completed. 

I'm guessing that you have file systems with barriers turned on. Whichever file
system has a barrier write stuck on the problem PG will cause any other
process trying to write anywhere in that FS to block as well. This likely means a
cascade of nfsd processes will block as they each try to service various client
writes to that FS. Even though, theoretically, the rest of the "disk" (rbd) and
other file systems might still be writable, the NFS processes will still be in
uninterruptible sleep just because of that stuck write request (or such is my
understanding).

Disabling barriers on the gateway machine might postpone the problem (never 
tried it and don't want to) until you hit your vm.dirty_bytes or vm.dirty_ratio 
thresholds, but it is dangerous as you could much more easily lose data. You'd 
be better off solving the underlying issues when they happen (too few replicas 
available or overloaded osds). 


For us, even when the cluster reports itself as healthy, we sometimes have this 
problem. All nfsd processes block. sync blocks. echo 3 > 
/proc/sys/vm/drop_caches blocks. There is a persistent 4-8MB "Dirty" in 
/proc/meminfo. None of the osds log slow requests. Everything seems fine on the 
osds and mons. Neither CPU nor I/O load is extraordinary on the ceph nodes, but 
at least one file system on the gateway machine will stop accepting writes. 
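When it happens, one way to see exactly where the writers are parked is to dump the blocked (D-state) tasks to the kernel log; a hedged sketch (requires sysrq to be enabled via kernel.sysrq):

sysctl vm.dirty_bytes vm.dirty_ratio      # confirm the writeback thresholds in play
echo w > /proc/sysrq-trigger              # dump uninterruptible (blocked) tasks
dmesg | tail -n 100                       # stack traces show what nfsd/xfs are waiting on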

If we just wait, the situation resolves itself in 10 to 30 minutes. A forced 
reboot of the NFS gateway "solves" the performance problem, but is annoying and 
dangerous (we unmount all of the file systems that are still unmountable, but 
the stuck ones lead us to a sysrq-b). 

This is on Scientific Linux 6.7 systems with elrepo 4.1.10 Kernels running Ceph 
Firefly (0.8.10) and XFS file systems exported over NFS and samba. 

Ryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
tracker.ceph.com down now

On 10/22/2015 03:20 PM, Dan Mick wrote:
> tracker.ceph.com will be brought down today for upgrade and move to a
> new host.  I plan to do this at about 4PM PST (40 minutes from now).
> Expect a downtime of about 15-20 minutes.  More notification to follow.
> 


-- 
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs stuck in active+clean+replay

2015-10-22 Thread Andras Pataki
Hi ceph users,

We’ve upgraded to 0.94.4 (all ceph daemons got restarted) – and are in the 
middle of doing some rebalancing due to crush changes (removing some disks).  
During the rebalance, I see that some placement groups get stuck in 
‘active+clean+replay’ for a long time (essentially until I restart the OSD they 
are on).  All IO for these PGs gets queued, and clients hang.

ceph health details the blocked ops in it:

4 ops are blocked > 2097.15 sec
1 ops are blocked > 131.072 sec
2 ops are blocked > 2097.15 sec on osd.41
2 ops are blocked > 2097.15 sec on osd.119
1 ops are blocked > 131.072 sec on osd.124

ceph pg dump | grep replay
dumped all in format plain
2.121b 3836 0 0 0 0 15705994377 3006 3006 active+clean+replay 2015-10-22 
14:12:01.104564 123840'2258640 125080:1252265 [41,111] 41 [41,111] 41 
114515'2258631 2015-10-20 18:44:09.757620 114515'2258631 2015-10-20 
18:44:09.757620
2.b4 3799 0 0 0 0 15604827445 3003 3003 active+clean+replay 2015-10-22 
13:57:25.490150 119558'2322127 125084:1174759 [119,75] 119 [119,75] 119 
114515'2322124 2015-10-20 11:00:51.448239 114515'2322124 2015-10-17 
09:22:14.676006

Both osd.41 and OSD.119 are doing this “replay”.
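For reference, the stuck requests and the PG's replay state can be inspected directly; a hedged sketch (run the daemon command on the node hosting each OSD, using its admin socket):

ceph daemon osd.41 dump_ops_in_flight     # what the blocked ops are waiting on
ceph pg 2.121b query                      # per-PG state, including replay/recovery info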

The end of the log of osd.41:

2015-10-22 10:44:35.727000 7f037929b700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.170:6913/121602 pipe(0x3b4d sd=125 :6827 s=2 pgs=618 cs=1 l=0 
c=0x374398c0).fault with nothing to send, going to standby
2015-10-22 10:50:25.954404 7f038adae700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6809/141110 pipe(0x3adff000 sd=229 :6827 s=2 pgs=94 cs=3 l=0 
c=0x3e9d0940).fault with nothing to send, going to standby
2015-10-22 12:11:28.029214 7f03a0e0d700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.106:6864/102556 pipe(0x40afe000 sd=621 :6827 s=2 pgs=91 cs=3 l=0 
c=0x3acf5860).fault with nothing to send, going to standby
2015-10-22 12:45:45.404765 7f038050d700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=0 pgs=0 cs=0 l=0 
c=0x37b3cec0).accept connect_seq 1 vs existing 1 state standby
2015-10-22 12:45:45.405232 7f038050d700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=0 pgs=0 cs=0 l=0 
c=0x37b3cec0).accept connect_seq 2 vs existing 1 state standby
2015-10-22 12:52:49.062752 7f036525c700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=0 pgs=0 cs=0 l=0 
c=0x37b3ba20).accept connect_seq 3 vs existing 3 state standby
2015-10-22 12:52:49.063169 7f036525c700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=0 pgs=0 cs=0 l=0 
c=0x37b3ba20).accept connect_seq 4 vs existing 3 state standby
2015-10-22 13:02:16.573546 7f038050d700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=2 pgs=110 cs=3 l=0 
c=0x37b92000).fault with nothing to send, going to standby
2015-10-22 13:07:58.667432 7f036525c700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=2 pgs=146 cs=5 l=0 
c=0x3e9d0940).fault with nothing to send, going to standby
2015-10-22 13:25:35.020722 7f038191a700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.111:6841/71447 pipe(0x3e78e000 sd=205 :6827 s=2 pgs=82 cs=3 l=0 
c=0x36bf5860).fault with nothing to send, going to standby
2015-10-22 13:45:48.610068 7f0361620700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6841/99063 pipe(0x3e43b000 sd=539 :6827 s=0 pgs=0 cs=0 l=0 
c=0x373e11e0).accept we reset (peer sent cseq 1), sending RESETSESSION
2015-10-22 13:45:48.880698 7f0361620700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6841/99063 pipe(0x3e43b000 sd=539 :6827 s=2 pgs=199 cs=1 l=0 
c=0x373e11e0).reader missed message?  skipped from seq 0 to 825623574
2015-10-22 14:11:32.967937 7f035d9e4700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6802/98037 pipe(0x3ce82000 sd=63 :43711 s=2 pgs=144 cs=3 l=0 
c=0x3bf8c100).fault with nothing to send, going to standby
2015-10-22 14:12:35.338635 7f03afffb700  0 log_channel(cluster) log [WRN] : 2 
slow requests, 2 included below; oldest blocked for > 30.079053 secs
2015-10-22 14:12:35.338875 7f03afffb700  0 log_channel(cluster) log [WRN] : 
slow request 30.079053 seconds old, received at 2015-10-22 14:12:05.259156: 
osd_op(client.734338.0:50618164 10b8f73.03ef [read 0~65536] 2.338a921b 
ack+read+known_if_redirected e124995) currently waiting for replay end
2015-10-22 14:12:35.339050 7f03afffb700  0 log_channel(cluster) log [WRN] : 
slow request 30.063995 seconds old, received at 2015-10-22 14:12:05.274213: 
osd_op(client.734338.0:50618166 10b8f73.03ef [read 65536~131072] 
2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay 
end
2015-10-22 14:13:11.817243 7f03afffb700  0 log_channel(cluster) log [WRN] : 2 
slow requests, 2 included below; oldest blocked for > 66.557970 secs
2015-10-22 14:13:11.817408 7f03afffb700  0 log_channel(cluster) log [WRN] : 
slow request 66.557970 seconds old, received at 2015-10-22 14:12:05.259156: 
osd_op

[ceph-users] tracker.ceph.com downtime today

2015-10-22 Thread Dan Mick
tracker.ceph.com will be brought down today for upgrade and move to a
new host.  I plan to do this at about 4PM PST (40 minutes from now).
Expect a downtime of about 15-20 minutes.  More notification to follow.

-- 
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Core dump when running OSD service

2015-10-22 Thread David Zafman


I was focused on fixing the OSD, but you need to determine if some 
misconfiguration or hardware issue caused a filesystem corruption.


David

On 10/22/15 3:08 PM, David Zafman wrote:


There is a corruption of the osdmaps on this particular OSD.  You need 
to determine which maps are bad, probably by bumping the osd debug level 
to 20.  Then transfer them from a working OSD.  The newest 
ceph-objectstore-tool has features to write the maps, but you'll need 
to build a version based on a v0.94.4 source tree.  I don't know if 
you can just copy files with names like 
"current/meta/osdmap.8__0_FD6E4D61__none" (map for epoch 8) between OSDs.


David

On 10/21/15 8:54 PM, James O'Neill wrote:
I have an OSD that didn't come up after a reboot. I was getting the 
error shown below. It was running 0.94.3, so I reinstalled all 
packages. I then upgraded everything to 0.94.4 hoping that would fix 
it but it hasn't. There are three OSDs, this is the only one having 
problems (it also contains the inconsistent pgs). Can anyone tell me 
what the problem might be?



root@dbp-ceph03:/srv/data# ceph status
   cluster 4f6fb784-bd17-4105-a689-e8d1b4bc5643
health HEALTH_ERR
   53 pgs inconsistent
   542 pgs stale
   542 pgs stuck stale
   5 requests are blocked > 32 sec
   85 scrub errors
   too many PGs per OSD (544 > max 300)
   noout flag(s) set
monmap e3: 3 mons at 
{dbp-ceph01=172.17.241.161:6789/0,dbp-ceph02=172.17.241.162:6789/0,dbp-ceph03=172.17.241.163:6789/0}
   election epoch 52, quorum 0,1,2 
dbp-ceph01,dbp-ceph02,dbp-ceph03

osdmap e107: 2 osds: 2 up, 2 in
   flags noout
 pgmap v65678: 1088 pgs, 9 pools, 55199 kB data, 173 objects
   2265 MB used, 16580 MB / 19901 MB avail
546 active+clean
489 stale+active+clean
 53 stale+active+clean+inconsistent


root@dbp-ceph02:~# /usr/bin/ceph-osd --cluster=ceph -i 1 -d
2015-10-22 14:15:48.312507 7f4edabec900 0 ceph version 0.94.4 
(95292699291242794510b39ffde3f4df67898d3a), process ceph-osd, pid 31215
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 
/var/lib/ceph/osd/ceph-1/journal
2015-10-22 14:15:48.352013 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) backend generic (magic 0xef53)
2015-10-22 14:15:48.355621 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is supported and appears to work
2015-10-22 14:15:48.355655 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-10-22 14:15:48.362016 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
syncfs(2) syscall fully supported (by glibc and kernel)
2015-10-22 14:15:48.372819 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) limited size xattrs
2015-10-22 14:15:48.387002 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD 
journal mode: checkpoint is not enabled
2015-10-22 14:15:48.394002 7f4edabec900 -1 journal 
FileJournal::_open: disabling aio for non-block journal. Use 
journal_force_aio to force use of aio anyway
2015-10-22 14:15:48.397803 7f4edabec900 0  
cls/hello/cls_hello.cc:271: loading cls_hello
terminate called after throwing an instance of 
'ceph::buffer::end_of_buffer'

 what(): buffer::end_of_buffer
*** Caught signal (Aborted) **
in thread 7f4edabec900
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb8560f]
12: (OSDService::try_get_map(unsigned int)+0x530) [0x6ac2c0]
13: (OSDService::get_map(unsigned int)+0xe) [0x70ad2e]
14: (OSD::init()+0x6ad) [0x6c5e0d]
15: (main()+0x2860) [0x6527e0]
16: (__libc_start_main()+0xf5) [0x7f4ed7d2aec5]
17: /usr/bin/ceph-osd() [0x66b887]
2015-10-22 14:15:48.412520 7f4edabec900 -1 *** Caught signal 
(Aborted) **

in thread 7f4edabec900

ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb8560f]
12: (OSDServi

Re: [ceph-users] Core dump when running OSD service

2015-10-22 Thread David Zafman


There is a corruption of the osdmaps on this particular OSD.  You need 
to determine which maps are bad, probably by bumping the osd debug level to 
20.  Then transfer them from a working OSD.  The newest 
ceph-objectstore-tool has features to write the maps, but you'll need to 
build a version based on a v0.94.4 source tree.  I don't know if you can 
just copy files with names like 
"current/meta/osdmap.8__0_FD6E4D61__none" (map for epoch 8) between OSDs.


David

On 10/21/15 8:54 PM, James O'Neill wrote:
I have an OSD that didn't come up after a reboot. I was getting the 
error shown below. It was running 0.94.3, so I reinstalled all packages. 
I then upgraded everything to 0.94.4 hoping that would fix it but it 
hasn't. There are three OSDs, this is the only one having problems (it 
also contains the inconsistent pgs). Can anyone tell me what the 
problem might be?



root@dbp-ceph03:/srv/data# ceph status
   cluster 4f6fb784-bd17-4105-a689-e8d1b4bc5643
health HEALTH_ERR
   53 pgs inconsistent
   542 pgs stale
   542 pgs stuck stale
   5 requests are blocked > 32 sec
   85 scrub errors
   too many PGs per OSD (544 > max 300)
   noout flag(s) set
monmap e3: 3 mons at 
{dbp-ceph01=172.17.241.161:6789/0,dbp-ceph02=172.17.241.162:6789/0,dbp-ceph03=172.17.241.163:6789/0}
   election epoch 52, quorum 0,1,2 
dbp-ceph01,dbp-ceph02,dbp-ceph03

osdmap e107: 2 osds: 2 up, 2 in
   flags noout
 pgmap v65678: 1088 pgs, 9 pools, 55199 kB data, 173 objects
   2265 MB used, 16580 MB / 19901 MB avail
546 active+clean
489 stale+active+clean
 53 stale+active+clean+inconsistent


root@dbp-ceph02:~# /usr/bin/ceph-osd --cluster=ceph -i 1 -d
2015-10-22 14:15:48.312507 7f4edabec900 0 ceph version 0.94.4 
(95292699291242794510b39ffde3f4df67898d3a), process ceph-osd, pid 31215
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 
/var/lib/ceph/osd/ceph-1/journal
2015-10-22 14:15:48.352013 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) backend generic (magic 0xef53)
2015-10-22 14:15:48.355621 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is supported and appears to work
2015-10-22 14:15:48.355655 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-10-22 14:15:48.362016 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
syncfs(2) syscall fully supported (by glibc and kernel)
2015-10-22 14:15:48.372819 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) limited size xattrs
2015-10-22 14:15:48.387002 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD journal 
mode: checkpoint is not enabled
2015-10-22 14:15:48.394002 7f4edabec900 -1 journal FileJournal::_open: 
disabling aio for non-block journal. Use journal_force_aio to force 
use of aio anyway
2015-10-22 14:15:48.397803 7f4edabec900 0  
cls/hello/cls_hello.cc:271: loading cls_hello
terminate called after throwing an instance of 
'ceph::buffer::end_of_buffer'

 what(): buffer::end_of_buffer
*** Caught signal (Aborted) **
in thread 7f4edabec900
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb8560f]
12: (OSDService::try_get_map(unsigned int)+0x530) [0x6ac2c0]
13: (OSDService::get_map(unsigned int)+0xe) [0x70ad2e]
14: (OSD::init()+0x6ad) [0x6c5e0d]
15: (main()+0x2860) [0x6527e0]
16: (__libc_start_main()+0xf5) [0x7f4ed7d2aec5]
17: /usr/bin/ceph-osd() [0x66b887]
2015-10-22 14:15:48.412520 7f4edabec900 -1 *** Caught signal (Aborted) **
in thread 7f4edabec900

ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb8560f]
12: (OSDService::try_get_map(unsigned int)+0x530) [0x6ac2c0]
13: (OSDService::get_map(unsigned int)+0xe) [0x70ad2e]
14: (OSD::init()+0x6ad) [0x6c5e0d]
15: (main()+0x2860) [0x6527e0]
16: (__libc_start_ma

Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread John-Paul Robinson


On 10/22/2015 04:03 PM, Wido den Hollander wrote:
> On 10/22/2015 10:57 PM, John-Paul Robinson wrote:
>> Hi,
>>
>> Has anyone else experienced a problem with RBD-to-NFS gateways blocking
>> nfsd server requests when their ceph cluster has a placement group that
>> is not servicing I/O for some reason, eg. too few replicas or an osd
>> with slow request warnings?
>>
>> We have an RBD-NFS gateway that stops responding to NFS clients
>> (interaction with RBD-backed NFS shares hang on the NFS client),
>> whenever our ceph cluster has some part of it in an I/O block
>> condition.   This issue only affects the ability of the nfsd processes
>> to serve requests to the client.  I can look at and access underlying
>> mounted RBD containers without issue, although they appear hung from the
>> NFS client side.   The gateway node load numbers spike to a number that
>> reflects the number of nfsd processes, but the system is otherwise
>> untaxed (unlike the case in a normal high os load, ie. i can type and
>> run commands with normal responsiveness.)
>>
> Well, that is normal I think. Certain objects become unresponsive if a
> PG is not serving I/O.
>
> With a simple 'ls' or 'df -h' you might not be touching those objects,
> so for you it seems like everything is functioning.
>
> The nfsd process however might be hung due to a blocking I/O call. That
> is completely normal and to be expected.

I agree that an nfsd process blocking on a blocked backend I/O request
is expected and normal.

> That it hangs the complete NFS server might be just a side-effect on how
> nfsd was written.

Hanging all nfsd processes is the part I find unexpected.  I'm just
wondering if someone has experience with this or if this is a known nfsd
issue.

> It might be that Ganesha works better for you:
> http://blog.widodh.nl/2014/12/nfs-ganesha-with-libcephfs-on-ubuntu-14-04/

Thanks, Ganesha looks very interesting!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [0.94.4] radosgw initialization timeout, failed to initialize

2015-10-22 Thread James O'Neill
I upgraded to 0.94.4 yesterday and now radosgw will not run on any of 
the servers. The service itself will run, but it's not listening (using 
civetweb with port 80 specified). Run manually, I get the following output:


root@dbp-ceph01:~# /usr/bin/radosgw --cluster=ceph --id radosgw.gateway 
-d
2015-10-23 08:23:31.364333 7fb5fba58840 0 ceph version 0.94.4 
(95292699291242794510b39ffde3f4df67898d3a), process radosgw, pid 17928
2015-10-23 08:23:31.380126 7fb5fba56700 0 -- :/1017928 >> 
127.0.0.1:6789/0 pipe(0x2fc8aa0 sd=7 :0 s=1 pgs=0 cs=0 l=1 
c=0x2fccd90).fault
2015-10-23 08:23:34.389209 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:23:34.389933 7fb5fba58840 20 get_obj_state: 
rctx=0x7fffab0b0490 obj=.rgw.root:default.region state=0x2fc9b00 
s->prefetch_data=0
2015-10-23 08:23:34.397513 7fb5fba58840 20 get_obj_state: s->obj_tag 
was set empty
2015-10-23 08:23:34.397696 7fb5fba58840 20 get_obj_state: 
rctx=0x7fffab0b0490 obj=.rgw.root:default.region state=0x2fc9b00 
s->prefetch_data=0

2015-10-23 08:23:34.397767 7fb5fba58840 20 rados->read ofs=0 len=524288
2015-10-23 08:23:34.399305 7fb5fba58840 20 rados->read r=0 bl.length=17
2015-10-23 08:23:34.399553 7fb5fba58840 20 get_obj_state: 
rctx=0x7fffab0b0490 obj=.rgw.root:region_info.default state=0x2fc9ad0 
s->prefetch_data=0
2015-10-23 08:23:34.401343 7fb5fba58840 20 get_obj_state: s->obj_tag 
was set empty
2015-10-23 08:23:34.401516 7fb5fba58840 20 get_obj_state: 
rctx=0x7fffab0b0490 obj=.rgw.root:region_info.default state=0x2fc9ad0 
s->prefetch_data=0

2015-10-23 08:23:34.401585 7fb5fba58840 20 rados->read ofs=0 len=524288
2015-10-23 08:23:34.403133 7fb5fba58840 20 rados->read r=0 bl.length=153
2015-10-23 08:23:34.403218 7fb5fba58840 20 get_obj_state: 
rctx=0x7fffab0b0630 obj=.rgw.root:zone_info.default state=0x2fc9ad0 
s->prefetch_data=0
2015-10-23 08:23:34.405494 7fb5fba58840 20 get_obj_state: s->obj_tag 
was set empty
2015-10-23 08:23:34.405528 7fb5fba58840 20 get_obj_state: 
rctx=0x7fffab0b0630 obj=.rgw.root:zone_info.default state=0x2fc9ad0 
s->prefetch_data=0

2015-10-23 08:23:34.405535 7fb5fba58840 20 rados->read ofs=0 len=524288
2015-10-23 08:23:34.407244 7fb5fba58840 20 rados->read r=0 bl.length=678
2015-10-23 08:23:34.407322 7fb5fba58840 2 zone default is master
2015-10-23 08:23:34.407490 7fb5fba58840 20 get_obj_state: 
rctx=0x7fffab0b0630 obj=.rgw.root:region_map state=0x2fc9ad0 
s->prefetch_data=0
2015-10-23 08:23:56.389610 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:24:18.389851 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:24:40.390144 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:25:02.390385 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:25:24.390609 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:25:46.390908 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:26:08.391126 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:26:30.391357 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:26:52.391679 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:27:14.391980 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:27:36.392259 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:27:58.392475 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:28:20.392762 7fb5e5ffb700 2 
RGWDataChangesLog::ChangesRenewThread: start
2015-10-23 08:28:31.364740 7fb5f0204700 -1 Initialization timeout, 
failed to initialize



Between each of those ChangesRenewThread entries are lines like:

2015-10-23 08:35:08.379161 7f6efaffd700 1 -- 172.17.241.161:0/1018414 
--> 172.17.241.163:6800/31263 -- ping magic: 0 v1 -- ?+0 0x7f6ecc000ab0 
con 0x3947710



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread Wido den Hollander
On 10/22/2015 10:57 PM, John-Paul Robinson wrote:
> Hi,
> 
> Has anyone else experienced a problem with RBD-to-NFS gateways blocking
> nfsd server requests when their ceph cluster has a placement group that
> is not servicing I/O for some reason, eg. too few replicas or an osd
> with slow request warnings?
> 
> We have an RBD-NFS gateway that stops responding to NFS clients
> (interaction with RBD-backed NFS shares hang on the NFS client),
> whenever our ceph cluster has some part of it in an I/O block
> condition.   This issue only affects the ability of the nfsd processes
> to serve requests to the client.  I can look at and access underlying
> mounted RBD containers without issue, although they appear hung from the
> NFS client side.   The gateway node load numbers spike to a number that
> reflects the number of nfsd processes, but the system is otherwise
> untaxed (unlike the case in a normal high os load, ie. i can type and
> run commands with normal responsiveness.)
> 

Well, that is normal I think. Certain objects become unresponsive if a
PG is not serving I/O.

With a simple 'ls' or 'df -h' you might not be touching those objects,
so for you it seems like everything is functioning.

The nfsd process however might be hung due to a blocking I/O call. That
is completely normal and to be expected.

That it hangs the complete NFS server might be just a side-effect on how
nfsd was written.

It might be that Ganesha works better for you:
http://blog.widodh.nl/2014/12/nfs-ganesha-with-libcephfs-on-ubuntu-14-04/

> The behavior comes across like there is some nfsd global lock that an
> nfsd sets before requesting I/O from a backend device.  In the case
> above, the I/O request hangs on one RBD image affected by the I/O block
> caused by the problematic pg or OSD.   The nfsd request blocks on the
> ceph I/O and because it has set a global lock, all other nfsd processes
> are prevented from servicing requests to their clients.  The nfsd
> processes are now all in the wait queue causing the load number on the
> gateway system to spike. Once the Ceph I/O issues is resolved, the nfsd
> I/O request completes and all service returns to normal.  The load on
> the gateway drops to normal immediately and all NFS clients can again
> interact with the nfsd processes.  Throughout this time unaffected ceph
> objects remain available to other clients, eg. OpenStack volumes.
> 
> Our RBD-NFS gateway is running Ubuntu 12.04.5 with kernel
> 3.11.0-15-generic.  The ceph version installed on this client is 0.72.2,
> though I assume only the kernel resident RBD module matters.
> 
> Any thoughts or pointers appreciated.
> 
> ~jpr
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd unmap immediately consistent?

2015-10-22 Thread Allen Liao
Does ceph guarantee image consistency if an rbd image is unmapped on one
machine then immediately mapped on another machine?  If so, does the same
apply to issuing a snapshot command on machine B as soon as the unmap
command finishes on machine A?

In other words, does the unmap operation flush all changes to the ceph
cluster synchronously?
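
For concreteness, the sequence being asked about is roughly this (a sketch; the device path and image/snapshot names are just placeholders):

# machine A: detach the kernel mapping (this is where the flush question matters)
rbd unmap /dev/rbd0

# machine B: immediately re-map the same image...
rbd map rbd/myimage

# ...or, instead, take a snapshot straight away
rbd snap create rbd/myimage@after-unmap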
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread John-Paul Robinson
Hi,

Has anyone else experienced a problem with RBD-to-NFS gateways blocking
nfsd server requests when their ceph cluster has a placement group that
is not servicing I/O for some reason, eg. too few replicas or an osd
with slow request warnings?

We have an RBD-NFS gateway that stops responding to NFS clients
(interaction with RBD-backed NFS shares hang on the NFS client),
whenever our ceph cluster has some part of it in an I/O block
condition.   This issue only affects the ability of the nfsd processes
to serve requests to the client.  I can look at and access underlying
mounted RBD containers without issue, although they appear hung from the
NFS client side.   The gateway node load numbers spike to a number that
reflects the number of nfsd processes, but the system is otherwise
untaxed (unlike the case in a normal high os load, ie. i can type and
run commands with normal responsiveness.)

The behavior comes across like there is some nfsd global lock that an
nfsd sets before requesting I/O from a backend device.  In the case
above, the I/O request hangs on one RBD image affected by the I/O block
caused by the problematic pg or OSD.   The nfsd request blocks on the
ceph I/O and because it has set a global lock, all other nfsd processes
are prevented from servicing requests to their clients.  The nfsd
processes are now all in the wait queue causing the load number on the
gateway system to spike. Once the Ceph I/O issues is resolved, the nfsd
I/O request completes and all service returns to normal.  The load on
the gateway drops to normal immediately and all NFS clients can again
interact with the nfsd processes.  Throughout this time unaffected ceph
objects remain available to other clients, eg. OpenStack volumes.

Our RBD-NFS gateway is running Ubuntu 12.04.5 with kernel
3.11.0-15-generic.  The ceph version installed on this client is 0.72.2,
though I assume only the kernel resident RBD module matters.

Any thoughts or pointers appreciated.

~jpr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg incomplete state

2015-10-22 Thread John-Paul Robinson
Greg,

Thanks for providing this background on the incomplete state.

With that context, and a little more digging online and in our
environment, I was able to resolve the issue. My cluster is back in
health ok.

The key to fixing the incomplete state was the information provided by
pg query.  I did not have to change the min_size setting.  In addition
to your comments, these two references were helpful.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-August/042102.html
http://tracker.ceph.com/issues/5226


The tail of `ceph pg 3.ea query` showed there were a number of osds
involved in servicing the backfill.

 "probing_osds": [
10,
11,
30,
37,
39,
54],
  "down_osds_we_would_probe": [],
  "peering_blocked_by": []},
{ "name": "Started",
  "enter_time": "2015-10-21 14:39:13.824613"}]}

After checking all the OSDs, I confirmed that only osd.11 had the pg
data and all the rest had an empty dir for pg 3.ea.  Because osd 10 was
listed first and had an empty copy of the pg, my assumption was it was
blocking the backfill.  I stopped osd.10 briefly and the state of pg
3.ea immediately entered "active+degraded+remapped+backfilling".  After
the backfill started, I started osd.10.  In particular, osd 11 became the
primary (as desired) and began backfilling osd 30.

 { "state": "active+degraded+remapped+backfilling",
  "up": [
30,
11],
  "acting": [
11,
30],

osd.10 was no longer holding up the start of backfill operation:

"recovery_state": [
{ "name": "Started\/Primary\/Active",
  "enter_time": "2015-10-22 12:46:50.907955",
  "might_have_unfound": [
{ "osd": 10,
  "status": "not queried"}],
  "recovery_progress": { "backfill_target": 30,
  "waiting_on_backfill": 0,

Based on the steps that triggered the original incomplete state, my
guess is that when I took osd.30 down and out to reformat, a number of
alternates (including osd.10)  were mapped as backfill targets for the
pg.  These operations didn't have a chance to start up before osd 30's
reformat completed and was back in the cluster.  At that point, pg 3.ea
was remapped again, leaving osd 10 at the top of the list.  Not having
any data, it blocked the backfill from osd 11 from starting.

Not sure if that was the exact cause, but it makes some sense.
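
For anyone hitting the same thing, the workaround boiled down to roughly the following (a sketch; osd ids as in this thread, and the service commands assume the sysvinit scripts shipped with dumpling-era packages):

ceph pg 3.ea query | tail -n 40      # check probing_osds / recovery_state for the blocker
service ceph stop osd.10             # briefly stop the empty OSD at the top of the probe list
ceph -w                              # watch 3.ea switch to active+degraded+remapped+backfilling
service ceph start osd.10            # bring it back once backfill is under way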

Thanks again for pointing me in a useful direction.

~jpr

On 10/21/2015 03:01 PM, Gregory Farnum wrote:
> I don't remember the exact timeline, but min_size is designed to
> prevent data loss from under-replicated objects (ie, if you only have
> 1 copy out of 3 and you lose that copy, you're in trouble, so maybe
> you don't want it to go active). Unfortunately it could also prevent
> the OSDs from replicating/backfilling the data to new OSDs in the case
> where you only had one copy left — that's fixed now, but wasn't
> initially. And in that case it reported the PG as incomplete (in later
> versions, PGs in this state get reported as undersized).
>
> So if you drop the min_size to 1, it will allow new writes to the PG
> (which might not be great), but it will also let the OSD go into the
> backfilling state. (At least, assuming the number of replicas is the
> only problem.). Based on your description of the problem I think this
> is the state you're in, and decreasing min_size is the solution.
> *shrug*
> You could also try and do something like extracting the PG from osd.11
> and copying it to osd.30, but that's quite tricky without the modern
> objectstore tool stuff, and I don't know if any of that works on
> dumpling (which it sounds like you're on — incidentally, you probably
> want to upgrade from that).
> -Greg
>
> On Wed, Oct 21, 2015 at 12:55 PM, John-Paul Robinson  wrote:
>> Greg,
>>
>> Thanks for the insight.  I suspect things are somewhat sane given that I
>> did erase the primary (osd.30) and the secondary (osd.11) still contains
>> pg data.
>>
>> If I may, could you clarify the process of backfill a little?
>>
>> I understand the min_size allows I/O on the object to resume while there
>> are only that many replicas (ie. 1 once changed) and this would let
>> things move forward.
>>
>> I would expect, however, that some backfill would already be on-going
>> for pg 3.ea on osd.30.  As far as I can tell, there isn't anything
>> happening.  The pg 3.ea directory is just as empty today as it was
>> yesterday.
>>
>> Will changing the min_size actually trigger backfill to begin for an
>> object if has stalled or never got started?
>>
>> An alternative idea I had was to take osd.30 back out of the cluster so
>> that pg 3.ea [30,11] would get mapped to some other osd to maintain
>> replication.  This seems a bit heavy handed though, given that only this
>> one pg is affected.
>>
>> Thanks for any follow up.
>>
>> ~jpr
>>
>>
>> On 10/21/2015 01:21 PM, Gregory Farnum wr

Re: [ceph-users] ceph-deploy for "deb http://ceph.com/debian-hammer/ trusty main"

2015-10-22 Thread David Clarke
On 23/10/15 09:08, Kjetil Jørgensen wrote:
> Hi,
> 
> this seems to not get me ceph-deploy from ceph.com.
> 
> http://download.ceph.com/debian-hammer/pool/main/c/ceph/ceph_0.94.4-1trusty_amd64.deb
> does seem to contain /usr/share/man/man8/ceph-deploy.8.gz, which
> conflicts with ceph-deploy from elsewhere (ubuntu).
> 
> Looking at:
> http://download.ceph.com/debian-hammer/dists/trusty/main/binary-amd64/Packages
> There's no ceph-deploy package in there (there is if you replace hammer
> with giant).
> 
> Is ceph-deploy-by-debian-package from ceph.com discontinued ?

This has been raised in the bug tracker a couple of times:

http://tracker.ceph.com/issues/13544
http://tracker.ceph.com/issues/13548

The actual .deb files are in the repository, but not mentioned in the
Packages files, so it looks like something has gone awry with the
repository build scripts.

A direct download is available at:

http://download.ceph.com/debian-hammer/pool/main/c/ceph-deploy/ceph-deploy_1.5.28trusty_all.deb

That version does not include /usr/share/man/man8/ceph-deploy.8.gz, and
so does not cause issues when installed alongside ceph 0.94.4.
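
Until the Packages index is fixed, pulling the .deb directly should work, for example (a sketch; run apt-get -f install afterwards if dpkg complains about missing python dependencies):

wget http://download.ceph.com/debian-hammer/pool/main/c/ceph-deploy/ceph-deploy_1.5.28trusty_all.deb
sudo dpkg -i ceph-deploy_1.5.28trusty_all.deb
sudo apt-get -f install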


-- 
David Clarke
Systems Architect
Catalyst IT



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy for "deb http://ceph.com/debian-hammer/ trusty main"

2015-10-22 Thread Kjetil Jørgensen
Hi,

this seems to not get me ceph-deploy from ceph.com.

http://download.ceph.com/debian-hammer/pool/main/c/ceph/ceph_0.94.4-1trusty_amd64.deb
does seem to contain /usr/share/man/man8/ceph-deploy.8.gz, which conflicts
with ceph-deploy from elsewhere (ubuntu).

Looking at:
http://download.ceph.com/debian-hammer/dists/trusty/main/binary-amd64/Packages
There's no ceph-deploy package in there (there is if you replace hammer
with giant).

Is ceph-deploy-by-debian-package from ceph.com discontinued ?

Cheers,
-- 
Kjetil Joergensen 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems with ceph_rest_api after update

2015-10-22 Thread Jon Heese
John,

Aha, thanks for that -- that got me closer to the problem.

I forgot an important detail: A few days before the upgrade, I set the cluster 
and public networks in the config files on the nodes to the "back-end" network, 
which the MON nodes don't have access to.  I suspected that this was a bad idea 
at the time, but since it didn't break anything (we are still in test mode on 
this cluster, so downtime is completely fine), I figured it somehow didn’t 
matter.  I must have forgotten to restart the ceph service on the MONs so the 
symptom didn't appear until the ceph upgrade.

I just switched the public network back to the "front-end network", which the 
MONs do have access to, and now the ceph_rest_api runs fine (and your "tell 
osd.0 version" does as well).  So that problem's solved.

But now we're back to the original problem, which is why I was monkeying with 
the "public network" config entry to begin with.  Let me explain:

As I said, we have two separate networks:

10.197.5.0/24 - The "front-end" network, "skinny pipe", all 1Gbe, intended to 
be a management or control plane network
10.174.1.0/24 - The "back-end" network, "fat pipe", all OSD nodes use 2x bonded 
10Gbe, intended to be a data network

So we want all of the OSD traffic to go over the "back end", and the MON 
traffic to go over the "front end".  We thought the following would do that:

public network = 10.197.5.0/24   # skinny pipe, mgmt & MON traffic
cluster network = 10.174.1.0/24  # fat pipe, OSD traffic

But that doesn't seem to be the case -- iftop and netstat show that little/no 
OSD communication is happening over the 10.174.1 network and it's all happening 
over the 10.197.5 network.

What configuration should we be running to enforce the networks per our design? 
 Thanks!

Jon Heese
Systems Engineer
INetU Managed Hosting
P: 610.266.7441 x 261
F: 610.266.7434
www.inetu.net

** This message contains confidential information, which also may be 
privileged, and is intended only for the person(s) addressed above. Any 
unauthorized use, distribution, copying or disclosure of confidential and/or 
privileged information is strictly prohibited. If you have received this 
communication in error, please erase all copies of the message and its 
attachments and notify the sender immediately via reply e-mail. **
-Original Message-
From: John Spray [mailto:jsp...@redhat.com] 
Sent: Thursday, October 22, 2015 12:48 PM
To: Jon Heese 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Problems with ceph_rest_api after update

On Thu, Oct 22, 2015 at 3:36 PM, Jon Heese  wrote:
> Hello,
>
>
>
> We are running a Ceph cluster with 3x CentOS 7 MON nodes, and after we 
> updated the ceph packages on the MONs yesterday (from 0.94.3 to 
> 0.94.4), the ceph_rest_api started refusing to run, giving the 
> following error 30 seconds after it’s started:

Weird.  Does this work?
"ceph --id admin tell osd.0 version"

get_command_descriptions is ceph_rest_api's way of asking an OSD to tell it 
what operations are supported.  It's sent from ceph_rest_api to an OSD the same 
way a 'tell' command is sent from the CLI (although you can't actually issue 
get_command_descriptions with the CLI).

ceph_rest_api is picking the last up OSD it can see, as an arbitrary place to 
send the query, so if you have for example an up OSD that isn't really 
responsive, it could cause a problem.

John

>
>
>
> [root@ceph-mon01 ~]# /usr/bin/ceph-rest-api -c /etc/ceph/ceph.conf 
> --cluster ceph -i admin
>
> Traceback (most recent call last):
>
>   File "/usr/bin/ceph-rest-api", line 59, in 
>
> rest,
>
>   File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 503, 
> in generate_app
>
> addr, port = api_setup(app, conf, cluster, clientname, clientid, 
> args)
>
>   File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 145, 
> in api_setup
>
> target=('osd', int(osdid)))
>
>   File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 83, 
> in get_command_descriptions
>
> raise EnvironmentError(ret, err)
>
> EnvironmentError: [Errno -4] Can't get command descriptions:
>
>
>
> Nothing else was changed, only the packages were updated.  I’ve looked 
> at the python, and it seems to be timing out waiting for this line to 
> complete, but I’m not sure where to look next in terms of what 
> “get_command_descriptions” actually does:
>
>
>
> ret, outbuf, outs = json_command(cluster, target,
>
>  
> prefix='get_command_descriptions',
>
>  timeout=30)
>
>
>
> Is this a known issue?  If not, does anyone have any suggestions of 
> how to further troubleshoot this further?  Thanks in advance.
>
>
>
> Jon Heese
> Systems Engineer
> INetU Managed Hosting
> P: 610.266.7441 x 261
> F: 610.266.7434
> www.inetu.net
>
> ** This message contains confidential information, which also may be 
> privileged, and is intended only for the person(s) addressed above. 
> Any un

Re: [ceph-users] Problems with ceph_rest_api after update

2015-10-22 Thread John Spray
On Thu, Oct 22, 2015 at 3:36 PM, Jon Heese  wrote:
> Hello,
>
>
>
> We are running a Ceph cluster with 3x CentOS 7 MON nodes, and after we
> updated the ceph packages on the MONs yesterday (from 0.94.3 to 0.94.4), the
> ceph_rest_api started refusing to run, giving the following error 30 seconds
> after it’s started:

Weird.  Does this work?
"ceph --id admin tell osd.0 version"

get_command_descriptions is ceph_rest_api's way of asking an OSD to
tell it what operations are supported.  It's sent from ceph_rest_api
to an OSD the same way a 'tell' command is sent from the CLI (although
you can't actually issue get_command_descriptions with the CLI).

ceph_rest_api is picking the last up OSD it can see, as an arbitrary
place to send the query, so if you have for example an up OSD that
isn't really responsive, it could cause a problem.

John

>
>
>
> [root@ceph-mon01 ~]# /usr/bin/ceph-rest-api -c /etc/ceph/ceph.conf --cluster
> ceph -i admin
>
> Traceback (most recent call last):
>
>   File "/usr/bin/ceph-rest-api", line 59, in 
>
> rest,
>
>   File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 503, in
> generate_app
>
> addr, port = api_setup(app, conf, cluster, clientname, clientid, args)
>
>   File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 145, in
> api_setup
>
> target=('osd', int(osdid)))
>
>   File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 83, in
> get_command_descriptions
>
> raise EnvironmentError(ret, err)
>
> EnvironmentError: [Errno -4] Can't get command descriptions:
>
>
>
> Nothing else was changed, only the packages were updated.  I’ve looked at
> the python, and it seems to be timing out waiting for this line to complete,
> but I’m not sure where to look next in terms of what
> “get_command_descriptions” actually does:
>
>
>
> ret, outbuf, outs = json_command(cluster, target,
>
>  prefix='get_command_descriptions',
>
>  timeout=30)
>
>
>
> Is this a known issue?  If not, does anyone have any suggestions of how to
> further troubleshoot this further?  Thanks in advance.
>
>
>
> Jon Heese
> Systems Engineer
> INetU Managed Hosting
> P: 610.266.7441 x 261
> F: 610.266.7434
> www.inetu.net
>
> ** This message contains confidential information, which also may be
> privileged, and is intended only for the person(s) addressed above. Any
> unauthorized use, distribution, copying or disclosure of confidential and/or
> privileged information is strictly prohibited. If you have received this
> communication in error, please erase all copies of the message and its
> attachments and notify the sender immediately via reply e-mail. **
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph and upgrading OS version

2015-10-22 Thread Andrei Mikhailovsky
Thanks Luis, 

I was hoping I could do a dist-upgrade one node at a time. My cluster is very 
small, only 18 osds between the two osd servers and three mons 


I guess it should be safe to shut down ceph on one of the osd servers, do the 
upgrade, reboot, wait for the PGs to become active+clean and then follow the same 
procedure on the second osd server. 
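
Concretely, something like this on each osd server in turn (just a sketch; it assumes the stock init scripts and Ubuntu's do-release-upgrade tool, with noout set so the down OSDs are not rebalanced away during the upgrade):

ceph osd set noout        # keep the cluster from marking the down OSDs out
service ceph stop         # stop all ceph daemons on this osd server
do-release-upgrade        # 12.04 -> 14.04, then reboot
# once the node is back and its OSDs have rejoined:
ceph osd unset noout
ceph -s                   # wait for HEALTH_OK / active+clean before touching the next server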

Do you think this could work? 

Performance-wise, I do not have a great IO demand, in particular over a 
weekend. 

Thanks 

Andrei 
- Original Message -

From: "Luis Periquito"  
To: "Andrei Mikhailovsky"  
Cc: "ceph-users"  
Sent: Thursday, 22 October, 2015 12:42:36 PM 
Subject: Re: [ceph-users] ceph and upgrading OS version 

There are several routes you can follow for this work. The best one 
will depend on cluster size, current data, pool definition (size), 
performance expectations, etc. 

They range from doing dist-upgrade a node at a time, to 
remove-upgrade-then-add nodes to the cluster. But knowing that ceph is 
"self-healing" if you aren't somewhat careful you can do an upgrade 
online without much disruption if any (performance will always be 
impacted). 

On Thu, Oct 22, 2015 at 12:22 PM, Andrei Mikhailovsky  
wrote: 
> 
> Any thoughts anyone? 
> 
> Is it safe to perform OS version upgrade on the osd and mon servers? 
> 
> Thanks 
> 
> Andrei 
> 
>  
> From: "Andrei Mikhailovsky"  
> To: ceph-us...@ceph.com 
> Sent: Tuesday, 20 October, 2015 8:05:19 PM 
> Subject: [ceph-users] ceph and upgrading OS version 
> 
> Hello everyone 
> 
> I am planning to upgrade my ceph servers from Ubuntu 12.04 to 14.04 and I am 
> wondering if you have a recommended process of upgrading the OS version 
> without causing any issues to the ceph cluster? 
> 
> Many thanks 
> 
> Andrei 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] when an osd is started up, IO will be blocked

2015-10-22 Thread wangsongbo

Hi all,

When an OSD is started, the related IO will be blocked.
According to the test results, the higher the IOPS the clients send, the 
longer this blocking lasts.
Adjusting all of the parameters associated with recovery operations was 
also found to be useless.
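
For reference, the recovery parameters in question were tuned roughly like this (a sketch of typical minimum values; even at these settings the blocking was still observed):

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'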


How can we reduce the impact of this process on the IO?

Thanks and Regards,
WangSongbo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problems with ceph_rest_api after update

2015-10-22 Thread Jon Heese
Hello,

We are running a Ceph cluster with 3x CentOS 7 MON nodes, and after we updated 
the ceph packages on the MONs yesterday (from 0.94.3 to 0.94.4), the 
ceph_rest_api started refusing to run, giving the following error 30 seconds 
after it's started:

[root@ceph-mon01 ~]# /usr/bin/ceph-rest-api -c /etc/ceph/ceph.conf --cluster 
ceph -i admin
Traceback (most recent call last):
  File "/usr/bin/ceph-rest-api", line 59, in 
rest,
  File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 503, in 
generate_app
addr, port = api_setup(app, conf, cluster, clientname, clientid, args)
  File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 145, in 
api_setup
target=('osd', int(osdid)))
  File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 83, in 
get_command_descriptions
raise EnvironmentError(ret, err)
EnvironmentError: [Errno -4] Can't get command descriptions:

Nothing else was changed, only the packages were updated.  I've looked at the 
python, and it seems to be timing out waiting for this line to complete, but 
I'm not sure where to look next in terms of what "get_command_descriptions" 
actually does:

ret, outbuf, outs = json_command(cluster, target,
 prefix='get_command_descriptions',
 timeout=30)

Is this a known issue?  If not, does anyone have any suggestions of how to 
further troubleshoot this further?  Thanks in advance.

Jon Heese
Systems Engineer
INetU Managed Hosting
P: 610.266.7441 x 261
F: 610.266.7434
www.inetu.net
** This message contains confidential information, which also may be 
privileged, and is intended only for the person(s) addressed above. Any 
unauthorized use, distribution, copying or disclosure of confidential and/or 
privileged information is strictly prohibited. If you have received this 
communication in error, please erase all copies of the message and its 
attachments and notify the sender immediately via reply e-mail. **
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Network performance

2015-10-22 Thread Udo Lembke
Hi Jonas,
you can create a bond over multiple NICs (which modes are possible depends on 
your switch) to use one IP address but
more than one NIC.
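
For example, with iproute2 an LACP bond over two spare interfaces looks roughly like this (a sketch only; the interface names and the address are placeholders, the mode has to match what your switch supports, and normally you would persist this in the distribution's own network configuration instead):

ip link add bond0 type bond mode 802.3ad
ip link set eth2 down
ip link set eth2 master bond0
ip link set eth3 down
ip link set eth3 master bond0
ip link set bond0 up
ip addr add 172.16.3.1/24 dev bond0     # then point "cluster addr" / "public addr" at this one address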

Udo

On 21.10.2015 10:23, Jonas Björklund wrote:
> Hello,
> 
> In the configuration I have read about "cluster network" and "cluster addr".
> Is it possible to make the OSDs listen to multiple IP addresses?
> I want to use several network interfaces to increase performance.
> 
> I have tried
> 
> [global]
> cluster network = 172.16.3.0/24,172.16.4.0/24
> 
> [osd.0]
> public addr = 0.0.0.0
> #public addr = 172.16.3.1
> #public addr = 172.16.4.1
> 
> But I can't get them to listen to both 172.16.3.1 and 172.16.4.1 at the same 
> time.
> 
> Any ideas?
> 
> /Jonas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse and its memory usage

2015-10-22 Thread Gregory Farnum
On Thu, Oct 22, 2015 at 1:59 AM, Yan, Zheng  wrote:
> direct IO only bypasses the kernel page cache. Data can still be cached in
> ceph-fuse. If I'm correct, the test repeatedly writes data to 8M
> files. The cache makes multiple writes assimilate into a single OSD
> write

Ugh, of course. I don't see a tracker ticket for that, so I made one:
http://tracker.ceph.com/issues/13569
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to understand deep flatten implementation

2015-10-22 Thread Jason Dillaman
The flatten operation is implemented by writing zero bytes to each object 
within a clone image.  This causes librbd to copyup the backing object from the 
parent image to the clone image.  A copyup is just a guarded write that will 
not write to the clone if the object already exists (i.e. new data has already 
been written to clone).

Without the deep-flatten changes, when you attempt to flatten a clone with 
snapshots, the copyup "write" operation only affects the HEAD version of the 
clone image objects.  Therefore, post-flatten, if I attempt to read from a 
snapshot taken before the flatten copyups, it would appear as if the object 
doesn't exist (since it doesn't exist until the HEAD version).  That is why 
flatten would not previously detach from the parent image if you have snapshots 
-- since reading from your snapshot might result in a missing object that 
exists within the parent.

The change implemented by deep-flatten modifies the copyup logic to simulate 
that the write operation occurred before all snapshots were taken.  Therefore, 
the clone object will now be visible within all snapshots (not just the HEAD 
version).  This new copyup logic is actually always enabled in infernalis, 
regardless of whether or not you enable deep-flatten.  The only purpose for the 
"deep-flatten" feature bit is to prevent pre-infernalis clients from writing to 
the image with the old copyup logic.  When it's enabled, we know that it is 
safe to detach a clone from a parent image even if snapshots exist due to the 
changes to copyup.
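
To tie this back to the workflow in the original question, a minimal sequence looks roughly like this (a sketch; pool/image names are placeholders, and on infernalis the deep-flatten feature has to be requested when the clone is created, e.g. via --image-feature, since it cannot be enabled later):

rbd snap create rbd/parent@base
rbd snap protect rbd/parent@base
rbd clone rbd/parent@base rbd/child
rbd snap create rbd/child@s1       # snapshot taken while the image is still a clone
rbd flatten rbd/child
rbd info rbd/child                 # without deep-flatten the parent link stays because of @s1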

-- 

Jason Dillaman 


- Original Message - 

> From: "Zhongyan Gu" 
> To: dilla...@redhat.com
> Sent: Thursday, October 22, 2015 5:11:56 AM
> Subject: how to understand deep flatten implementation

> Hi Jason,
> I am a ceph user. I see you are the main developer in rbd module and one of
> the features you worked on is deep flatten, so I turn to you for help.
> I tried to figure out how deep flatten works, but failed. Things begin like
> that:
> I used the firefly version to do the test:
> Image —make_snap->clone--> volume1 --make_snap--> flatten volume1.
> Then I found volume1’s snap still has parent which points image’s snap.
> Flatten didn’t decouple the snapshots.
> I noticed from ceph website that a feature named deep flatten will implement
> those things.
> I think the only missed step is deregister child clone from parent.
> However, when I reviewed the deep flatten implementation, the logic is far
> from what I thought.

> Could you please explain why a copyup need to send an empty snap context?
> I read the description of http://tracker.ceph.com/issues/10154 ,

> What does it mean “ the osd will treat it as an old write and will logically
> replace all versions of the object that did not exist ”

> Looking forward to your reply.

> Cheers
> Cory
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph and upgrading OS version

2015-10-22 Thread Luis Periquito
There are several routes you can follow for this work. The best one
will depend on cluster size, current data, pool definition (size),
performance expectations, etc.

They range from doing dist-upgrade a node at a time, to
remove-upgrade-then-add nodes to the cluster. But knowing that ceph is
"self-healing" if you aren't somewhat careful you can do an upgrade
online without much disruption if any (performance will always be
impacted).

On Thu, Oct 22, 2015 at 12:22 PM, Andrei Mikhailovsky  wrote:
>
> Any thoughts anyone?
>
> Is it safe to perform OS version upgrade on the osd and mon servers?
>
> Thanks
>
> Andrei
>
> 
> From: "Andrei Mikhailovsky" 
> To: ceph-us...@ceph.com
> Sent: Tuesday, 20 October, 2015 8:05:19 PM
> Subject: [ceph-users] ceph and upgrading OS version
>
> Hello everyone
>
> I am planning to upgrade my ceph servers from Ubuntu 12.04 to 14.04 and I am
> wondering if you have a recommended process of upgrading the OS version
> without causing any issues to the ceph cluster?
>
> Many thanks
>
> Andrei
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph and upgrading OS version

2015-10-22 Thread Andrei Mikhailovsky


Any thoughts anyone? 

Is it safe to perform OS version upgrade on the osd and mon servers? 

Thanks 

Andrei 

- Original Message -

From: "Andrei Mikhailovsky"  
To: ceph-us...@ceph.com 
Sent: Tuesday, 20 October, 2015 8:05:19 PM 
Subject: [ceph-users] ceph and upgrading OS version 


Hello everyone 

I am planning to upgrade my ceph servers from Ubuntu 12.04 to 14.04 and I am 
wondering if you have a recommended process of upgrading the OS version without 
causing any issues to the ceph cluster? 

Many thanks 

Andrei 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-22 Thread Wido den Hollander
On 10/21/2015 11:25 AM, Jan Schermer wrote:
> 
>> On 21 Oct 2015, at 09:11, Wido den Hollander  wrote:
>>
>> On 10/20/2015 09:45 PM, Martin Millnert wrote:
>>> The thing that worries me with your next-gen design (actually your current 
>>> design aswell) is SSD wear. If you use Intel SSD at 10 DWPD, that's 
>>> 12TB/day per 64TB total.  I guess use case dependant,  and perhaps 1:4 
>>> write read ratio is quite high in terms of writes as-is.
>>> You're also throughput-limiting yourself to the pci-e bw of the NVME device 
>>> (regardless of NVRAM/SSD). Compared to traditonal interface, that may be ok 
>>> of course in relative terms. NVRAM vs SSD here is simply a choice between 
>>> wear (NVRAM as journal minimum), and cache hit probability (size).  
>>> Interesting thought experiment anyway for me, thanks for sharing Wido.
>>> /M
>>
>> We are looking at the PC 3600DC 1.2TB, according to the specs from
>> Intel: 10.95PBW
>>
>> Like I mentioned in my reply to Mark, we are still running on 1Gbit and
>> heading towards 10Gbit.
>>
>> Bandwidth isn't really a issue in our cluster. During peak moments we
>> average about 30k IOps through the cluster, but the TOTAL client I/O is
>> just 1Gbit Read and Write. Sometimes a bit higher, but mainly small I/O.
>>
>> Bandwidth-wise there is no need for 10Gbit, but we are doing it for the
>> lower latency and thus more IOps.
>>
>> Currently our S3700 SSDs are peaking at 50% utilization according to iostat.
>>
>> After 2 years of operation the lowest Media_Wearout_Indicator we see is
>> 33. On Intel SSDs this starts at 100 and counts down to 0. 0 indicating
>> that the SSD is worn out.
>>
>> So in 24 months we worn through 67% of the SSD. A quick calculation
>> tells me we still have 12 months left on that SSD before it dies.
> 
> Could you maybe run isdct and compare what it says about expected lifetime? I 
> think isdct will report a much longer lifetime than you expect.
> 
> For comparison one of my drives (S3610, 1.2TB) - this drive has 3 DWPD rating 
> (~6.5PB written)
> 
> 241 Total_LBAs_Written  0x0032   100   100   000Old_age   Always  
>  -   1487714 <-- units of 32MB, that translates to ~47TB
> 233 Media_Wearout_Indicator 0x0032   100   100   000Old_age   Always  
>  -   0 (maybe my smartdb needs updating, but this is what it says)
> 9 Power_On_Hours  0x0032   100   100   000Old_age   Always   
> -   1008
> 
> If I extrapolate this blindly I would expect the SSD to reach it's TBW of 
> 6.5PB in about 15 years.
> 
> But isdct says:
> EnduranceAnalyzer: 46.02 Years
> 
> If I reverse it and calculate the endurance based on smart values, that would 
> give the expected lifetime of over 18PB (which is not impossible at all), but 
> isdct is a bit smarter and looks at what the current use pattern is. It's 
> clearly not only about discarding the initial bursts when the drive was 
> filled during backfilling because it's not that much, and all my S3610 drives 
> indicate a similiar endurance of 40 years (+-10).
> 
> I'd trust isdct over extrapolated SMART values - I think the SSD will 
> actually switch to a different calculation scheme when it reaches certain 
> lifepoint (when all reserve blocks are used, or when first cells start to 
> die...) which is why there's a discrepancy.
> 


$ ./isdct show -intelssd
$ ./isdct set -intelssd 1 enduranceanalyzer=reset
<< WAIT >60 MIN >>
$ ./isdct show -a -intelssd 1

Told me:

DevicePath: /dev/sg3
DeviceStatus: Healthy
EnduranceAnalyzer: 0.38 Years
ErrorString:

So this SSD has little over 6 months before it dies. However, the
cluster has been in a backfill state for over 24 hours now and I've
taken this sample while a backfill was happening, probably not the best
time.

IF the 0.38 years is true, then the SSD had a lifespan of 2 years and 8
months as a Journaling / Bcache SSD.

Wido

> Jan
> 
> 
>>
>> But this is the lowest, other SSDs which were taken into production at
>> the same moment are ranging between 36 and 61.
>>
>> Also, when buying the 1.2TB SSD we'll probably allocate only 1TB of the
>> SSD and leave 200GB of cells spare so the Wear-Leveling inside the SSD
>> has some spare cells.
>>
>> Wido
>>
>>>
>>>  Original message 
>>> From: Wido den Hollander  
>>> Date: 20/10/2015  16:00  (GMT+01:00) 
>>> To: ceph-users  
>>> Subject: [ceph-users] Ceph OSDs with bcache experience 
>>>
>>> Hi,
>>>
>>> In the "newstore direction" thread on ceph-devel I wrote that I'm using
>>> bcache in production and Mark Nelson asked me to share some details.
>>>
>>> Bcache is running in two clusters now that I manage, but I'll keep this
>>> information to one of them (the one at PCextreme behind CloudStack).
>>>
>>> In this cluster has been running for over 2 years now:
>>>
>>> epoch 284353
>>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>>> created 2013-09-23 11:06:11.819520
>>> modified 2015-10-20 15:27:48.734213
>>>
>>> The system consists out of 39 hosts:
>>>
>>> 2U SuperMi

Re: [ceph-users] ceph-fuse and its memory usage

2015-10-22 Thread Yan, Zheng
On Thu, Oct 22, 2015 at 4:47 AM, Gregory Farnum  wrote:
> On Tue, Oct 13, 2015 at 10:09 PM, Goncalo Borges
>  wrote:
>> Hi all...
>>
>> Thank you for the feedback, and I am sorry for my delay in replying.
>>
>> 1./ Just to recall the problem, I was testing cephfs using fio in two
>> ceph-fuse clients:
>>
>> - Client A is in the same data center as all OSDs connected at 1 GbE
>> - Client B is in a different data center (in another city), also connected at
>> 1 GbE. However, I've seen that the connection is problematic and, sometimes,
>> the network performance is well below the theoretical 1 Gbps limit.
>> - Client A has 24 GB RAM + 98 GB of SWAP and client B has 48 GB of RAM + 98
>> GB of SWAP
>>
>> and I was seeing that Client B was giving much better fio throughput
>> because it was hitting the cache much more than Client A.
>>
>> --- * ---
>>
>> 2./ I suspected that Client B was hitting the cache because it had bad
>> connectivity to the Ceph cluster. I actually tried to sort that out and I
>> was able to nail the problem down to a bad switch. However, after that, I
>> still see the same behaviour, which I can reproduce in a systematic way.
>>
>> --- * ---
>>
>> 3./ In a new round of tests on Client B, I applied the following
>> procedure:
>>
>> 3.1/ This is the network statistics right before starting my fio test:
>>
>> * Printing network statistics:
>> * /sys/class/net/eth0/statistics/collisions: 0
>> * /sys/class/net/eth0/statistics/multicast: 453650
>> * /sys/class/net/eth0/statistics/rx_bytes: 437704562785
>> * /sys/class/net/eth0/statistics/rx_compressed: 0
>> * /sys/class/net/eth0/statistics/rx_crc_errors: 0
>> * /sys/class/net/eth0/statistics/rx_dropped: 0
>> * /sys/class/net/eth0/statistics/rx_errors: 0
>> * /sys/class/net/eth0/statistics/rx_fifo_errors: 0
>> * /sys/class/net/eth0/statistics/rx_frame_errors: 0
>> * /sys/class/net/eth0/statistics/rx_length_errors: 0
>> * /sys/class/net/eth0/statistics/rx_missed_errors: 0
>> * /sys/class/net/eth0/statistics/rx_over_errors: 0
>> * /sys/class/net/eth0/statistics/rx_packets: 387690140
>> * /sys/class/net/eth0/statistics/tx_aborted_errors: 0
>> * /sys/class/net/eth0/statistics/tx_bytes: 149206610455
>> * /sys/class/net/eth0/statistics/tx_carrier_errors: 0
>> * /sys/class/net/eth0/statistics/tx_compressed: 0
>> * /sys/class/net/eth0/statistics/tx_dropped: 0
>> * /sys/class/net/eth0/statistics/tx_errors: 0
>> * /sys/class/net/eth0/statistics/tx_fifo_errors: 0
>> * /sys/class/net/eth0/statistics/tx_heartbeat_errors: 0
>> * /sys/class/net/eth0/statistics/tx_packets: 241698327
>> * /sys/class/net/eth0/statistics/tx_window_errors: 0
>>
>> 3.2/ I then launched my fio test. Please note that I am dropping caches
>> before starting the test (sync; echo 3 > /proc/sys/vm/drop_caches). My
>> current fio test has nothing fancy. Here are the options:
>>
>> # cat fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.in
>> [fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036]
>> ioengine=libaio
>> iodepth=64
>> rw=write
>> bs=512K
>> direct=1
>> size=8192m
>> numjobs=128
>> filename=fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.data
>
> Oh right, so you're only using 8GB of data to write over (and you're
> hitting it a bunch of times). So if not for the direct IO flag this
> would sort of make sense.
>
> But with that, I'm very confused. There can be some annoying little
> pieces of making direct IO get passed correctly through all the FUSE
> interfaces, but I *thought* we were going through the hoops and making
> things work. Perhaps I am incorrect. Zheng, do you know anything about
> this?
>

Direct IO only bypasses the kernel page cache; data can still be cached in
ceph-fuse. If I'm correct, the test repeatedly writes data to the same 8 GB
file, and the cache lets multiple writes coalesce into a single OSD
write.

Regards
Yan, Zheng
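
If that is what is happening, one way to test it is to disable the ceph-fuse object cacher and re-run the job. A minimal ceph.conf sketch (assuming the client_oc option, which controls the ceph-fuse object cacher, exists in your release; the client must be remounted afterwards):

[client]
# disable the ceph-fuse object cacher so direct writes are not absorbed client-side
client_oc = false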


>>
>> I am not sure if it matters, but the layout of my directory is the following:
>>
>> # getfattr -n ceph.dir.layout /cephfs/sydney
>> getfattr: Removing leading '/' from absolute path names
>> # file: cephfs/sydney
>> ceph.dir.layout="stripe_unit=524288 stripe_count=8 object_size=4194304
>> pool=cephfs_dt"
>>
>> 3.3/ fio produced the following result for the aggregated bandwidth. If
>> I translate that number to Gbps, I get almost 3 Gbps, which is impossible
>> on a 1 GbE link.
>>
>> # grep aggrb
>> fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.out
>>   WRITE: io=1024.0GB, aggrb=403101KB/s, minb=3149KB/s, maxb=3154KB/s,
>> mint=2659304msec, maxt=2663699msec
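
One simple way to quantify that gap (a sketch, not from the original mail; eth0 as in the statistics above) is to snapshot the interface counters around the run and diff them, so fio's reported 1024 GB of writes can be compared with what actually left the host:

before=$(cat /sys/class/net/eth0/statistics/tx_bytes)
# ... run the fio job here ...
after=$(cat /sys/class/net/eth0/statistics/tx_bytes)
echo "bytes sent during the run: $((after - before))"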
>>
>> 3.4 This is the network statistics immediately after the test
>>
>> * Printing network statistics:
>> * /sys/class/net/eth0/statistics/collisions: 0
>> * /sys/class/net/eth0/statistics/multicast: 454539
>> * /sys/class/net/eth0/statistics/rx_bytes: 440300506875
>> * /sys/class/net/eth0/statistics/rx_compressed: 0
>> * /sys/class/net/eth0/statistics/rx_crc_errors: 0
>> * /sys/class/net/eth0/statistics/rx_dropped: 0
>> *

Re: [ceph-users] ceph-hammer and debian jessie - missing files on repository

2015-10-22 Thread Björn Lässig

On 10/21/2015 08:50 PM, Alfredo Deza wrote:

This shouldn't be a problem, would you mind trying again? I just
managed to install on Debian Jessie without problems


After syncing our mirror via IPv4 again (IPv6 is still broken), we have
the missing bpo80 packages for debian-jessie!


Thanks for your help

Björn Lässig

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-22 Thread Wido den Hollander
On 10/21/2015 03:30 PM, Mark Nelson wrote:
> 
> 
> On 10/21/2015 01:59 AM, Wido den Hollander wrote:
>> On 10/20/2015 07:44 PM, Mark Nelson wrote:
>>> On 10/20/2015 09:00 AM, Wido den Hollander wrote:
 Hi,

 In the "newstore direction" thread on ceph-devel I wrote that I'm using
 bcache in production and Mark Nelson asked me to share some details.

 Bcache is running in two clusters now that I manage, but I'll keep this
 information to one of them (the one at PCextreme behind CloudStack).

 This cluster has been running for over 2 years now:

 epoch 284353
 fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
 created 2013-09-23 11:06:11.819520
 modified 2015-10-20 15:27:48.734213

 The system consists of 39 hosts:

 2U SuperMicro chassis:
 * 80GB Intel SSD for OS
 * 240GB Intel S3700 SSD for Journaling + Bcache
 * 6x 3TB disk

 This isn't the newest hardware. The next batch of hardware will have more
 disks per chassis, but this is it for now.

 All systems were installed with Ubuntu 12.04, but they are all running
 14.04 now with bcache.

 The Intel S3700 SSD is partitioned with a GPT label:
 - 5GB Journal for each OSD
 - 200GB Partition for bcache
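
A minimal sketch of how such a layout is typically wired together with bcache (hypothetical device names; the cache-set UUID comes from the make-bcache output or bcache-super-show, and the exact registration steps depend on your bcache-tools/udev setup; XFS for the OSD filesystem is an assumption):

make-bcache -C /dev/sda7        # the 200GB SSD partition becomes the cache set (prints its cset UUID)
make-bcache -B /dev/sdc         # each 3TB disk becomes a backing device, exposed as /dev/bcacheN
echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # attach the backing device to the cache set
mkfs.xfs /dev/bcache0           # create the OSD filesystem on top of the cached device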

 root@ceph11:~# df -h|grep osd
 /dev/bcache02.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
 /dev/bcache12.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
 /dev/bcache22.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
 /dev/bcache32.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
 /dev/bcache42.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
 /dev/bcache52.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
 root@ceph11:~#

 root@ceph11:~# lsb_release -a
 No LSB modules are available.
 Distributor ID:Ubuntu
 Description:Ubuntu 14.04.3 LTS
 Release:14.04
 Codename:trusty
 root@ceph11:~# uname -r
 3.19.0-30-generic
 root@ceph11:~#

 "apply_latency": {
   "avgcount": 2985023,
   "sum": 226219.891559000
 }

 What did we notice?
 - Fewer spikes on the disks
 - Lower commit latencies on the OSDs
 - Almost no 'slow requests' during backfills
 - Cache-hit ratio of about 60%
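
The hit ratio can be read back from sysfs per bcache device, for example (sketch; exact paths may vary by bcache version):

cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio   # lifetime hit ratio, in percent
cat /sys/block/bcache0/bcache/stats_day/cache_hit_ratio     # hit ratio over the last day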

 Max backfills and recovery active are both set to 1 on all OSDs.
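
In ceph.conf terms that looks roughly like this (option names as used by current releases; verify against your version):

[osd]
osd_max_backfills = 1
osd_recovery_max_active = 1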

 For the next generation of hardware we are looking into using 3U chassis
 with 16 4TB SATA drives and a 1.2TB NVMe SSD for bcache, but we haven't
 tested those yet, so there is nothing to say about it.

 The current setup is 200GB of cache for 18TB of disks. The new setup
 will be 1200GB for 64TB, so we're curious to see what that does.

 Our main conclusion, however, is that it does smooth the I/O pattern
 towards the disks, and that gives an overall better response from the
 disks.
>>>
>>> Hi Wido, thanks for the big writeup!  Did you guys happen to do any
>>> benchmarking?  I think Xiaoxi looked at flashcache a while back but had
>>> mixed results if I remember right.  It would be interesting to know how
>>> bcache is affecting performance in different scenarios.
>>>
>>
>> No, we didn't do any benchmarking. Initially this cluster was built for
>> just the RADOS Gateway, so we went for 2Gbit (2x 1Gbit) per machine. 90%
>> is still Gbit networking and we are in the process of upgrading it all
>> to 10Gbit.
>>
>> Since the 1Gbit network latency is about 4 times higher than 10Gbit, we
>> aren't really benchmarking the cluster.
>>
>> What counts for us most is that we can do recovery operations without
>> any slow requests.
>>
>> Before bcache we saw disks spike to 100% busy while a backfill was running.
>> Now bcache smooths this out and we see peaks of maybe 70%, but that's it.
> 
> In the testing I was doing to figure out our new lab hardware, I was
> seeing SSDs handle recovery dramatically better than spinning disks as
> well during ceph_test_rados runs.  It might be worth digging in to see
> what the IO patterns look like.  In the meantime though, it's very
> interesting that bcache helps in this case so much.  Good to know!
> 

Keep in mind that CentOS 7.1 doesn't support bcache natively in the kernel.

It would be nice if RHEL/CentOS also supported bcache.

Wido

>>
>>> Thanks,
>>> Mark
>>>

 Wido

>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___