Re: [ceph-users] Giant or Firefly for production
On Fri, Dec 5, 2014 at 2:24 AM, Anthony Alba <ascanio.al...@gmail.com> wrote:
> Hi Cephers,
> Have any of you decided to put Giant into production instead of Firefly?

This is very interesting to me too: we are going to deploy a large Ceph cluster on Ubuntu 14.04 LTS, and so far what I have found is that the rbd kernel module in Ubuntu Trusty doesn't seem to be compatible with Giant:

    feature set mismatch, my 4a042a42 < server's 2104a042a42, missing 210

I tried different ceph osd tunables, but nothing seems to fix the issue. However, this cluster will be used mainly for OpenStack, and qemu is able to access the rbd volumes, so this might not be a big problem for me.

.a.

--
antonio.s.mess...@gmail.com
antonio.mess...@uzh.ch          +41 (0)44 635 42 22
S3IT: Service and Support for Science IT, http://www.s3it.uzh.ch/
University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland
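For readers hitting the same error: the hex values are CEPH_FEATURE_* bitmasks, and the "missing" bits name the features the kernel client lacks. One possible mitigation (a sketch, not a recommendation for every cluster) is to fall back to an older tunables profile that the shipped kernel understands; note this triggers data movement, so run it during a quiet period:

    # Revert CRUSH tunables to a profile older kernel clients support.
    # This will cause rebalancing while PGs are remapped.
    ceph osd crush tunables bobtail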
[ceph-users] weird 'ceph-deploy disk list nodename' command output, Invalid partition data
hi, all

When I run the command 'ceph-deploy disk list nodename', there are warning messages indicating a partition table error, but the ceph cluster is working normally. What is the problem? Should I run the sgdisk command to repair the partition table?

Below are the warning messages:

ceph@controller-11:~$ ceph-deploy disk list c13
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.5): /usr/bin/ceph-deploy disk list c13
[c13][DEBUG ] connected to host: c13
[c13][DEBUG ] detect platform information from remote host
[c13][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: Ubuntu 13.04 raring
[ceph_deploy.osd][DEBUG ] Listing disks on c13...
[c13][INFO ] Running command: sudo ceph-disk list
[c13][DEBUG ] /dev/sda :
[c13][DEBUG ]  /dev/sda1 other, ext4, mounted on /
[c13][DEBUG ]  /dev/sda2 other, ext3
[c13][DEBUG ] /dev/sdb :
[c13][DEBUG ]  /dev/sdb1 ceph journal, for /dev/sdc1
[c13][DEBUG ]  /dev/sdb2 ceph journal, for /dev/sdd1
[c13][DEBUG ]  /dev/sdb3 ceph journal, for /dev/sde1
[c13][DEBUG ]  /dev/sdb4 ceph journal, for /dev/sdf1
[c13][DEBUG ] /dev/sdc :
[c13][DEBUG ]  /dev/sdc1 ceph data, active, cluster ceph, osd.1, journal /dev/sdb1
[c13][DEBUG ] /dev/sdd :
[c13][DEBUG ]  /dev/sdd1 ceph data, active, cluster ceph, osd.6, journal /dev/sdb2
[c13][DEBUG ] /dev/sde :
[c13][DEBUG ]  /dev/sde1 ceph data, active, cluster ceph, osd.7, journal /dev/sdb3
[c13][DEBUG ] /dev/sdf :
[c13][DEBUG ]  /dev/sdf1 ceph data, active, cluster ceph, osd.8, journal /dev/sdb4
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 1 /dev/sda
[c13][WARNIN] Caution: invalid main GPT header, but valid backup; regenerating main header
[c13][WARNIN] from backup!
[c13][WARNIN]
[c13][WARNIN] Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
[c13][WARNIN] on the recovery & transformation menu to examine the two tables.
[c13][WARNIN]
[c13][WARNIN] Warning! One or more CRCs don't match. You should repair the disk!
[c13][WARNIN]
[c13][WARNIN] Invalid partition data!
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sda
[c13][WARNIN] Caution: invalid main GPT header, but valid backup; regenerating main header
[c13][WARNIN] from backup!
[c13][WARNIN]
[c13][WARNIN] Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
[c13][WARNIN] on the recovery & transformation menu to examine the two tables.
[c13][WARNIN]
[c13][WARNIN] Warning! One or more CRCs don't match. You should repair the disk!
[c13][WARNIN]
[c13][WARNIN] Invalid partition data!
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 2 /dev/sda
[c13][WARNIN] Caution: invalid main GPT header, but valid backup; regenerating main header
[c13][WARNIN] from backup!
[c13][WARNIN]
[c13][WARNIN] Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
[c13][WARNIN] on the recovery & transformation menu to examine the two tables.
[c13][WARNIN]
[c13][WARNIN] Warning! One or more CRCs don't match. You should repair the disk!
[c13][WARNIN]
[c13][WARNIN] Invalid partition data!
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sda
[c13][WARNIN] Caution: invalid main GPT header, but valid backup; regenerating main header
[c13][WARNIN] from backup!
[c13][WARNIN]
[c13][WARNIN] Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
[c13][WARNIN] on the recovery & transformation menu to examine the two tables.
[c13][WARNIN]
[c13][WARNIN] Warning! One or more CRCs don't match. You should repair the disk!
[c13][WARNIN]
[c13][WARNIN] Invalid partition data!
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 1 /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 2 /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 3 /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 4 /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 1 /dev/sdc
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdc
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/blkid -s TYPE /dev/sdc1
[c13][WARNIN] INFO:ceph-disk:Running command: /bin/mount -t xfs -o -- /dev/sdc1 /var/lib/ceph/tmp/mnt.bNqfD1
[c13][WARNIN] INFO:ceph-disk:Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.bNqfD1
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 1 /dev/sdd
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdd
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/blkid -s TYPE /dev/sdd1
[c13][WARNIN] INFO:ceph-disk:Running command: /bin/mount -t xfs -o -- /dev/sdd1 /var/lib/ceph/tmp/mnt.k913Cm
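These sgdisk warnings on /dev/sda usually mean the disk carries an MBR partition table (consistent with the ext4/ext3 partitions above) plus a stale GPT backup header left over at the end of the disk, so sgdisk complains even though nothing Ceph uses is broken. A cautious sketch for investigating before repairing anything (device name taken from the log above; zapping is destructive, so verify first):

    # Inspect the partition table non-destructively first
    parted /dev/sda unit s print

    # Only if you have confirmed the disk is genuinely MBR and the GPT
    # remnants are stale: sgdisk --zap removes the GPT structures.
    # Double-check the device name; this destroys any real GPT data.
    sgdisk --zap /dev/sda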
Re: [ceph-users] Giant osd problems - loss of IO
Jake, very useful indeed. It looks like I had a similar problem regarding the heartbeat and, as you have mentioned, I've not seen such issues on Firefly. However, I've not seen any osd crashes.

Could you please let me know where you got the sysctl.conf tunings from? Was it recommended by the network vendor? Also, did you make similar sysctl.conf changes to your host servers?

A while ago I read the tuning guide for IP over InfiniBand, and Mellanox recommends setting something like this:

    net.ipv4.tcp_timestamps = 0
    net.ipv4.tcp_sack = 1
    net.core.netdev_max_backlog = 25
    net.core.rmem_max = 4194304
    net.core.wmem_max = 4194304
    net.core.rmem_default = 4194304
    net.core.wmem_default = 4194304
    net.core.optmem_max = 4194304
    net.ipv4.tcp_rmem = 4096 87380 4194304
    net.ipv4.tcp_wmem = 4096 65536 4194304
    net.ipv4.tcp_mem = 4194304 4194304 4194304
    net.ipv4.tcp_low_latency = 1

which is what I have. Not sure if these are optimal. I can see that the values are pretty conservative compared to yours. I guess my values should be different as I am running a 40gbit/s network with IPoIB. The actual throughput on IPoIB is about 20gbit/s according to iperf and the like.

Andrei

----- Original Message -----
From: Jake Young <jak3...@gmail.com>
To: Andrei Mikhailovsky <and...@arhont.com>
Cc: ceph-users@lists.ceph.com
Sent: Thursday, 4 December, 2014 4:57:47 PM
Subject: Re: [ceph-users] Giant osd problems - loss of IO

On Fri, Nov 14, 2014 at 4:38 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> Any other suggestions why several osds are going down on Giant and causing IO to stall? This was not happening on Firefly. Thanks

I had a very similar problem to yours, which started after upgrading from Firefly to Giant; later I added two new osd nodes, with 7 osds on each.

My cluster originally had 4 nodes, with 7 osds on each node, 28 osds total, running Giant. I did not have any problems at this time. My problems started after adding two new nodes, so I had 6 nodes and 42 total osds. It would run fine on low load, but when the request load increased, osds started to fall over.

I was able to set debug_ms to 10 and capture the logs from a failed OSD. There were a few different reasons the osds were going down. This example shows one terminating normally for an unspecified reason a minute after it notices it is marked down in the map. Osd 25 actually marks this osd (osd 35) down. For some reason many osds cannot communicate with each other. There are other examples where I see the "heartbeat_check: no reply from osd.blah" message for long periods of time (hours) and neither osd crashes or terminates.

2014-12-01 16:27:06.772616 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01 16:26:46.772608)
2014-12-01 16:27:07.772767 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01 16:26:47.772759)
2014-12-01 16:27:08.772990 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01 16:26:48.772982)
2014-12-01 16:27:09.559894 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01 16:26:49.559891)
2014-12-01 16:27:09.773177 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01 16:26:49.773173)
2014-12-01 16:27:10.773307 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01 16:26:50.773299)
2014-12-01 16:27:11.261557 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01 16:26:51.261554)
2014-12-01 16:27:11.773512 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01 16:26:51.773504)
2014-12-01 16:27:12.773741 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01 16:26:52.773733)
2014-12-01 16:27:13.773884 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01 16:26:53.773876)
2014-12-01 16:27:14.163369 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01 16:26:54.163366)
2014-12-01 16:27:14.507632 7f8b4fb7f700 0
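If the heartbeat failures come from transient network or load spikes rather than dead daemons, one knob worth knowing about is the heartbeat grace period. A minimal ceph.conf sketch (the value is illustrative, not a recommendation):

    [osd]
        # Allow more time before a peer is reported down (default is 20s)
        osd heartbeat grace = 30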
[ceph-users] AWS SDK and MultiPart Problem
Hi all!

I am using AWS SDK JS v2.0.29 to perform a multipart upload into radosgw with ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3), and I am getting a 403 error.

I believe that the upload ID, which is sent with all requests and has been urlencoded by aws-sdk-js, doesn't match the one stored by radosgw, because the stored one is not urlencoded. Is that the case? Can you confirm it? Is there something I can do?

Regards,
George
Re: [ceph-users] AWS SDK and MultiPart Problem
For example, if I try to perform the same multipart upload on an older version, ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), I can see the upload ID in the apache log as:

    PUT /test/.dat?partNumber=25&uploadId=I3yihBFZmHx9CCqtcDjr8d-RhgfX8NW HTTP/1.1 200 - - aws-sdk-nodejs/2.0.29 linux/v0.10.33

but when I try the same on ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) I get the following:

    PUT /test/.dat?partNumber=12&uploadId=2%2Ff9UgnHhdK0VCnMlpT-XA8ttia1HjK36 HTTP/1.1 403 78 - aws-sdk-nodejs/2.0.29 linux/v0.10.33

My guess is that the %2F in the latter is what is causing the problem, and hence the 403 error. What do you think?

Best,
George
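The guess is easy to verify: %2F is just a percent-encoded '/'. For example:

    $ echo '2%2Ff9UgnHhdK0VCnMlpT-XA8ttia1HjK36' | sed 's/%2F/\//g'
    2/f9UgnHhdK0VCnMlpT-XA8ttia1HjK36

So the 0.80.7 gateway is handing out upload IDs that contain a '/', and a mismatch between the encoded and raw forms would be consistent with the 403.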
[ceph-users] Chinese translation of Ceph Documentation
Hi,

I have migrated my Chinese translation from PDF to the Ceph official doc build system. Just replace doc/ in the ceph repository with this repo and the build should work. The official doc build guide should work for this too.

Old PDF: https://github.com/drunkard/docs_zh
New: https://github.com/drunkard/ceph-doc-zh_CN
The html output: https://github.com/drunkard/docs_zh/tree/master/output/html

There are a lot of changes to sync with mainline since my last update; I will do it in my spare time :)

There are some issues to resolve:
1. The build system doesn't support python3 yet, so if you are using python3 as the default interpreter, you should switch to python2 temporarily.
2. Building of the man pages will fail; changes are needed to support non-ascii encoding. The HTML version is fine.

Anyway, I'm improving it ;) Any help is welcome!
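For anyone wanting to try the translated tree, the build would go roughly like this (a sketch assuming the standard ceph.git layout, where admin/build-doc is the documentation build entry point):

    git clone https://github.com/ceph/ceph.git
    cd ceph
    # replace doc/ with the translated tree as described above, then:
    ./admin/build-doc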
Re: [ceph-users] AWS SDK and MultiPart Problem
It would be nice to see where and how the uploadId is being calculated...

Thanks,
George
[ceph-users] Erasure Encoding Chunks
Hi All,

Does anybody have any input on the best ratio and total number of data + coding chunks you would choose?

For example, I could create a pool with 7 data chunks and 3 coding chunks and get an efficiency of 70%, or I could create a pool with 17 data chunks and 3 coding chunks and get an efficiency of 85%, with a similar probability of protecting against OSD failure.

What's the reason I would choose 10 total chunks over 20 chunks? Is it purely down to the overhead of having potentially double the number of chunks per object?

Nick
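For concreteness, the two layouts being compared can be expressed as erasure-code profiles; efficiency is k/(k+m). A sketch (profile and pool names, PG count, and failure domain are placeholders):

    # 7 data + 3 coding chunks: 7/10 = 70% usable capacity
    ceph osd erasure-code-profile set ec7p3 k=7 m=3 ruleset-failure-domain=host

    # 17 data + 3 coding chunks: 17/20 = 85% usable capacity
    ceph osd erasure-code-profile set ec17p3 k=17 m=3 ruleset-failure-domain=host

    # create a pool using one of the profiles
    ceph osd pool create ecpool 4096 4096 erasure ec7p3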
Re: [ceph-users] Giant or Firefly for production
This is probably due to the kernel RBD client not being recent enough. Have you tried upgrading your kernel to a newer version? 3.16 should contain all the relevant features required by Giant.

Nick
Re: [ceph-users] Giant or Firefly for production
What are the kernel versions involved? We have Ubuntu precise clients talking to an Ubuntu trusty cluster without issues - with tunables optimal. 0.88 (Giant) and 0.89 have been working well for us as far as the client and OpenStack are concerned.

This link provides some insight as to the possible problems:
http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client

Things to look for:
- Kernel versions
- Cache tiering
- Tunables
- hashpspool

--
David Moreau Simard
Re: [ceph-users] Virtual machines using RBD remount read-only on OSD slow requests
Hi,

I recently e-mailed ceph-users about a problem with virtual machine RBD disks remounting read-only because of OSD slow requests [1]. I just wanted to report that although I'm still seeing OSDs from one particular machine going down sometimes (probably some hardware problem on that node), the virtual machines haven't been remounting their disks read-only.

I can't be sure of the cause, because I didn't do any controlled tests (or any tests at all), but one thing I changed was osd_recovery_op_priority, from the default 10 to 5. I had seen some suggestions on this list regarding that parameter and they may well have been useful in my case.

Cheers,
Paulo

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044887.html
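For reference, the change described can be made persistent in ceph.conf or injected into running OSDs without a restart; a sketch:

    # ceph.conf, [osd] section
    osd recovery op priority = 5

    # or apply to all running OSDs at runtime
    ceph tell osd.* injectargs '--osd-recovery-op-priority 5'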
Re: [ceph-users] Giant or Firefly for production
On Fri, Dec 5, 2014 at 4:25 PM, David Moreau Simard <dmsim...@iweb.com> wrote:
> What are the kernel versions involved? We have Ubuntu precise clients talking to an Ubuntu trusty cluster without issues - with tunables optimal.

Both servers and clients are Ubuntu Trusty. Kernel versions are a bit different:

client: 3.13.0-39-generic #66
server: 3.13.0-32-generic #57
ceph version on both: 0.87

> Things to look for:
> - Kernel versions
> - Cache tiering
> - Tunables
> - hashpspool

I have already read the blog post, but I don't have much experience with tunables. From what I understood, I am missing:

* CEPH_FEATURE_CRUSH_TUNABLES3
* CEPH_FEATURE_CRUSH_V2

but I don't know how to disable them, and I can't see them set in the crushmap I get from ceph osd getcrushmap.

.a.
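For what it's worth, the usual way to inspect and change tunables by hand is to round-trip the CRUSH map through crushtool; a sketch:

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # edit the 'tunable ...' lines at the top of crush.txt, then:
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new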
Re: [ceph-users] Giant or Firefly for production
On Fri, Dec 5, 2014 at 4:25 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> This is probably due to the kernel RBD client not being recent enough. Have you tried upgrading your kernel to a newer version? 3.16 should contain all the relevant features required by Giant.

I would rather tune the tunables, as upgrading the kernel would require a reboot of the client. Besides, Ubuntu Trusty does not provide a 3.16 kernel, so I would need to recompile...

.a.
Re: [ceph-users] Giant or Firefly for production
http://kernel.ubuntu.com/~kernel-ppa/mainline/

I'm running 3.17 on my trusty clients without issue.
Re: [ceph-users] Giant or Firefly for production
Ok sorry, I thought you had a need for some of the features in Giant; using tunables is probably easier in that case.

However, if you do want to upgrade, there are debs available:
http://kernel.ubuntu.com/~kernel-ppa/mainline/

and I believe 3.16 should be available in the 14.04.2 release, which should come out early next year.

Nick
Re: [ceph-users] Giant or Firefly for production
Thank you James and Nick,

On Fri, Dec 5, 2014 at 4:46 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> Ok sorry, I thought you had a need for some of the features in Giant; using tunables is probably easier in that case.

I'm not sure :) I never played with the tunables before (still running a testbed only). I will test it again with 14.04.2 and the default kernel at the beginning of next year. I prefer to use the official kernel for the production cluster, but since it's going to be deployed Q1-Q2 next year I should be safe.

.a.
Re: [ceph-users] Giant or Firefly for production
On Fri, 5 Dec 2014, Antonio Messina wrote:
> This is very interesting to me too: we are going to deploy a large ceph cluster on Ubuntu 14.04 LTS, and so far what I have found is that the rbd module in Ubuntu Trusty doesn't seem compatible with giant:
> feature set mismatch, my 4a042a42 < server's 2104a042a42, missing 210

Can you attach the output of

    ceph osd crush show-tunables -f json-pretty
    ceph osd crush dump -f json-pretty

Thanks!
sage
Re: [ceph-users] Poor RBD performance as LIO iSCSI target
I've flushed everything - data, pools, configs - and reconfigured the whole thing. I was particularly careful with the cache tiering configuration (almost leaving defaults where possible) and it's not locking anymore. It looks like the cache tiering configuration I had was causing the problem? I can't put my finger on exactly what/why, and I don't have the luxury of time to do this lengthy testing again.

Here's what I dumped as far as config goes before wiping:

# for var in size min_size pg_num pgp_num crush_ruleset erasure_code_profile; do ceph osd pool get volumes $var; done
size: 5
min_size: 2
pg_num: 7200
pgp_num: 7200
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type hit_set_period hit_set_count target_max_objects target_max_bytes cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 7200
pgp_num: 7200
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 1000
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 600
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van

And now:

# for var in size min_size pg_num pgp_num crush_ruleset erasure_code_profile; do ceph osd pool get volumes $var; done
size: 5
min_size: 3
pg_num: 2048
pgp_num: 2048
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type hit_set_period hit_set_count target_max_objects target_max_bytes cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 2048
pgp_num: 2048
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 1500
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van

The crush map hasn't really changed before and after. FWIW, the benchmarks I pulled out of the setup: https://gist.github.com/dmsimard/2737832d077cfc5eff34

Definite overhead going from krbd to krbd + LIO...

--
David Moreau Simard

On Nov 20, 2014, at 4:14 PM, Nick Fisk <n...@fisk.me.uk> wrote:

Here you go:

Erasure Profile
k=2
m=1
plugin=jerasure
ruleset-failure-domain=osd
ruleset-root=hdd
technique=reed_sol_van

Cache Settings
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 10
cache_target_dirty_ratio: 0.4
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 0

Crush Dump
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph-test-hdd {
    id -5    # do not change unnecessarily
    # weight 2.730
    alg straw
    hash 0   # rjenkins1
    item osd.1 weight 0.910
    item osd.2 weight 0.910
    item osd.0 weight 0.910
}
root hdd {
    id -3    # do not change unnecessarily
    # weight 2.730
    alg straw
    hash 0   # rjenkins1
    item ceph-test-hdd weight 2.730
}
host ceph-test-ssd {
    id -6    # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0   # rjenkins1
    item osd.3 weight 1.000
}
root ssd {
    id -4    # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0   # rjenkins1
    item ceph-test-ssd weight 1.000
}

# rules
rule hdd {
    ruleset 0
    type replicated
    min_size 0
    max_size 10
    step take hdd
    step chooseleaf firstn 0 type osd
    step emit
}
rule ssd {
    ruleset 1
    type replicated
    min_size 0
    max_size 4
    step take ssd
    step chooseleaf firstn 0 type osd
    step emit
}
rule ecpool {
    ruleset 2
    type erasure
    min_size 3
    max_size 20
    step set_chooseleaf_tries 5
    step take hdd
    step chooseleaf indep 0 type osd
    step emit
}
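For context, a writeback cache tier like the volumecache pool described above is typically wired to its base pool along these lines (pool names taken from the settings above; a sketch, not the poster's exact steps):

    ceph osd tier add volumes volumecache
    ceph osd tier cache-mode volumecache writeback
    ceph osd tier set-overlay volumes volumecache
    ceph osd pool set volumecache hit_set_type bloom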
Re: [ceph-users] Giant or Firefly for production
On Fri, Dec 5, 2014 at 4:59 PM, Sage Weil <s...@newdream.net> wrote:
> Can you attach the output of
>     ceph osd crush show-tunables -f json-pretty
>     ceph osd crush dump -f json-pretty

I modified the crushmap and set:

    tunable chooseleaf_vary_r 0

(it was 1 before). Now the cluster is rebalancing, and since it's on crappy hardware it is taking some time. I'm pasting the output of the two commands, but please keep in mind that this is the output *after* I've updated the chooseleaf_vary_r tunable.

ceph osd crush show-tunables -f json-pretty

{ "choose_local_tries": 0,
  "choose_local_fallback_tries": 0,
  "choose_total_tries": 50,
  "chooseleaf_descend_once": 1,
  "profile": "bobtail",
  "optimal_tunables": 0,
  "legacy_tunables": 0,
  "require_feature_tunables": 1,
  "require_feature_tunables2": 1,
  "require_feature_tunables3": 0,
  "has_v2_rules": 1,
  "has_v3_rules": 0}

ceph osd crush dump -f json-pretty

I'm attaching it as a text file, as it is quite big and unreadable. However, the "tunables" section in its output is identical to the show-tunables output above.

.a.
Re: [ceph-users] Giant or Firefly for production
Hi all, just an update.

After setting chooseleaf_vary_r to 0 _and_ removing a pool with erasure coding, I was able to run rbd map.

Thank you all for the help.

.a.
Re: [ceph-users] Erasure Encoding Chunks
On 05/12/2014 16:21, Nick Fisk wrote:
> Does anybody have any input on the best ratio and total number of data + coding chunks you would choose?

Hi Nick,

Assuming you have a large number of OSDs (a thousand or more) with cold data, 20 is probably better. When you try to read the data it involves 20 OSDs instead of 10, but you probably don't care if reads are rare.

Disclaimer: I'm a developer, not an architect ;-) It would help to know the target use case, the size of the data set and the expected read/write rate.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] Giant or Firefly for production
On Fri, 5 Dec 2014, Antonio Messina wrote:
> ceph osd crush show-tunables -f json-pretty
> { ...
>   "require_feature_tunables3": 0,
>   "has_v2_rules": 1,
>   "has_v3_rules": 0}

The v2 rule means you have a crush rule for erasure coding. Do you have an EC pool in your cluster?

The tunables3 feature bit is set because you set the vary_r parameter.

If you want older kernels to talk to the cluster, you need to avoid the new tunables and features!

sage
Re: [ceph-users] Giant or Firefly for production
On Fri, Dec 5, 2014 at 5:24 PM, Sage Weil <s...@newdream.net> wrote:
> The v2 rule means you have a crush rule for erasure coding. Do you have an EC pool in your cluster?

Yes indeed. I didn't know an EC pool was incompatible with the current kernel; I only tested it with rados bench and VMs, I guess.

> The tunables3 feature bit is set because you set the vary_r parameter.

This I don't really know where it comes from. I think at a certain point I ran "ceph osd crush tunables optimal", and it probably added vary_r, but then I ran "ceph osd crush tunables firefly" and it didn't remove it... is that normal?

> If you want older kernels to talk to the cluster, you need to avoid the new tunables and features!

Well, as I said, I'm not a ceph expert; I didn't even know I had enabled features the kernel of the distribution did not support. I guess the problem is that I am using packages from the ceph.com repo, while the kernel comes from Ubuntu. However, it's at least curious that when I was running Firefly from the Ubuntu repositories I could create an EC pool, even though the kernel was not compatible with EC pools...

.a.
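For anyone needing to do the same cleanup, removing an EC pool and the CRUSH rule it left behind looks roughly like this (pool and rule names are placeholders; the delete is irreversible):

    ceph osd pool delete ecpool ecpool --yes-i-really-really-mean-it
    ceph osd crush rule ls
    ceph osd crush rule rm <ec-rule-name>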
Re: [ceph-users] Virtual machines using RBD remount read-only on OSD slow requests
I hope you can provide more runtime info, like logs.

On Fri, Dec 5, 2014 at 11:32 PM, Paulo Almeida <palme...@igc.gulbenkian.pt> wrote:
> I recently e-mailed ceph-users about a problem with virtual machine RBD disks remounting read-only because of OSD slow requests [1].

--
Best Regards,
Wheat
Re: [ceph-users] Erasure Encoding Chunks
On 05/12/2014 17:41, Nick Fisk wrote:
> Hi Loic,
>
> Thanks for your response. This cluster will be for our VM replica storage in our secondary site. Initially we are planning to have a 40 disk EC pool sitting behind a cache pool of around 1TB post-replica size. This storage will be presented as RBDs and then exported as a HA iSCSI target to ESX hosts. The VMs will be replicated from our primary site via a software product called Veeam.
>
> I'm hoping that the 1TB cache layer should be big enough to hold most of the hot data, meaning that the EC pool shouldn't see a large amount of IO, just the trickle of the cache layer flushing back to disk.
>
> We can switch back to a 3-way replica pool if the EC pool doesn't work out for us, but we are interested in testing out the EC technology.
>
> I hope that provides an insight into what I am trying to achieve.

When an erasure coded object has to be promoted back to the replicated pool, you want that to happen as fast as possible. The read will return when all 6 OSDs give their data chunk to the primary OSD (holding the 7th chunk). The 6 reads happen in parallel and will complete when the slowest OSD returns.

If you have 16 OSDs instead of 6, you increase the odds of slowing the whole read down because one of them is significantly slower than the others. If you have 40 OSDs you probably won't have a sophisticated monitoring system detecting hard drive misbehavior, so a slow disk could go unnoticed and degrade your performance significantly, because more than a third of the objects use it (each object uses 20 OSDs total, 17 of which hold data you need to promote to the replicated pool). If you had over 1000 OSDs, you would probably monitor the hard drives accurately, detect slow OSDs sooner and move them out of the cluster, and only a fraction of the objects would be impacted by a slow OSD.

I would love to hear what an architect would advise.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] AWS SDK and MultiPart Problem
It looks like a bug. Can you open an issue on tracker.ceph.com, describing what you see?

Thanks,
Yehuda

On Fri, Dec 5, 2014 at 7:17 AM, Georgios Dimitrakakis <gior...@acmac.uoc.gr> wrote:
> It would be nice to see where and how the uploadId is being calculated...
[ceph-users] experimental features
A while back we merged Haomai's experimental OSD backend KeyValueStore. We named the config option 'keyvaluestore_dev', hoping to make it clear to users that it was still under development, not fully tested, and not yet ready for production. In retrospect, I don't think '_dev' was sufficiently scary, because many users tried it and ran into unexpected trouble.

There are several other features we've recently added or are considering adding that fall into this category. Having them in the tree is great because it streamlines QA and testing, but I want to make sure that users are not able to enable the features without being aware of the risks. A few possible suggestions:

- scarier option names, like

    osd objectstore = keyvaluestore_experimental_danger_danger
    ms type = async_experimental_danger_danger
    ms type = xio_experimental_danger_danger

  Once the feature becomes stable, they'll have to adjust their config, or we'll need to support both names going forward.

- a separate config option that allows any experimental option

    allow experimental features danger danger = true
    osd objectstore = keyvaluestore
    ms type = xio

  This runs the risk that the user will enable experimental features to get X, and later start using Y without realizing Y is also experimental.

- enumerate experimental options we want to enable

    allow experimental features danger danger = keyvaluestore, xio
    ms type = xio
    osd objectstore = keyvaluestore

  This has the property that no config change is necessary when the feature drops its experimental status.

In all of these cases, we can also make a point of sending something to the log on daemon startup. I don't think too many people will notice this, but it is better than nothing.

Other ideas?

sage
Re: [ceph-users] experimental features
* On 05 Dec 2014, Sage Weil wrote:
> Having them in the tree is great because it streamlines QA and testing, but I want to make sure that users are not able to enable the features without being aware of the risks. A few possible suggestions:
> - scarier option names
> - a separate config option that allows any experimental option
> - enumerate experimental options we want to enable
> Other ideas?

A separate config file for experimental options:

    /etc/ceph/danger-danger.conf

--
David Champion • d...@uchicago.edu • University of Chicago
Re: [ceph-users] experimental features
On 12/05/2014 11:39 AM, Gregory Farnum wrote:
> I don't think these should even be going into release packages for users to work with. We can build them on the dev gitbuilders for QA and testing without them ever reaching the hands of users grabbing our production packages. ;)
> -Greg

I'm in favor of the "allow experimental features" option, but instead call it:

    ALLOW UNRECOVERABLE DATA CORRUPTING FEATURES

which makes things a little more explicit. With great power comes great responsibility.

Mark
Re: [ceph-users] experimental features
On 12/05/2014 11:47 AM, David Champion wrote:
> A separate config file for experimental options:
> /etc/ceph/danger-danger.conf

One of the questions I have here is: once you've enabled experimental features, should the cluster be considered experimental forever, even after the feature has become stable? Maybe some kind of subtle corruption has worked its way in that will take a while to manifest. It seems to me that if you've enabled experimental features on a cluster, all bets are off.

Having the features in a separate ceph.conf file would imply that you can just get rid of the danger.conf file and things are back to normal, but that's not really how it is, imho.

Mark
[ceph-users] Radosgw with SSL enabled
Hello - I have a rados gateway setup working with http. But when I enable SSL on the gateway node, I have trouble making successful swift requests over https:

root@hrados:~# swift -V 1.0 -A https://hrados1.ex.com/auth/v1.0 -U s3User:swiftUser -K 8fJfd6YW2poqhvBI+uUYJZE1uscnmrDncRXrkjHR list
[Errno bad handshake] [('SSL routines', 'SSL3_GET_SERVER_CERTIFICATE', 'certificate verify failed')]

The output of the curl command is as follows:

root@hrados:~# curl --insecure -X GET -i -H X-Auth-Key:8fJfd6YW2poqhvBI+uUYJZE1uscnmrDncRXrkjHR -H X-Auth-User:s3User:swiftUser https://hrados1.ex.com/auth/v1.0
HTTP/1.1 204 No Content
Date: Fri, 05 Dec 2014 17:53:58 GMT
Server: Apache/2.4.10 (Debian)
X-Storage-Url: https://hrados1.ex.com/swift/v1
X-Storage-Token: AUTH_rgwtk10007333557365723a737769667455736572961633914ab868f0b6428354483a6b08fc254e33b1283ed9f428c61436aa05c0f44069d8
X-Auth-Token: AUTH_rgwtk10007333557365723a737769667455736572961633914ab868f0b6428354483a6b08fc254e33b1283ed9f428c61436aa05c0f44069d8
Content-Type: application/json

Appreciate your help.
Thanks,
Lakshmi.
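The handshake failure is the swift client rejecting the gateway's certificate (self-signed, or signed by a CA the client does not trust) - note that curl only succeeds because of --insecure. Assuming a reasonably recent python-swiftclient, a sketch of the two usual approaches (the CA bundle path is a placeholder):

    # Testing only: skip certificate verification, like curl --insecure
    swift --insecure -V 1.0 -A https://hrados1.ex.com/auth/v1.0 \
        -U s3User:swiftUser -K 8fJfd6YW2poqhvBI+uUYJZE1uscnmrDncRXrkjHR list

    # Proper fix: point the client at a trusted CA bundle
    swift --os-cacert /etc/ssl/certs/hrados-ca.pem -V 1.0 \
        -A https://hrados1.ex.com/auth/v1.0 \
        -U s3User:swiftUser -K 8fJfd6YW2poqhvBI+uUYJZE1uscnmrDncRXrkjHR list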
Re: [ceph-users] experimental features
On Fri, Dec 5, 2014 at 9:36 AM, Sage Weil sw...@redhat.com wrote:
A while back we merged Haomai's experimental OSD backend KeyValueStore. We named the config option 'keyvaluestore_dev', hoping to make it clear to users that it was still under development, not fully tested, and not yet ready for production. In retrospect, I don't think '_dev' was sufficiently scary, because many users tried it and ran into unexpected trouble.

There are several other features we've recently added or are considering adding that fall into this category. Having them in the tree is great because it streamlines QA and testing, but I want to make sure that users are not able to enable the features without being aware of the risks. A few possible suggestions:

- scarier option names, like:
  osd objectstore = keyvaluestore_experimental_danger_danger
  ms type = async_experimental_danger_danger
  ms type = xio_experimental_danger_danger
  Once the feature becomes stable, users will have to adjust their config, or we'll need to support both names going forward.

- a separate config option that allows any experimental option:
  allow experimental features danger danger = true
  osd objectstore = keyvaluestore
  ms type = xio
  This runs the risk that the user will enable experimental features to get X, and later start using Y without realizing Y is also experimental.

- enumerate the experimental options we want to enable:
  allow experimental features danger danger = keyvaluestore, xio
  ms type = xio
  osd objectstore = keyvaluestore
  This has the property that no config change is necessary when the feature drops its experimental status.

In all of these cases, we can also make a point of sending something to the log on daemon startup. I don't think too many people will notice this, but it is better than nothing. Other ideas?

I don't think these should even be going into release packages for users to work with. We can build them on the dev gitbuilders for QA and testing without them ever reaching the hands of users grabbing our production packages. ;)
-Greg
Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)
On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer ch...@gol.com wrote:
Hello,
This morning I decided to reboot a storage node (Debian Jessie, thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals) after applying some changes. It came back up one OSD short; the last log lines before the reboot are:
---
2014-12-05 09:35:27.700330 7f87e789c700 2 -- 10.0.8.21:6823/29520 10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0) Success
2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289 288/288/288) [8,4,16] r=1 lpr=288 pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346 active] cancel_copy_ops
---
Quite obviously it didn't complete its shutdown, so unsurprisingly we get:
---
2014-12-05 09:37:40.278128 7f218a7037c0 1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 24: 1269312 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-12-05 09:37:40.278427 7f218a7037c0 -1 journal read_header error decoding journal header
2014-12-05 09:37:40.278479 7f218a7037c0 -1 filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal /var/lib/ceph/osd/ceph-4/journal: (22) Invalid argument
2014-12-05 09:37:40.776203 7f218a7037c0 -1 osd.4 0 OSD:init: unable to mount object store
2014-12-05 09:37:40.776223 7f218a7037c0 -1 ** ERROR: osd init failed: (22) Invalid argument
---
Thankfully this isn't production yet and I was eventually able to recover the OSD by re-creating the journal (ceph-osd -i 4 --mkjournal), but it leaves me with a rather bad taste in my mouth. So the pertinent questions would be:
1. What caused this? My bet is on the evil systemd just pulling the plug before the poor OSD had finished its shutdown job.
2. How to prevent it from happening again? Is there something the Ceph developers can do with regards to init scripts? Or is this something to be brought up with the Debian maintainer? Debian is transitioning from sysv-init to systemd (booo!) with Jessie, but the OSDs still have a sysvinit magic file in their top directory. Could this have an effect on things?
3. Is it really that easy to trash your OSDs? In case a storage node crashes, am I to expect most if not all OSDs, or at least their journals, to require manual loving?

So this can't happen. Being force-killed definitely can't kill the OSD's disk state; that's the whole point of the journaling. The error message indicates that the header written on disk is nonsense to the OSD, which means that the local filesystem or disk lost something somehow (assuming you haven't done something silly like downgrading the software version it's running) and doesn't know it (if there had been a read error the output would be different). I'd double-check your disk settings etc. just to be sure, and check for known issues with xfs on Jessie.
-Greg
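For reference, a sketch of the recovery Christian describes — assuming, as here, that the OSD data directory is intact and only the journal header is unreadable. Recreating the journal discards any journaled-but-unflushed writes, so it is a last resort, and on a replicated cluster the OSD should be allowed to peer and backfill afterwards (the init commands assume sysvinit-managed OSDs, as on this node):

---
# Stop the affected OSD if it is still trying to start
/etc/init.d/ceph stop osd.4

# Recreate the journal in place (raw partition or file, per ceph.conf);
# this is the command Christian used
ceph-osd -i 4 --mkjournal

# Start the OSD and let peering/recovery resynchronize it
/etc/init.d/ceph start osd.4
---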
Re: [ceph-users] experimental features
I prefer the third option (enumeration). I don't see a point where we would enable experimental features on our production clusters, but it would be nice to have the same bits and procedures between our dev/beta and production clusters.

On Fri, Dec 5, 2014 at 10:36 AM, Sage Weil sw...@redhat.com wrote:
A while back we merged Haomai's experimental OSD backend KeyValueStore. We named the config option 'keyvaluestore_dev', hoping to make it clear to users that it was still under development, not fully tested, and not yet ready for production. In retrospect, I don't think '_dev' was sufficiently scary, because many users tried it and ran into unexpected trouble.

There are several other features we've recently added or are considering adding that fall into this category. Having them in the tree is great because it streamlines QA and testing, but I want to make sure that users are not able to enable the features without being aware of the risks. A few possible suggestions:

- scarier option names, like:
  osd objectstore = keyvaluestore_experimental_danger_danger
  ms type = async_experimental_danger_danger
  ms type = xio_experimental_danger_danger
  Once the feature becomes stable, users will have to adjust their config, or we'll need to support both names going forward.

- a separate config option that allows any experimental option:
  allow experimental features danger danger = true
  osd objectstore = keyvaluestore
  ms type = xio
  This runs the risk that the user will enable experimental features to get X, and later start using Y without realizing Y is also experimental.

- enumerate the experimental options we want to enable:
  allow experimental features danger danger = keyvaluestore, xio
  ms type = xio
  osd objectstore = keyvaluestore
  This has the property that no config change is necessary when the feature drops its experimental status.

In all of these cases, we can also make a point of sending something to the log on daemon startup. I don't think too many people will notice this, but it is better than nothing. Other ideas?
sage
Re: [ceph-users] experimental features
On Sat, Dec 6, 2014 at 4:36 AM, Sage Weil sw...@redhat.com wrote:
- enumerate experimental options we want to enable ... This has the property that no config change is necessary when the feature drops its experimental status.

It keeps the risky options in one place too, so they are easier to spot.

In all of these cases, we can also make a point of sending something to the log on daemon startup. I don't think too many people will notice this, but it is better than nothing.

Perhaps change the cluster health status to FRAGILE? or AT_RISK?
Re: [ceph-users] Old OSDs on new host, treated as new?
Hi,
perhaps a stupid question, but why did you change the hostname? I haven't tried it, but I guess that if you boot the node with a new hostname, the old hostname is still in the crush map, but without any OSDs - because they are now under the new host. I don't know (I guess not) whether the degradation level stays at 5% if you delete the empty host from the crush map. I would simply use the same host configuration on a rebuilt host.

Udo

On 03.12.2014 05:06, Indra Pramana wrote:
Dear all,
We have a Ceph cluster with several nodes, each node containing 4-6 OSDs. We are running the OS off a USB drive to maximise the use of the drive bays for the OSDs, and so far everything is running fine. Occasionally, the OS on the USB drive fails, and we normally replace the drive with a pre-configured similar OS with Ceph installed, so when the new OS boots up, it automatically detects all the OSDs and starts them. It works fine without any issues.

However, the issue is in recovery. When one node goes down, all its OSDs go down and recovery starts moving the pg replicas on the affected OSDs to other available OSDs, leaving Ceph degraded by, say, 5%, which is expected. However, when we boot up the failed node with a new OS and bring the OSDs back up, more PGs are scheduled for backfilling and, instead of decreasing, the degradation level shoots back up to, for example, 10%, and on some occasions as high as 19%.

We have had the experience that when one node goes down, degradation reaches 5% and recovery starts, but when we manage to bring the node back up (still with the same OS), the degradation level drops below 1% and recovery completes faster. Why doesn't the same behaviour apply in the situation above? The OSD numbers are the same when the node boots up, and the crush map weight values are also the same. Only the hostname is different.

Any advice / suggestions? Looking forward to your reply, thank you.
Cheers.
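If the new hostname has indeed created a second host bucket in the CRUSH map, the extra rebalancing is CRUSH recomputing placement for OSDs that appear to have moved. A hedged sketch of how to confirm and clean this up — 'old-hostname' is illustrative:

---
# Look for an empty host bucket left over from the old hostname; the
# OSDs will have re-registered under the new name
ceph osd tree

# If an empty bucket remains, remove it from the CRUSH map so it no
# longer contributes to placement
ceph osd crush remove old-hostname
---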
Re: [ceph-users] cephfs survey results
Hi,
if I have a situation where each node in a cluster writes its own files to cephfs, is it safe to use multiple MDSes? I mean, is the problem with multiple MDSes related to nodes writing the same files?
thanks,
-lorieri

On Tue, Nov 4, 2014 at 9:47 PM, Shain Miley smi...@npr.org wrote:
+1 for fsck and snapshots; being able to have snapshot backups and protection against accidental deletion, etc. is something we are really looking forward to.
Thanks,
Shain

On 11/04/2014 04:02 AM, Sage Weil wrote:
On Tue, 4 Nov 2014, Blair Bethwaite wrote:
On 4 November 2014 01:50, Sage Weil s...@newdream.net wrote: In the Ceph session at the OpenStack summit someone asked what the CephFS survey results looked like.

Thanks Sage, that was me! Here's the link: https://www.surveymonkey.com/results/SM-L5JV7WXL/
In short, people want:
- fsck
- multimds
- snapshots
- quotas
TBH I'm a bit surprised by a couple of these and hope maybe you guys will apply a certain amount of filtering on this... fsck and quotas were there for me, but multimds and snapshots are what I'd consider icing features - they're nice to have but not on the critical path to using cephfs instead of e.g. nfs in a production setting. I'd have thought stuff like small-file performance and gateway support was much more relevant to uptake and a positive, pain-free UX. Interested to hear others' rationale here.

Yeah, I agree, and am taking the results with a grain of salt. I think the results are heavily influenced by the order they were originally listed in (I wish surveymonkey would randomize it for each person or something). fsck is a clear #1. Everybody wants multimds, but I think very few actually need it at this point. We'll be merging a soft quota patch shortly, and things like performance (adding the inline data support to the kernel client, for instance) will probably compete with getting snapshots working (as part of a larger subvolume infrastructure). That's my guess at least; for now, we're really focused on fsck and hard usability edges and haven't set priorities beyond that. We're definitely interested in hearing feedback on this strategy, and on peoples' experiences with giant so far...
sage

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649
Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)
On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:
On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer ch...@gol.com wrote:
Hello,
This morning I decided to reboot a storage node (Debian Jessie, thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals) after applying some changes. It came back up one OSD short; the last log lines before the reboot are:
---
2014-12-05 09:35:27.700330 7f87e789c700 2 -- 10.0.8.21:6823/29520 10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0) Success
2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289 288/288/288) [8,4,16] r=1 lpr=288 pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346 active] cancel_copy_ops
---
Quite obviously it didn't complete its shutdown, so unsurprisingly we get:
---
2014-12-05 09:37:40.278128 7f218a7037c0 1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 24: 1269312 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-12-05 09:37:40.278427 7f218a7037c0 -1 journal read_header error decoding journal header
2014-12-05 09:37:40.278479 7f218a7037c0 -1 filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal /var/lib/ceph/osd/ceph-4/journal: (22) Invalid argument
2014-12-05 09:37:40.776203 7f218a7037c0 -1 osd.4 0 OSD:init: unable to mount object store
2014-12-05 09:37:40.776223 7f218a7037c0 -1 ** ERROR: osd init failed: (22) Invalid argument
---
Thankfully this isn't production yet and I was eventually able to recover the OSD by re-creating the journal (ceph-osd -i 4 --mkjournal), but it leaves me with a rather bad taste in my mouth. So the pertinent questions would be:
1. What caused this? My bet is on the evil systemd just pulling the plug before the poor OSD had finished its shutdown job.
2. How to prevent it from happening again? Is there something the Ceph developers can do with regards to init scripts? Or is this something to be brought up with the Debian maintainer? Debian is transitioning from sysv-init to systemd (booo!) with Jessie, but the OSDs still have a sysvinit magic file in their top directory. Could this have an effect on things?
3. Is it really that easy to trash your OSDs? In case a storage node crashes, am I to expect most if not all OSDs, or at least their journals, to require manual loving?

So this can't happen.

Good thing you quoted that, as it clearly did. ^o^ Now the question of how exactly remains to be answered.

Being force-killed definitely can't kill the OSD's disk state; that's the whole point of the journaling.

The other OSDs got to the point where they logged "journal flush done"; this one didn't. Coincidence? I think not. I totally agree that the point of journaling is to prevent exactly this kind of situation, of course.

The error message indicates that the header written on disk is nonsense to the OSD, which means that the local filesystem or disk lost something somehow (assuming you haven't done something silly like downgrading the software version it's running) and doesn't know it (if there had been a read error the output would be different).

The journal is on an SSD, as stated. And before you ask, it's on an Intel DC S3700. This was created on 0.80.7 just a day before, so no version games.

I'd double-check your disk settings etc. just to be sure, and check for known issues with xfs on Jessie.

I'm using ext4, but that shouldn't be an issue here to begin with, as the journal is a raw SSD partition.
Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
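Given the suspicion that the init system killed the OSD before its journal flush completed, one hedged precaution for planned reboots is to take the journal out of the equation entirely: stop each OSD and flush its journal to the filestore, so startup no longer depends on replaying it. A minimal sketch, assuming sysvinit-managed OSDs as on this node (osd.4 is just the example from this thread):

---
# Stop the OSD cleanly, then flush its journal so every entry reaches
# the filestore before the node goes down
/etc/init.d/ceph stop osd.4
ceph-osd -i 4 --flush-journal

# Only reboot once every OSD on the node reports a clean flush
reboot
---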