[ceph-users] shared rbd ?
Is it possible to have a shared RBD, i.e. to build a shared-filesystem setup (like NFS) but on Ceph?

-- Regards, Zeeshan Ali Shah
System Administrator - PDC HPC, PhD researcher (IT security)
Kungliga Tekniska Hogskolan
+46 8 790 9115
http://www.pdc.kth.se/members/zashah
[ceph-users] Running instances on ceph with openstack
Has anyone tried running instances on Ceph, i.e. using Ceph as the backend for VM storage? How would you get live migration in that case, since every compute host will have its own RBD? The other option is to have a big RBD pool on the head node and share it over NFS to get a shared file system. Any ideas?

-- Regards, Zeeshan Ali Shah
System Administrator - PDC HPC, PhD researcher (IT security)
Kungliga Tekniska Hogskolan
+46 8 790 9115
http://www.pdc.kth.se/members/zashah
Re: [ceph-users] Running instances on ceph with openstack
Hello Ali Shah,

We are running VMs using OpenNebula with Ceph as the backend, so far with varying results: from time to time VMs freeze, probably panicking when the load on the Ceph storage gets too high during rebalance work. We are experimenting with --osd-max-backfills 1, but it hasn't solved the problem completely.

Cheers, Nico

Zeeshan Ali Shah [Tue, Dec 23, 2014 at 09:12:25AM +0100]:
> Has any one tried running instances over ceph i.e using ceph as backend for vm storage [...]

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
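For reference, this kind of recovery throttling is usually applied roughly as follows, both at runtime and persistently (a sketch; the values are illustrative starting points, not a tuned recommendation):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# to make it persistent, in ceph.conf on the OSD hosts:
# [osd]
#     osd max backfills = 1
#     osd recovery max active = 1
#     osd recovery op priority = 1

Lower values slow down rebalancing but leave more IOPS for client traffic, which is usually what you want when VMs are freezing during recovery.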
[ceph-users] Cluster unusable
Hi,

We use Ceph 0.80.7 for our IceHouse PoC: 3 MONs, 3 OSD nodes (ids 10, 11, 12) with 2 OSDs each, 1.5 TB of storage total, 4 pools for RBD, size=2, 512 PGs per pool.

Everything was fine until the middle of last week. Here is what happened:
- OSD node #12 passed away
- AFAICR, ceph recovered fine
- I installed a fresh new node #12 (which inadvertently erased its 2 attached OSDs) and used ceph-deploy to make the node and its 2 OSDs join the cluster
- it was looking okay, except that the weight for the 2 OSDs (osd.0 and osd.4) was a solid -3.052e-05
- I applied the workaround from http://tracker.ceph.com/issues/9998 : 'ceph osd crush reweight' on both OSDs
- ceph was then busy redistributing PGs on the 6 OSDs. This was on Friday evening
- on Monday morning (yesterday), ceph was still busy. Actually the two new OSDs were flapping ("map eX wrongly marked me down" every minute)
- I found the root cause was the firewall on node #12. I opened tcp ports 6789-6900 and this solved the flapping issue
- ceph kept on reorganising PGs and reached this unhealthy state:
  - 900 PGs stuck unclean
  - some 'requests are blocked > 32 sec'
  - the command 'rbd info images/image_id' hung
  - all tested VMs hung
- so I tried this: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/032929.html and removed the 2 new OSDs
- ceph again started rebalancing data, and things were looking better (VMs responding, although pretty slowly)
- but at the end, which is the current state, the cluster is back to an unhealthy state, and our PoC is stuck.

Fortunately, the PoC users are out for Christmas. I'm here until Wed 4pm UTC+1 and then back on Jan 5, so there are around 30 hours left for solving this PoC sev1 issue. I hope the community can help me find a solution before Christmas.

Here are the details (actual host and DC names not shown in these outputs).

[root@MON ~]# date; for im in $(rbd ls images); do echo $im; time rbd info images/$im; done
Tue Dec 23 06:53:15 GMT 2014
0dde9837-3e45-414d-a2c5-902adee0cfe9
no reply for 2 hours, still ongoing...

[root@MON ]# rbd ls images | head -5
0dde9837-3e45-414d-a2c5-902adee0cfe9
2b62a79c-bdbc-43dc-ad88-dfbfaa9d005e
3917346f-12b4-46b8-a5a1-04296ea0a826
4bde285b-28db-4bef-99d5-47ce07e2463d
7da30b4c-4547-4b4c-a96e-6a3528e03214
[root@MON ]#

[cloud-user@francois-vm2 ~]$ ls -lh /tmp/file
-rw-rw-r--. 1 cloud-user cloud-user 552M Dec 22 22:19 /tmp/file
[cloud-user@francois-vm2 ~]$ rm /tmp/file
no reply for 1 hour, still ongoing.
The RBD image used by that VM is 'volume-2e989ca0-b620-42ca-a16f-e218aea32000'.

[root@MON ~]# ceph -s
    cluster f0e3957f-1df5-4e55-baeb-0b2236ff6e03
     health HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; noscrub,nodeep-scrub flag(s) set
     monmap e6: 3 mons at {MON01=10.60.9.11:6789/0,MON06=10.60.9.16:6789/0,MON09=10.60.9.19:6789/0}, election epoch 1338, quorum 0,1,2 MON01,MON06,MON09
     osdmap e42050: 6 osds: 6 up, 6 in
            flags noscrub,nodeep-scrub
      pgmap v3290710: 2048 pgs, 4 pools, 301 GB data, 58987 objects
            600 GB used, 1031 GB / 1632 GB avail
                   2 inactive
                2045 active+clean
                   1 remapped+peering
  client io 818 B/s wr, 0 op/s

[root@MON ~]# ceph health detail
HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; 2 osds have slow requests; noscrub,nodeep-scrub flag(s) set
pg 5.a7 is stuck inactive for 54776.026394, current state inactive, last acting [2,1]
pg 5.ae is stuck inactive for 54774.738938, current state inactive, last acting [2,1]
pg 5.b3 is stuck inactive for 71579.365205, current state remapped+peering, last acting [1,0]
pg 5.a7 is stuck unclean for 299118.648789, current state inactive, last acting [2,1]
pg 5.ae is stuck unclean for 286227.592617, current state inactive, last acting [2,1]
pg 5.b3 is stuck unclean for 71579.365263, current state remapped+peering, last acting [1,0]
pg 5.b3 is remapped+peering, acting [1,0]
87 ops are blocked > 67108.9 sec
16 ops are blocked > 33554.4 sec
84 ops are blocked > 67108.9 sec on osd.1
16 ops are blocked > 33554.4 sec on osd.1
3 ops are blocked > 67108.9 sec on osd.2
2 osds have slow requests
noscrub,nodeep-scrub flag(s) set

[root@MON]# ceph osd tree
# id    weight  type name               up/down reweight
-1      1.08    root default
-5      0.54        datacenter dc_TWO
-2      0.54            host node10
1       0.27                osd.1       up      1
5       0.27                osd.5       up      1
-4      0               host node12
-6      0.54        datacenter dc_ONE
-3      0.54            host node11
2       0.27                osd.2       up      1
3       0.27                osd.3       up      1
0       0           osd.0               up      1
4       0           osd.4               up      1

(I'm concerned about the above two ghost OSDs, osd.0 and osd.4...)
Re: [ceph-users] shared rbd ?
On 12/23/2014 09:13 AM, Zeeshan Ali Shah wrote: Is it possible to have shared RBD ? to form a shared NFS kind of system but on ceph ? Yes, you can use OCFS2 or GFS on top of RBD. But you also might want to look at using CephFS with version 0.90 or the upcoming hammer. In my recent tests I found that CephFS is fairly stable when using a Active/Standby MDS. I won't say that it's 100% production ready, but I would suggest you try it. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
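To make the shared-block option concrete, here is a minimal sketch of the OCFS2-on-RBD approach (assuming the kernel RBD client on every host and an already-configured OCFS2/o2cb cluster; pool, image and mount point names are made up):

rbd create sharedpool/sharedimg --size 102400            # 100 GB image
rbd map sharedpool/sharedimg                             # run on every host that will mount it
mkfs.ocfs2 -L sharedfs /dev/rbd/sharedpool/sharedimg     # run once, on one host only
mount -t ocfs2 /dev/rbd/sharedpool/sharedimg /mnt/shared # run on every host

The cluster filesystem (OCFS2 or GFS2) is what makes concurrent access safe; mounting a plain ext4/xfs RBD on several hosts at once will corrupt it.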
Re: [ceph-users] Cluster unusable
Hi François,

Could you paste somewhere the output of 'ceph report' so we can check the pg dump? (It's probably going to be a little too big for the mailing list.)

You can bring osd.0 and osd.4 back into the host to which they belong (instead of being at the root of the crush map) with 'crush set': http://ceph.com/docs/master/rados/operations/crush-map/#add-move-an-osd They won't be used by the ruleset 0 because they are not under the default bucket. To make sure this happens automagically, you may consider using osd_crush_update_on_start=true:
http://ceph.com/docs/master/rados/operations/crush-map/#ceph-crush-location-hook
http://workbench.dachary.org/ceph/ceph/blob/firefly/src/upstart/ceph-osd.conf#L18

Cheers

On 23/12/2014 09:56, Francois Petit wrote: [...]
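For completeness, the 'crush set' fix would look roughly like this (a sketch; the weight and bucket names are read off the osd tree above and should be adapted to your map):

ceph osd crush set osd.0 0.27 root=default datacenter=dc_TWO host=node12
ceph osd crush set osd.4 0.27 root=default datacenter=dc_TWO host=node12

# and to have OSDs place themselves at start-up, in ceph.conf:
# [osd]
#     osd crush update on start = true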
Re: [ceph-users] Cluster unusable
Hi Loïc,

Thanks. I am trying to find where I can make the report available to you.

[root@qvitblhat06 ~]# ceph report > /tmp/ceph_report
report 3298035134
[root@qvitblhat06 ~]# ls -lh /tmp/ceph_report
-rw-r--r--. 1 root root 4.7M Dec 23 10:38 /tmp/ceph_report
[root@qvitblhat06 ~]#

(Sorry guys for the unwanted ad that was sent in my first email...)

Francois
Re: [ceph-users] Cluster unusable
Here you go:http://www.filedropper.com/cephreport Francois ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.90 released
Hello,

So I upgraded my cluster from 0.89 to 0.90 and now I get:

~# ceph health
HEALTH_WARN too many PGs per OSD (864 > max 300)

That is a new one. I had too few before, but never too many. Is this a problem that needs attention, or is it ignorable? Or is there even a command now to shrink PGs? The message did not appear before. I currently have 32 OSDs over 8 hosts and 9 pools, each with 1024 PGs, as was the recommended number according to the OSD * 100 / replica formula, rounded to the next power of 2. The cluster was increased by 4 OSDs (an 8th host) only days before. That is to say, it was at 28 OSDs / 7 hosts / 9 pools, but after extending it with another host, ceph 0.89 did not complain. Using the formula again I'd actually need to go to 2048 PGs per pool, but ceph is telling me to reduce the PG count now?

Kind regards
René
Re: [ceph-users] v0.90 released
On 12/23/14 12:57, René Gallati wrote:
> so I upgraded my cluster from 89 to 90 and now I get:
> ~# ceph health
> HEALTH_WARN too many PGs per OSD (864 > max 300)
> [...]

The formula recommends the PG count for all pools together, not for each pool. So you need about 2048 PGs in total, distributed between the pools according to their expected size. From http://ceph.com/docs/master/rados/operations/placement-groups/: "When using multiple data pools for storing objects, you need to ensure that you balance the number of placement groups per pool with the number of placement groups per OSD so that you arrive at a reasonable total number of placement groups that provides reasonably low variance per OSD without taxing system resources or making the peering process too slow."
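A worked example for this cluster may help (assuming size=3 pools and the commonly cited target of about 100 PG replicas per OSD):

    target total PGs ≈ 32 OSDs × 100 / 3 replicas ≈ 1067  →  round up to 2048
    2048 PGs spread over 9 pools ≈ 228 per pool           →  e.g. 256 PGs per pool

The warning's figure falls out of the same arithmetic: 9 pools × 1024 PGs × 3 copies / 32 OSDs = 864 PG replicas per OSD, which is exactly what 0.90 is now complaining about.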
Re: [ceph-users] Running instances on ceph with openstack
Thanks René. Does that mean we do not need a shared RBD for live migration?

On Tue, Dec 23, 2014 at 11:47 AM, René Gallati c...@gallati.net wrote:

Hello,

On 23.12.2014 09:12, Zeeshan Ali Shah wrote:
> Has any one tried running instances over ceph i.e using ceph as backend for vm storage [...]

When you use shared block devices, the compute nodes don't need a shared file system in OpenStack. All their (runtime) information comes from either config files or the controller node/APIs. They mount RBDs, and they contact each other in the case of a live migration, so there is a sort of handover protocol, at least when you use libvirt+qemu as the hypervisor. How this is set up is described in http://ceph.com/docs/next/rbd/rbd-openstack/

Kind regards
René

-- Regards, Zeeshan Ali Shah
System Administrator - PDC HPC, PhD researcher (IT security)
Kungliga Tekniska Hogskolan
+46 8 790 9115
http://www.pdc.kth.se/members/zashah
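For context, the compute-side configuration this relies on looks roughly like the following (a sketch using Juno-era option names under [libvirt]; the pool name, cephx user and secret UUID are placeholders that must match your own setup, see the rbd-openstack document above):

# /etc/nova/nova.conf on each compute node
# [libvirt]
#     images_type = rbd
#     images_rbd_pool = vms
#     images_rbd_ceph_conf = /etc/ceph/ceph.conf
#     rbd_user = cinder
#     rbd_secret_uuid = <libvirt secret uuid>

With every compute node talking to the same RBD pool, live migration only has to hand over the running qemu process; no per-host storage needs to be copied.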
[ceph-users] Behaviour of a cluster with full OSD(s)
I understand that the status 'osd full' should never be reached. As I am new to Ceph I want to be prepared for this case. I tried two different scenarios and here are my experiences.

The first one is to completely fill the storage (for me: writing files to a rados block device). I discovered that the writing client (dd for example) then gets completely stuck, and this prevents me from stopping the process (SIGTERM, SIGKILL). At the moment I restart the whole computer to prevent writing to the cluster. Then I unmap the rbd device and set the full ratio a bit higher (0.95 to 0.97). I do a mount on my admin node and delete files till everything is okay again. Is this the best practice? Is it possible to prevent the system from running into an 'osd full' state? I could make the block devices smaller than the cluster can hold, but it's hard to calculate this exactly.

The next scenario is to change a pool size from, say, 2 to 3 replicas. While the cluster copies the objects it gets stuck as an OSD reaches its limit. Normally the osd process quits then, and I cannot restart it (even after setting the replicas back). The only possibility is to manually delete complete PG folders after exploring them with 'pg dump'. Is this the only way to get it back working again?

Greetings!
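For the record, the ratios involved can be raised temporarily from the monitors while you free space (a sketch; 0.97/0.98 is about as far as it is safe to go, and this is a stop-gap, not a fix):

ceph pg set_nearfull_ratio 0.90
ceph pg set_full_ratio 0.97
ceph tell osd.* injectargs '--osd-failsafe-full-ratio 0.98'

Monitoring for 'nearfull' and acting on it (deleting data, reweighting, or adding OSDs) is the only real way to avoid the stuck-client situation described above.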
[ceph-users] Best way to simulate SAN masking/mapping with CEPH
Hi Users List,

We have a SAN solution with zoning/masking/mapping to segregate LUN allocation and avoid cross-access issues (e.g. server srv01 accessing srv02's LUNs). I think with Ceph we can only put security on the pool side, right? We can't drill down to the LUN level with a client security file like below:

client.serv01 mon 'allow r' osd 'allow rwx pool=serv01/lununxprd01'

So what is your recommendation for my use case: 1 pool per server / per cluster? Is there a limit on the number of pools?

Thanks

Florent Monthel
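What can be done today is pool-level segregation with cephx, roughly like this (a sketch; pool and client names are invented, and the granularity is the pool, not an individual image/LUN):

ceph osd pool create serv01 128
ceph auth get-or-create client.serv01 mon 'allow r' osd 'allow rwx pool=serv01' -o /etc/ceph/ceph.client.serv01.keyring

srv01 then maps its RBD images with --id serv01 and cannot touch images in other pools. Keep in mind that each pool adds PGs, so a pool-per-server scheme only scales to a modest number of servers before PG-per-OSD counts become a problem.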
Re: [ceph-users] OSD JOURNAL not associated - ceph-disk list ?
Hi Loic, Hum… I will check. However symlink journal to partition is correctly created without action on my side : journal - /dev/disk/by-partuuid/36741e5b-eee0-4368-9736-a31701a186a1 But no journal_uuid file with cep-deploy Florent Monthel Le 23 déc. 2014 à 00:51, Loic Dachary l...@dachary.org a écrit : Hi Florent, On 22/12/2014 19:49, Florent MONTHEL wrote: Hi Loic, Hi Robert, Thanks. I’m integrating CEPH OSD with OpenSVC services (http://www.opensvc.com) so I have to generate UUID myself in order to map services It’s the reason for that I’m generating sgdisk commands with my own UUID After activating OSD, I don’t have mapping osd journal with cep-disk command root@raven:/var/lib/ceph/osd/ceph-5# ceph-disk list /dev/sda other, ext4, mounted on / /dev/sdb swap, swap /dev/sdc : /dev/sdc1 ceph journal /dev/sdd : /dev/sdd1 ceph data, active, cluster ceph, osd.3 /dev/sde : /dev/sde1 ceph journal /dev/sdf : /dev/sdf1 ceph data, active, cluster ceph, osd.4 /dev/sdg : /dev/sdg1 ceph journal /dev/sdh : /dev/sdh1 ceph data, active, cluster ceph, osd.5 After below command (osd 5), ceph-deploy didn’t create file journal_uuid : ceph-deploy --overwrite-conf osd create raven:/dev/disk/by-partuuid/6356fd8d-0d84-432a-b9f4-3d02f94afdff:/dev/disk/by-partuuid/36741e5b-eee0-4368-9736-a31701a186a1 root@raven:/var/lib/ceph/osd/ceph-5# ls -l total 56 -rw-r--r-- 1 root root 192 Dec 21 23:55 activate.monmap -rw-r--r-- 1 root root3 Dec 21 23:55 active -rw-r--r-- 1 root root 37 Dec 21 23:55 ceph_fsid drwxr-xr-x 184 root root 8192 Dec 22 19:25 current -rw-r--r-- 1 root root 37 Dec 21 23:55 fsid lrwxrwxrwx 1 root root 58 Dec 21 23:55 journal - /dev/disk/by-partuuid/36741e5b-eee0-4368-9736-a31701a186a1 -rw--- 1 root root 56 Dec 21 23:55 keyring -rw-r--r-- 1 root root 21 Dec 21 23:55 magic -rw-r--r-- 1 root root6 Dec 21 23:55 ready -rw-r--r-- 1 root root4 Dec 21 23:55 store_version -rw-r--r-- 1 root root 53 Dec 21 23:55 superblock -rw-r--r-- 1 root root0 Dec 22 19:24 sysvinit -rw-r--r-- 1 root root2 Dec 21 23:55 whoami So I created for each osd, file journal_uuid » manually and mapping become OK with ceph-disk :) root@raven:/var/lib/ceph/osd/ceph-5# echo 36741e5b-eee0-4368-9736-a31701a186a1 » journal_uuid I think this is an indication that when you ceph-disk prepare the device the journal_uuid was not provided and therefore the journal_uuid creation was skipped: http://workbench.dachary.org/ceph/ceph/blob/giant/src/ceph-disk#L1235 http://workbench.dachary.org/ceph/ceph/blob/giant/src/ceph-disk#L1235 called from http://workbench.dachary.org/ceph/ceph/blob/giant/src/ceph-disk#L1338 http://workbench.dachary.org/ceph/ceph/blob/giant/src/ceph-disk#L1338 Cheers It’s ok now : root@raven:/var/lib/ceph/osd/ceph-5# ceph-disk list /dev/sda other, ext4, mounted on / /dev/sdb swap, swap /dev/sdc : /dev/sdc1 ceph journal, for /dev/sdd1 /dev/sdd : /dev/sdd1 ceph data, active, cluster ceph, osd.3, journal /dev/sdc1 /dev/sde : /dev/sde1 ceph journal, for /dev/sdf1 /dev/sdf : /dev/sdf1 ceph data, active, cluster ceph, osd.4, journal /dev/sde1 /dev/sdg : /dev/sdg1 ceph journal, for /dev/sdh1 /dev/sdh : /dev/sdh1 ceph data, active, cluster ceph, osd.5, journal /dev/sdg1 Thanks rob...@leblancnet.us mailto:rob...@leblancnet.us mailto:rob...@leblancnet.us mailto:rob...@leblancnet.us for clue ;) *Florent Monthel** * Le 21 déc. 2014 à 18:08, Loic Dachary l...@dachary.org mailto:l...@dachary.org mailto:l...@dachary.org mailto:l...@dachary.org a écrit : Hi Florent, It is unusual to manually run the sgdisk. 
Is there a reason why you need to do this instead of letting ceph-disk prepare do it for you ? The information about the association between journal and data is only displayed when the OSD has been activated. See http://workbench.dachary.org/ceph/ceph/blob/giant/src/ceph-disk#L2246 http://workbench.dachary.org/ceph/ceph/blob/giant/src/ceph-disk#L2246 Cheers On 21/12/2014 15:11, Florent MONTHEL wrote: Hi, I would like to separate OSD and journal on 2 différent disks so I have : 1 disk /dev/sde (1GB) for journal = type code JOURNAL_UUID = '45b0969e-9b03-4f30-b4c6-b4b80ceff106' 1 disk /dev/sdd (5GB) for OSD = type code OSD_UUID = '4fbd7e29-9d25-41b8-afd0-062c0ceff05d' I execute below commands : FOR JOURNAL : sgdisk --new=1:0:1023M --change-name=1:ceph journal --partition-guid=1:e89f18cc-ae46-4573-8bca-3e782d45849c --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 -- /dev/sde FOR OSD: sgdisk --new=1:0:5119M --change-name=1:ceph data --partition-guid=1:7476f0a8-a6cd-4224-b64b-a4834c32a73e --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- /dev/sdd And I'm preparing OSD : ceph-disk prepare --osd-uuid 7476f0a8-a6cd-4224-b64b-a4834c32a73e --journal-uuid
Re: [ceph-users] v0.90 released
Hello,

On 23.12.2014 12:14, Henrik Korkuc wrote:
> formula recommends PG count for all pools, not each pool. So you need about 2048 PGs total distributed by expected pool size. [...]

Ah, I seem to have overlooked this. Luckily I had 5 pools exclusively for testing purposes and another that was not in use - killing those put me under the complaint threshold. In that case, 0.90 is the first version that actually complains about too many PGs per OSD, it appears. What I don't like that much about this soft limit is the fact that PGs are defined per pool, which means that just adding a new pool is not as straightforward as I thought it was. If you are already somewhere near the limit, all you can do is make a new pool with a low PG count, thus potentially making that pool less well distributed than all the pools that came before. But perhaps the overhead incurred with higher PG numbers isn't that bad anyway - after all it ran well up until now.

Kind regards
René
[ceph-users] erasure coded pool k=7,m=5
Hi all,

Soon we should have a 3-datacenter (dc) Ceph cluster with 4 hosts in each dc. Each host will have 12 OSDs. We can accept the loss of one datacenter plus one host in the remaining 2 datacenters. In order to use an erasure coded pool:

1. Is a strategy of k=7, m=5 acceptable?
2. Is it the only one that guarantees our premise?
3. And more generally, is there a formula (based on the number of dc, hosts and OSDs) that allows us to calculate the profile?

Thanks.
Stephane.

--
Université de Lorraine
Stéphane DUGRAVOT - Direction du numérique - Infrastructure
Jabber : stephane.dugra...@univ-lorraine.fr
Tél. : +33 3 83 68 20 98
Re: [ceph-users] erasure coded pool k=7,m=5
Hi Stéphane,

On 23/12/2014 14:34, Stéphane DUGRAVOT wrote:
> Soon, we should have a 3 datacenters (dc) ceph cluster with 4 hosts in each dc. Each host will have 12 OSD. We can accept the loss of one datacenter and one host on the remaining 2 datacenters.
> 1. Is the solution for a strategy k = 7, m = 5 is acceptable ?

If you want to sustain the loss of one datacenter, k=2, m=1 is what you want, with a ruleset that requires that no two shards be in the same datacenter. It also sustains the loss of one host within a datacenter: the missing chunk on the lost host will be reconstructed using the two other chunks from the two other datacenters. If, in addition, you want to sustain the loss of one machine while a datacenter is down, you would need to use the LRC plugin.

> 2. Is this is the only one that guarantees us our premise ?
> 3. And more generally, is there a formula (based on the number of dc, host and OSD) that allows us to calculate the profile ?

I don't think there is such a formula.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
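Loïc's k=2, m=1 suggestion would be expressed roughly as follows (a sketch with firefly/giant parameter names; profile and pool names are placeholders):

ceph osd erasure-code-profile set ecdc k=2 m=1 ruleset-failure-domain=datacenter
ceph osd erasure-code-profile get ecdc
ceph osd pool create ecpool 1024 1024 erasure ecdc

With ruleset-failure-domain=datacenter, each of the three chunks lands in a different datacenter, so losing one datacenter (or any single host) still leaves two chunks, which is enough to reconstruct the data.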
Re: [ceph-users] Behaviour of a cluster with full OSD(s)
Max, List, Max Power [Tue, Dec 23, 2014 at 12:34:54PM +0100]: [...Recovering from full osd ...] Normally the osd process quits then and I cannot restart it (even after setting the replicas back). The only possibility is to manually delete complete PG folders after exploring them with 'pg dump'. Is this the only way to get it back working again? I was wondering if ceph-osd crashing when the disk gets full shouldn't be considered being a bug? Shouldn't ceph osd be able to recover itself? Like if an admin detects that the disk is full, she can simply reduce the weight of the osd to free up space. With a dead osd, this is not possible. To those having deeper ceph knowledge: For what reason does ceph-osd exit when the disk is full? Why can it not start when it is full to get itself out of this invidious situation? Cheers, Nico -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Balancing erasure crush rule
I'm trying to set up an erasure coded pool with k=9, m=6 on 13 OSD hosts, and I'm trying to write a crush rule for this which will balance it between hosts as much as possible. I understand that since 9+6=15 > 13, I will need to parse the tree twice in order to find enough OSDs. So what I'm trying to do is select ~1 from each host on the first pass, and then select n more OSDs to fill it out, without using any OSDs from the first pass, and preferably balancing them between racks. For starters, I don't know if this is even possible or if it's the right approach to what I'm trying to do, but here's my attempt:

rule .us-phx.rgw.buckets.ec {
    ruleset 1
    type erasure
    min_size 3
    max_size 20
    step set_chooseleaf_tries 5
    step take default
    step chooseleaf indep 0 type host
    step emit
    step take default
    step chooseleaf indep 0 type rack
    step emit
}

This gets me pretty close: the first pass works great and the second pass does a nice balance between racks, but in my testing ~6 out of 1000 PGs end up with a duplicate OSD in their set. I'm guessing I need to get down to one pass to make sure that doesn't happen, but I'm having a hard time sorting out how to hit the requirement of balancing among hosts *and* allowing for more than one OSD per host.

Thanks, Aaron
Re: [ceph-users] v0.90 released
On Tue, 23 Dec 2014, René Gallati wrote:
> Hello, so I upgraded my cluster from 89 to 90 and now I get:
> ~# ceph health
> HEALTH_WARN too many PGs per OSD (864 > max 300)
> That is a new one. I had too few but never too many. Is this a problem that needs attention, or ignorable? Or is there even a command now to shrink PGs?

It's a new warning. You can't reduce the PG count without creating new (smaller) pools and migrating data. You can ignore the message, though, and make it go away by adjusting 'mon pg warn max per osd' (defaults to 300). Having too many PGs increases memory utilization and can slow things down when adapting to a failure, but it certainly isn't fatal.

> The message did not appear before, I currently have 32 OSDs over 8 hosts and 9 pools, each with 1024 PG as was the recommended number according to the OSD * 100 / replica formula, then round to next power of 2. [...]
> Using the formula again I'd actually need to go to 2048 PGs in pools but ceph is telling me to reduce the PG count now?

The guidance in the docs is (was?) a bit confusing. You need to take the *total* number of PGs and see how many of those per OSD there are, not create as many equally-sized pools as you want. There have been several attempts to clarify the language to avoid this misunderstanding (you're definitely not the first). If it's still unclear, suggestions welcome!

sage
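If you decide to silence the warning rather than rebuild pools, the override Sage mentions can be applied like this (a sketch; pick a value above your actual PG-per-OSD count):

# ceph.conf on the monitor hosts
# [mon]
#     mon pg warn max per osd = 1000

# or at runtime, without restarting:
ceph tell mon.* injectargs '--mon-pg-warn-max-per-osd 1000'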
Re: [ceph-users] v0.90 released
Hi Sage, Am 23.12.2014 15:39, schrieb Sage Weil: ... You can't reduce the PG count without creating new (smaller) pools and migrating data. does this also work with the pool metadata, or is this pool essential for ceph? Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RBD pool with unfound objects
Hi all,

I have some questions about unfound objects in an rbd pool: what is the real impact on the rbd image? Currently our cluster (running on v0.80.5) has 25 unfound objects due to recent OSD crashes, and we cannot mark them as lost yet (Bug #10405 created for this). So far it seems we can still mount the rbd image (the filesystem is xfs), but I would like to know the real impact:

1. My guess is it should be like a bad sector on a real hard disk?
2. Is there any way to identify which files on the RBD disk are impacted?
3. What happens if we mark them as lost using 'ceph pg <pgid> mark_unfound_lost revert|delete'?
4. Is it better to copy the current rbd image to a new one and use the new one instead?

Any suggestion for the current situation is also welcome; we need to keep the data inside this RBD.

Thanks in advance,
BR,
Luke
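On question 2, you can at least narrow each unfound object down to an offset inside the image, because RBD data objects are named after the image's block prefix plus a hexadecimal object index (a sketch; pool, image and pg ids are examples):

ceph health detail | grep unfound
ceph pg 2.5f list_missing           # lists the unfound object names for that PG
rbd info rbd/myimage                # note block_name_prefix, e.g. rbd_data.1234abcd
# an unfound object named rbd_data.1234abcd.00000000000000a5 covers the byte range
#   [0xa5 * object_size, (0xa5 + 1) * object_size) of the image (object_size is 4 MB by default)

From there, filesystem tools such as xfs_bmap on the files you care about can tell you whether anything important lives in those ranges. 'revert' rolls an object back to a previous version if one exists, while 'delete' forgets it entirely, so reads of that range behave as if the object never existed.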
Re: [ceph-users] Balancing erasure crush rule
After some more work I realized that didn't get me closer at all. It was still only selecting 13 OSDs *and* still occasionally re-selecting the same one. I think the multiple emit/takes aren't working like I expect. Given:

step take default
step chooseleaf indep 0 type host
step emit
step take default
step chooseleaf indep 0 type host
step emit

in a rule, I would expect it to try to select ~1 OSD per host once, and then start over again. Instead, what I'm seeing is that it selects ~1 OSD per host and then, when it starts again, it re-selects those same OSDs, resulting in multiple placements on 2 or 3 OSDs per PG. It turns out what I'm trying to do is described here: https://www.mail-archive.com/ceph-users%40lists.ceph.com/msg01076.html But I can't find any other references to anything like this.

Thanks, Aaron

On Dec 23, 2014, at 9:23 AM, Aaron Bassett aa...@five3genomics.com wrote:
> I'm trying to set up an erasure coded pool with k=9 m=6 on 13 osd hosts [...]
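For what it's worth, the single-pass pattern usually suggested when k+m exceeds the host count is to choose a fixed number of hosts and then several OSDs inside each, which by construction never picks the same OSD twice (a sketch; 5 hosts × 3 OSDs = 15 shards, so a host failure costs 3 shards and m must be at least 3 for that to be survivable):

rule ecpool-9-6 {
    ruleset 1
    type erasure
    min_size 3
    max_size 20
    step set_chooseleaf_tries 5
    step take default
    step choose indep 5 type host
    step chooseleaf indep 3 type osd
    step emit
}

It does not give the ~1-shard-per-host spread you are after, but it is predictable; getting exactly one shard on most hosts plus a second shard on only a few, without duplicates, is not something a single CRUSH rule expresses easily.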
[ceph-users] Online converting of pool type
Hi,

Every now and then someone asks if it's possible to convert a pool to a different type (replicated vs erasure / change the amount of PGs / etc), but this is not supported. The advised approach is usually to just create a new pool and somehow copy all data manually to this new pool, removing the old pool afterwards. This is both impractical and very time consuming. Recently I saw someone on this list suggest that the cache tiering feature may actually be used to achieve some form of online converting of pool types. Today I ran some tests and I would like to share my results.

I started out with a pool test-A, created an rbd image in the pool, mapped it, created a filesystem in the rbd image, mounted the fs and placed some test files in it, just to have some objects in the test-A pool. I then added a test-B pool and transferred the data using cache tiering as follows:

Step 0: We have a test-A pool and it contains data, some of which is in use.
# rados -p test-A df
test-A - 9941 110 0 0 324 2404 57 4717

Step 1: Create new pool test-B
# ceph osd pool create test-B 32
pool 'test-B' created

Step 2: Make pool test-A a cache pool for test-B.
# ceph osd tier add test-B test-A --force-nonempty
# ceph osd tier cache-mode test-A forward

Step 3: Move data from test-A to test-B (this potentially takes long)
# rados -p test-A cache-flush-evict-all
This step will move all data except the objects that are in active use, so we are left with some remaining data in the test-A pool.

Step 4: Move also the remaining data. This is the only step that doesn't work online.
Step 4a: Disconnect all clients
# rbd unmap /dev/rbd/test-A/test-rbd (in my case)
Step 4b: Move remaining objects
# rados -p test-A cache-flush-evict-all
# rados -p test-A ls (should now be empty)

Step 5: Remove test-A as cache pool
# ceph osd tier remove test-B test-A

Step 6: Clients are allowed to connect to the test-B pool (we are back in online mode)
# rbd map test-B/test-rbd (in my case)

Step 7: Remove the now empty pool test-A
# ceph osd pool delete test-A test-A --yes-i-really-really-mean-it

This worked smoothly. In my first try I actually used more steps, by creatig
Re: [ceph-users] Online converting of pool type
Whoops, I accidentally sent my mail before it was finished. Anyway, I have some more testing to do, especially with converting between erasure and replicated pools, but it looks promising.

Thanks,
Erik.

On 23-12-14 16:57, Erik Logtenberg wrote:
> Every now and then someone asks if it's possible to convert a pool to a different type (replicated vs erasure / change the amount of pg's / etc), but this is not supported. [...]
Re: [ceph-users] Need help from Ceph experts
If you're intent is to learn Ceph, then I suggest that you set up three or four VMs to learn how all the components work together. Then you will know better how to put different components together and you can decide which combination works better for you. I don't like any of those components in the same OS because they can interfere with each other pretty bad. Putting them in VMs gets around some of the possible deadlocks but then there is usually not enough disk IO. That is my $0.02. Robert LeBlanc Sent from a mobile device please excuse any typos. On Dec 23, 2014 6:12 AM, Debashish Das deba@gmail.com wrote: Hi, Thanks for the replies, I have some more queries now :-) 1. I have one 64 bit Physical Server (4 GB RAM, QuadCore 250 GB HDD) One VM (not a high end one). I want to install ceph-mon, ceph-osd ceph RBD (Rados Block Device). Can you please tell me if it is possible to only install ceph-mon ceph RBD in one VM ceph-osd in Physical Machine? Or do you have any other idea how to proceed with my current hardware resources? Please also let me know any reference links which I can refer for this kind of installation. I am not sure which component (mon/osd/RBD) should I install in which setup ( VM/Physical Server). Your expert opinion would be of great help for me. Thank You. Kind Regards Debashish Das On Sat, Dec 20, 2014 at 12:00 AM, Craig Lewis cle...@centraldesktop.com wrote: I've done single nodes. I have a couple VMs for RadosGW Federation testing. It has a single virtual network, with both clusters on the same network. Because I'm only using a single OSD on a single host, I had to update the crushmap to handle that. My Chef recipe runs: ceph osd getcrushmap -o /tmp/compiled-crushmap.old crushtool -d /tmp/compiled-crushmap.old -o /tmp/decompiled-crushmap.old sed -e '/step chooseleaf firstn 0 type/s/host/osd/' /tmp/decompiled-crushmap.old /tmp/decompiled-crushmap.new crushtool -c /tmp/decompiled-crushmap.new -o /tmp/compiled-crushmap.new ceph osd setcrushmap -i /tmp/compiled-crushmap.new Those are the only extra commands I run for a single node cluster. Otherwise, it looks the same as my production nodes that run mon, osd, and rgw. Here's my single node's ceph.conf: [global] fsid = a7798848-1d31-421b-8f3c-5a34d60f6579 mon initial members = test0-ceph0 mon host = 172.16.205.143:6789 auth client required = none auth cluster required = none auth service required = none mon warn on legacy crush tunables = false osd crush chooseleaf type = 0 osd pool default flag hashpspool = true osd pool default min size = 1 osd pool default size = 1 public network = 172.16.205.0/24 [osd] osd journal size = 1000 osd mkfs options xfs = -s size=4096 osd mkfs type = xfs osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64 osd_scrub_sleep = 1.0 osd_snap_trim_sleep = 1.0 [client.radosgw.test0-ceph0] host = test0-ceph0 rgw socket path = /var/run/ceph/radosgw.test0-ceph0 keyring = /etc/ceph/ceph.client.radosgw.test0-ceph0.keyring log file = /var/log/ceph/radosgw.log admin socket = /var/run/ceph/radosgw.asok rgw dns name = test0-ceph rgw region = us rgw region root pool = .us.rgw.root rgw zone = us-west rgw zone root pool = .us-west.rgw.root On Thu, Dec 18, 2014 at 11:23 PM, Debashish Das deba@gmail.com wrote: Hi Team, Thank for the insight the replies, as I understood from the mails - running Ceph cluster in a single node is possible but definitely not recommended. The challenge which i see is there is no clear documentation for single node installation. 
So I would request if anyone has installed Ceph in single node, please share the link or document which i can refer to install Ceph in my local server. Again thanks guys !! Kind Regards Debashish Das On Fri, Dec 19, 2014 at 6:08 AM, Robert LeBlanc rob...@leblancnet.us wrote: Thanks, I'll look into these. On Thu, Dec 18, 2014 at 5:12 PM, Craig Lewis cle...@centraldesktop.com wrote: I think this is it: https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939 You can also check out a presentation on Cern's Ceph cluster: http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern At large scale, the biggest problem will likely be network I/O on the inter-switch links. On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us wrote: I'm interested to know if there is a reference to this reference architecture. It would help alleviate some of the fears we have about scaling this thing to a massive scale (10,000's OSDs). Thanks, Robert LeBlanc On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com wrote: On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com wrote: 2. What should be the minimum hardware requirement of the server (CPU, Memory, NIC etc) There is no real
Re: [ceph-users] Cluster unusable
Hi,

I got a recommendation from Stephan to restart the OSDs one by one, so I did. It helped a bit (some IOs completed), but in the end the state was the same as before, and new IOs still hung.

Loïc, thanks for the advice on moving osd.0 and osd.4 back into the game. Actually this was done by simply restarting ceph on that node:

[root@qvitblhat12 ~]# date;service ceph status
Tue Dec 23 14:36:11 UTC 2014
=== osd.0 ===
osd.0: running {version:0.80.7}
=== osd.4 ===
osd.4: running {version:0.80.7}
[root@qvitblhat12 ~]# date;service ceph restart
Tue Dec 23 14:36:17 UTC 2014
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on qvitblhat12...kill 4527...kill 4527...done
=== osd.0 ===
create-or-move updating item name 'osd.0' weight 0.27 at location {host=qvitblhat12,root=default} to crush map
Starting Ceph osd.0 on qvitblhat12...
Running as unit run-4398.service.
=== osd.4 ===
=== osd.4 ===
Stopping Ceph osd.4 on qvitblhat12...kill 5375...done
=== osd.4 ===
create-or-move updating item name 'osd.4' weight 0.27 at location {host=qvitblhat12,root=default} to crush map
Starting Ceph osd.4 on qvitblhat12...
Running as unit run-4720.service.

[root@qvitblhat06 ~]# ceph osd tree
# id    weight  type name               up/down reweight
-1      1.62    root default
-5      1.08        datacenter dc_XAT
-2      0.54            host qvitblhat10
1       0.27                osd.1       up      1
5       0.27                osd.5       up      1
-4      0.54            host qvitblhat12
0       0.27                osd.0       up      1
4       0.27                osd.4       up      1
-6      0.54        datacenter dc_QVI
-3      0.54            host qvitblhat11
2       0.27                osd.2       up      1
3       0.27                osd.3       up      1
[root@qvitblhat06 ~]#

This change made ceph rebalance data, and then the miracle: all PGs ended up active+clean.

[root@qvitblhat06 ~]# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set
noscrub,nodeep-scrub flag(s) set

Well, apart from being happy that the cluster is now healthy, I find it a little bit scary to have to shake it in one direction and then another and hope that it will eventually recover, while in the meantime my users' IOs are stuck... So is there a way to understand what happened?

Francois
Re: [ceph-users] Behaviour of a cluster with full OSD(s)
On Tue, Dec 23, 2014 at 3:34 AM, Max Power mailli...@ferienwohnung-altenbeken.de wrote: I understand that the status osd full should never be reached. As I am new to ceph I want to be prepared for this case. I tried two different scenarios and here are my experiences: For a real cluster, you should be monitoring your cluster, and taking immediate action once you get an OSD in nearfull state. Waiting until OSDs are toofull is too late. For a test cluster, it's a great learning experience. :-) The first one is to completely fill the storage (for me: writing files to a rados blockdevice). I discovered that the writing client (dd for example) gets completly stucked then. And this prevents me from stoping the process (SIGTERM, SIGKILL). At the moment I restart the whole computer to prevent writing to the cluster. Then I unmap the rbd device and set the full ratio a bit higher (0.95 to 0.97). I do a mount on my adminnode and delete files till everything is okay again. Is this the best practice? It is a design feature of Ceph that all cluster reads and writes stop until the toofull situation is resolved. The route you took is one of two ways to recover. The other route you found in your replica test. Is it possible to prevent the system from running in a osd full state? I could make the block devices smaller than the cluster can save. But it's hard to calculate this exactly. If you continue to add data to the cluster after it's nearfull, then you're going to hit toofull. Once you hit nearfull, you need to delete existing data, or add more OSDs. You've probably noticed that some OSDs are using more space than others. You can try to even them out with `ceph osd reweight` or `ceph osd crush reweight`, but that's a delaying tactic. When I hit nearfull, I place an order for new hardware, then use `ceph osd reweight` until it arrives. The next scenario is to change a pool size from say 2 to 3 replicas. While the cluster copies the objects it gets stuck as an osd reaches it limit. Normally the osd process quits then and I cannot restart it (even after setting the replicas back). The only possibility is to manually delete complete PG folders after exploring them with 'pg dump'. Is this the only way to get it back working again? There are some other configs that might have come into play here. You might have run into osd_failsafe_nearfull_ratio or osd_failsafe_full_ratio. You could try bumping those up a bit, and see if that lets the process stay up long enough to start reducing replicas. Since osd_failsafe_full_ratio is already 0.97, I wouldn't take it any higher than 0.98. Ceph triggers on greater-than percentages, so 0.99 will let you fill a disk to 100% full. If you get a disk to 100% full, the only way to cleanup is to start deleting PG directories. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
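The reweight delaying tactic looks like this in practice (a sketch; the osd id and factor are examples):

ceph osd reweight 3 0.8      # temporary override, 0.0-1.0, pushes data off osd.3
ceph osd reweight 3 1.0      # restore once space is available again

'ceph osd reweight' is an override that sits on top of the CRUSH weight and is meant for exactly this kind of short-term rebalancing; 'ceph osd crush reweight' changes the CRUSH weight itself and is the one to adjust permanently when disk sizes differ.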
Re: [ceph-users] Any Good Ceph Web Interfaces?
Are you asking because you want to manage a Ceph cluster point and click? Or do you need some shiny to show the boss? I'm using a combination of Chef and Zabbix. I'm not running RHEL though, but I would assume those are available in the repos. It's not as slick as Calamari, and it really doesn't give me a whole cluster view. Ganglia did a better job of that, but I went with Zabbix for the graphing and alerting in a single product. If you're looking for some shiny for the boss, Zabbix's web interface should work fine. If you're looking for a point and click way to build a Ceph cluster, I think Calamari is your only option. On Mon, Dec 22, 2014 at 4:11 PM, Tony unix...@gmail.com wrote: Please don't mention calamari :-) The best web interface for ceph that actually works with RHEL6.6 Preferable something in repo and controls and monitors all other ceph osd, mon, etc. Take everything and live for the moment. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Any Good Ceph Web Interfaces?
Again it depends on what you want to do. I started to evaluate VSM - it's from intel, and it's what the fujitsu uses in the eternus cd1 - but it didn't work for me. https://01.org/virtual-storage-manager It didn't work for me, because it wants to completely manage all the cluster, starting from scratch - I have puppet; and it's targetted at the CentOS crowd - I use ubuntu. On Tue, Dec 23, 2014 at 8:05 PM, Craig Lewis cle...@centraldesktop.com wrote: Are you asking because you want to manage a Ceph cluster point and click? Or do you need some shiny to show the boss? I'm using a combination of Chef and Zabbix. I'm not running RHEL though, but I would assume those are available in the repos. It's not as slick as Calamari, and it really doesn't give me a whole cluster view. Ganglia did a better job of that, but I went with Zabbix for the graphing and alerting in a single product. If you're looking for some shiny for the boss, Zabbix's web interface should work fine. If you're looking for a point and click way to build a Ceph cluster, I think Calamari is your only option. On Mon, Dec 22, 2014 at 4:11 PM, Tony unix...@gmail.com wrote: Please don't mention calamari :-) The best web interface for ceph that actually works with RHEL6.6 Preferable something in repo and controls and monitors all other ceph osd, mon, etc. Take everything and live for the moment. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Archives haven't been updated since Dec 8?
I was trying to link a colleague to a message on the mailing list, and noticed the archives haven't been rebuilt since Dec 8: http://lists.ceph.com/pipermail/ceph-users-ceph.com/ Did something break there? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph on ArmHF Ubuntu 14.4LTS?
Hi Chris, Would you care to name the vendor and hw config? I.e. x Arm Cores to y Disks/SSDs? Thanks --phil On 23 Dec 2014, at 07:10, Christopher Kunz chrisl...@de-punkt.de wrote: Am 22.12.14 um 16:10 schrieb Gregory Farnum: On Sun, Dec 21, 2014 at 11:54 PM, Christopher Kunz chrisl...@de-punkt.de wrote: Hi all, I'm trying to get a working PoC installation of Ceph done on an armhf platform. I'm failing to find working Ceph packages (so does ceph-deploy, too) for Ubuntu Trusty LTS. The ceph.com repos don't have anything besides ceph-deploy and radosgw-agent, and there are no packages in the ubuntu repos, either. What am I missing here? I don't believe we build arm packages upstream right now. Debian does, but I'm not sure about Ubuntu. We have done so in the past on a dev level (never official release packages), so if this is something you're interested in it should be pretty simple to home-brew them. :) -Greg Hi, in fact there seem to be packages in some openstack repo - I received a repository list from the arm server vendor (who happens to advertise Ceph compatibility, so kind of has to deliver :) ) and am now running a Giant cluster on 6 ARMv7 nodes. The performance is... uh, let's say, interesting ;) Thanks anyway! Regards, --ck ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
I am trying to understand the drive throttle markers that were mentioned, to get an idea of why these drives are marked as slow. Here is the iostat of the drive /dev/sdbm: http://paste.ubuntu.com/9607168/ An iowait of 0.79 doesn't seem bad, but a write await of 21.52 seems really high. Looking at the ops in flight: http://paste.ubuntu.com/9607253/ If we check against all of the OSDs on this node, this seems strange: http://paste.ubuntu.com/9607331/ I do not understand why this node has ops in flight while the rest seem to be performing without issue. The load on the node is pretty light as well, with an average CPU at 16 and an average iowait of 0.79:

/var/run/ceph# iostat -xm /dev/sdbm
Linux 3.13.0-40-generic (kh10-4)   12/23/2014   _x86_64_   (40 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.94   0.00    23.30     0.79    0.00  71.97

Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdbm       0.09    0.25  5.03  3.42   0.55   0.63    288.02      0.09  10.56     2.55    22.32   2.54   2.15

I am still trying to understand the osd throttle perf dump, so if anyone can help shed some light on this that would be rad. From what I can tell from the perf dump, 4 OSDs are affected (the last one, osd.228, being the slow one currently). I ended up pulling osd.228 from the cluster and I have yet to see another slow/blocked OSD in the output of ceph -s. The cluster is still rebuilding since I just pulled osd.228 out, but I am still getting at least 200MB/s via bonnie while the rebuild is occurring.

Finally, if this helps anyone: one 1GB upload takes around 2.0 to 2.5 minutes, but if we split a 10GB file into 100 x 100MB parts we get a completion time of about 1 minute, i.e. a 10GB file in about 1 to 1.5 minutes, or 166.66MB/s versus the 8MB/s I was getting before with sequential uploads. All of these are coming from a single client via boto. This leads me to think that this is a radosgw issue specifically. This again makes me think that this is not a slow-disk issue but an overall radosgw issue. If this were structural in any way, I would think that all of rados/ceph's facilities would be hit and the 8MB/s limit per client would be due to client throttling because a ceiling was being hit. As it turns out I am not hitting the ceiling, but some other aspect of radosgw or boto is limiting my throughput. Is this logic not correct? I feel like I am missing something.

Thanks for the help everyone!
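One way to see where the time goes on a suspect OSD is its admin socket, which keeps the slowest recent ops with per-stage timestamps (a sketch; the osd id and socket path are examples matching the node above):

ceph daemon osd.228 dump_ops_in_flight
ceph daemon osd.228 dump_historic_ops
# equivalently, via the socket path:
ceph --admin-daemon /var/run/ceph/ceph-osd.228.asok dump_historic_ops

The historic ops output shows how long each op spent waiting on the journal, on sub-ops to replicas, and in the various queues, which usually makes it clear whether the drive, the network, or a throttle is the bottleneck.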
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
On Mon, 2014-12-22 at 15:26 -0800, Craig Lewis wrote: My problems were memory pressure plus an XFS bug, so it took a while to manifest. The following (long, ongoing) thread on linux-mm discusses our [severe] problems with memory pressure taking out entire OSD servers. The upstream problems are still unresolved as at Linux 3.18, but anyone running Ceph on XFS over especially Infiniband or *anything* that does custom allocation in the kernel should probably be aware of this. http://marc.info/?l=linux-mmm=141605213522925w=2 AfC Sydney -- Andrew Frederick Cowie Head of Engineering Anchor Systems afcowie anchor hosting ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Any Good Ceph Web Interfaces?
Hi, for monitoring only I use the Ceph Dashboard https://github.com/Crapworks/ceph-dash/ Fo me it's an nice tool for an good overview - for administration i use the cli. Udo On 23.12.2014 01:11, Tony wrote: Please don't mention calamari :-) The best web interface for ceph that actually works with RHEL6.6 Preferable something in repo and controls and monitors all other ceph osd, mon, etc. Take everything and live for the moment. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com