Re: [ceph-users] xenserver or xen ceph
Hi. There is a solution for Ceph in XenServer. With the help of my engineer Mark, we developed a simple patch which allows you to search for and attach an RBD image on XenServer. We create an LVHD over the RBD (no RBD-per-VDI mapping yet), so it is far from ideal, but it's a good start. Creating the SR over RBD works even from XenCenter. https://github.com/mstarikov/rbdsr

Install notes are included and it's very simple; it takes a few minutes per XenServer. We have been running this in our Sydney Citrix lab for some time, and I have been running it at home as well. Works great.

Looking ahead, the patch should work in the upcoming version of XenServer (Dundee) too. We are also trying to get native Ceph packages into the new version and to build an experimental (not official or approved yet) version of smapi which would allow us to map an RBD per VDI, but there are no details on this yet. Anyway, everyone is welcome to participate in improving the patch on GitHub. Let me know if you have any questions.

Cheers, Jiri

On 16/02/2016 15:30, Christian Balzer wrote:

On Tue, 16 Feb 2016 11:52:17 +0800 (CST) maoqi1982 wrote:
> Hi lists. Is there any solution or documentation for using Ceph as XenServer or Xen backend storage?

Not really. There was a project to natively support Ceph (RBD) in XenServer, but that seems to have gone nowhere. There was also a thread here last year, "RBD hard crash on kernel 3.10" (google for it), where Shawn Edwards was working on something similar, but that seems to have died off silently as well. While you could of course put an NFS (some pains) or iSCSI (major pains) head in front of Ceph, the pains and reduced performance make it an unattractive proposition.

Christian
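[Editor's sketch] In case it helps to picture what the attach step amounts to, here is a minimal sketch using only the stock RBD CLI; the image name and size are made-up examples, and the LVHD layering itself is what the patched smapi performs (this is not the patch's actual code path):

$ rbd create --size 1048576 rbd/xenserver-sr   # example: 1 TB image backing the SR
$ sudo rbd map rbd/xenserver-sr                # exposes e.g. /dev/rbd0 on the host
# The patched smapi then lays down the usual LVHD structures (a VG with
# one LV per VDI) on top of the mapped device, as it would over any LUN.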
Re: [ceph-users] Issue with journal on another drive
Thanks to all for the responses. Great thread with a lot of info. I will go with the 3 partitions on the Kingston SSD for the 3 OSDs on each node.

Thanks, Jiri

On 30/09/2015 00:38, Lionel Bouton wrote:

Hi,

On 29/09/2015 13:32, Jiri Kanicky wrote:
> Hi Lionel. Thank you for your reply. In this case I am considering creating separate partitions for each disk on the SSD drive. It would be good to know what the performance difference is, because creating partitions is kind of a waste of space.

The difference is hard to guess: filesystems need more CPU power than raw block devices, for example, so if you don't have much CPU power this can make a significant difference. Filesystems might put more load on your storage too (for example, ext3/4 with data=journal will at least double the disk writes). So there's a lot to consider, and nothing will be faster for journals than a raw partition. LVM logical volumes come a close second because usually (if you simply use LVM to create your logical volumes and don't try to use anything else like snapshots) they don't change access patterns and need almost no CPU power.

> One more question: is it a good idea to move the journals for 3 OSDs to a single SSD, considering that if the SSD fails the whole node with 3 HDDs will be down?

If your SSDs are working well with Ceph and aren't cheap models dying under heavy writes, yes. I use one 200GB DC3710 SSD for 6 7200rpm SATA OSDs (using 60GB of it for the 6 journals) and it works very well (they were a huge performance boost compared to our previous use of internal journals). Some SSDs are slower than HDDs for Ceph journals, though (there have been many discussions on this subject on this mailing list).

> Thinking of it, leaving the journal on each OSD might be safer, because a journal on one disk does not affect the other disks (OSDs). Or do you think that having the journal on SSD is the better trade-off?

You will put significantly more stress on your HDDs by leaving the journals on them, and good SSDs are far more robust than HDDs, so if you pick an Intel DC or equivalent SSD for the journals your infrastructure might even be more robust than one using internal journals (HDDs drop like flies when you have hundreds of them). There are other components able to take down all your OSDs: the disk controller, the CPU, the memory, the power supply, ... So adding one robust SSD shouldn't change the overall availability much (you must check their wear level and choose the models according to the amount of writes you want them to support over their lifetime, though). The main reason for journals on SSD is performance anyway. If your setup is already fast enough without them, I wouldn't try to add SSDs. Otherwise, if you can't reach the level of performance needed by adding the OSDs already needed for your storage capacity objectives, go SSD.

Best regards, Lionel
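[Editor's sketch] The partitioning step for three raw journal partitions might look like this, assuming a GPT label on the SSD at /dev/sda and 10 GB per journal (device names and sizes are examples, not from the thread):

$ sudo parted -s /dev/sda mkpart journal-0 1MiB 10GiB
$ sudo parted -s /dev/sda mkpart journal-1 10GiB 20GiB
$ sudo parted -s /dev/sda mkpart journal-2 20GiB 30GiB
# then point each OSD's journal at its raw partition, e.g. with ceph-deploy:
$ ceph-deploy osd create node1:sdb:/dev/sda1 node1:sdc:/dev/sda2 node1:sdd:/dev/sda3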
Re: [ceph-users] Issue with journal on another drive
Hi Lionel. Thank you for your reply. In this case I am considering creating separate partitions for each disk on the SSD drive. It would be good to know what the performance difference is, because creating partitions is kind of a waste of space.

One more question: is it a good idea to move the journals for 3 OSDs to a single SSD, considering that if the SSD fails the whole node with 3 HDDs will be down? Thinking of it, leaving the journal on each OSD might be safer, because a journal on one disk does not affect the other disks (OSDs). Or do you think that having the journal on SSD is the better trade-off?

Thank you, Jiri

On 29/09/2015 21:10, Lionel Bouton wrote:

On 29/09/2015 07:29, Jiri Kanicky wrote:
> Hi, Is it possible to create the journal in a directory as explained here: http://wiki.skytech.dk/index.php/Ceph_-_howto,_rbd,_lvm,_cluster#Add.2Fmove_journal_in_running_cluster

Yes, the general idea (stop, flush, move, update ceph.conf, mkjournal, start) is valid for moving your journal wherever you want. That said, it probably won't perform as well on a filesystem (LVM has lower overhead than a filesystem).

> 1. Create BTRFS over /dev/sda6 (assuming this is the SSD partition allocated for the journal) and mount it to /srv/ceph/journal

BTRFS is probably the worst idea for hosting journals. If you must use BTRFS, you'll have to make sure that the journals are created NoCoW before the first byte is ever written to them.

> 2. Add OSD: ceph-deploy osd create --fs-type btrfs ceph1:sdb:/srv/ceph/journal/osd$id/journal

I've no experience with ceph-deploy...

Best regards, Lionel
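[Editor's sketch] For reference, the stop/flush/move/mkjournal/start sequence mentioned above looks roughly like this for osd.0 (the OSD id and paths are examples; the chattr step applies only if the journal lives on BTRFS and must happen before anything is written there):

$ sudo service ceph stop osd.0
$ sudo ceph-osd -i 0 --flush-journal
$ sudo mkdir -p /srv/ceph/journal/osd0
$ sudo chattr +C /srv/ceph/journal/osd0   # new files in this dir inherit NoCoW
# update "osd journal" for osd.0 in ceph.conf to point at the new path, then:
$ sudo ceph-osd -i 0 --mkjournal
$ sudo service ceph start osd.0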
Re: [ceph-users] Ceph OSD nodes in XenServer VMs
Hi. As we would like to use the Ceph storage with CloudStack/XS, we have to use NFS or iSCSI client nodes to provide shared storage. To avoid several extra boxes of physical hardware, we thought we could run the NFS/iSCSI client node on the same box as a Ceph OSD node. Possibly we could even run separate VMs for the MONs on the same hypervisor. This would give us flexibility: we could migrate the NFS/iSCSI or MON VMs to any hosts we want at any time. We could also take snapshots of the Ceph OSD VMs during upgrades; if something does not go well, we can roll back fast. Potentially, we could turn every XS local storage into a Ceph OSD (utilizing the unused local HDDs). I think the I/O performance tax from VM to raw local disk is negligible in comparison with a physical box.

Anyway, these are just thoughts. It might not be the best idea, but that is the reason why I am asking.

Thx, Jiri

On 21/08/2015 10:12, Steven McDonald wrote:

Hi Jiri,

On Thu, 20 Aug 2015 11:55:55 +1000 Jiri Kanicky wrote:
> We are experimenting with an idea to run OSD nodes in XenServer VMs. We believe this could provide better flexibility, backups for the nodes etc.

Could you expand on this? As written, it seems like a bad idea to me, just because you'd be adding complexity for no gain. Can you explain, for instance, why you think it would enable better flexibility, or why it would help with backups? What is it that you intend to back up? Backing up the OS on a storage node should never be necessary, since it should be recreatable from config management, and backing up data on the OSDs is best done on a per-pool basis, because the requirements are going to differ by pool and not by OSD.
[ceph-users] Ceph OSD nodes in XenServer VMs
Hi all,

We are experimenting with an idea to run OSD nodes in XenServer VMs. We believe this could provide better flexibility, backups for the nodes, etc.

For example: a XenServer with 4 HDDs dedicated to Ceph. We would introduce 1 VM (an OSD node) with raw/direct access to the 4 HDDs, or 2 VMs (2 OSD nodes) with 2 HDDs each.

Do you have any experience with this? Any thoughts? Good or bad idea?

Thank you, Jiri
Re: [ceph-users] mount error: ceph filesystem not supported by the system
Hi, I can answer this myself. It was a kernel issue. After upgrading to the latest Debian Jessie kernel (3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17) x86_64 GNU/Linux), everything started to work as normal. Thanks :)

On 6/08/2015 22:38, Jiri Kanicky wrote:

Hi,

I am trying to mount my CephFS and getting the following message. It was all working previously, but after a power failure I am not able to mount it anymore (Debian Jessie).

cephadmin@maverick:/etc/ceph$ sudo mount -t ceph ceph1.allsupp.corp,ceph2.allsupp.corp:6789:/ /mnt/cephdata/ -o name=admin,secretfile=/etc/ceph/admin.secret
modprobe: ERROR: could not insert 'ceph': Unknown symbol in module, or unknown parameter (see dmesg)
failed to load ceph kernel module (1)
mount error: ceph filesystem not supported by the system

Ceph seems to be healthy (ignore the PGs).

cephadmin@ceph1:~$ ceph status
    cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
     health HEALTH_WARN too many PGs per OSD (384 > max 300)
     monmap e2: 3 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0,ceph3=192.168.30.23:6789/0}
            election epoch 100, quorum 0,1,2 ceph1,ceph2,ceph3
     mdsmap e98: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
     osdmap e773: 4 osds: 4 up, 4 in
      pgmap v457296: 768 pgs, 3 pools, 2020 GB data, 574 kobjects
            4804 GB used, 6350 GB / 11158 GB avail
                 768 active+clean

Any ideas?

Thank you, Jiri
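[Editor's sketch] If anyone hits the same "Unknown symbol in module" error, a quick sanity check of the kernel module on a similar Debian setup is:

$ uname -r                      # confirm which kernel is actually running
$ sudo modprobe ceph && echo ok # should load cleanly; check dmesg otherwise
$ grep ceph /proc/filesystems   # "ceph" must appear here before mount can work

The error usually means the ceph.ko on disk was built for a different kernel than the one booted, which is why the kernel upgrade (and reboot) fixed it.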
[ceph-users] mount error: ceph filesystem not supported by the system
Hi,

I am trying to mount my CephFS and getting the following message. It was all working previously, but after a power failure I am not able to mount it anymore (Debian Jessie).

cephadmin@maverick:/etc/ceph$ sudo mount -t ceph ceph1.allsupp.corp,ceph2.allsupp.corp:6789:/ /mnt/cephdata/ -o name=admin,secretfile=/etc/ceph/admin.secret
modprobe: ERROR: could not insert 'ceph': Unknown symbol in module, or unknown parameter (see dmesg)
failed to load ceph kernel module (1)
mount error: ceph filesystem not supported by the system

Ceph seems to be healthy (ignore the PGs).

cephadmin@ceph1:~$ ceph status
    cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
     health HEALTH_WARN too many PGs per OSD (384 > max 300)
     monmap e2: 3 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0,ceph3=192.168.30.23:6789/0}
            election epoch 100, quorum 0,1,2 ceph1,ceph2,ceph3
     mdsmap e98: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
     osdmap e773: 4 osds: 4 up, 4 in
      pgmap v457296: 768 pgs, 3 pools, 2020 GB data, 574 kobjects
            4804 GB used, 6350 GB / 11158 GB avail
                 768 active+clean

Any ideas?

Thank you, Jiri
[ceph-users] Ceph config files
Hi,

I just added a new monitor (MON). "$ ceph status" shows the monitor in the quorum, but the new monitor is not listed in /etc/ceph/ceph.conf. I am wondering what role /etc/ceph/ceph.conf plays. Do I need to manually edit the file on each node and add the monitors?

In addition, there are almost identical config files in my ceph-deploy working directory, but they can differ from what is in /etc/ceph/ceph.conf. What role do these files play?

Please can someone explain the function of the Ceph config files?

Thank you, Jiri
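[Editor's note] For reference, the monitor-related part of /etc/ceph/ceph.conf typically looks like the sketch below (hostnames and addresses are examples). The conf file only tells clients and daemons where to find an initial monitor; the authoritative monitor membership lives in the cluster's monmap, which is why "ceph status" can show a mon that the file does not list:

[global]
mon_initial_members = ceph1, ceph2, ceph3
mon_host = 192.168.30.21, 192.168.30.22, 192.168.30.23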
Re: [ceph-users] PGs degraded with 3 MONs and 1 OSD node
Hi,

BTW, is there a way to achieve redundancy over multiple OSDs in one box by changing the CRUSH map?

Thank you, Jiri

On 20/01/2015 13:37, Jiri Kanicky wrote:

Hi,

Thanks for the reply. That clarifies it. I thought that redundancy could be achieved with multiple OSDs (like multiple disks in RAID) in case you don't have more nodes. Obviously the single point of failure would be the box.

My current setting is: osd_pool_default_size = 2

Thank you, Jiri

On 20/01/2015 13:13, Lindsay Mathieson wrote:

You only have one OSD node (ceph4). The default replication requirements for your pools (size = 3) require OSDs spread over three nodes, so the data can be replicated on three different nodes. That will be why your PGs are degraded.

You need to either add more OSD nodes or reduce your size setting down to the number of OSD nodes you have. Setting your size to 1 would be a bad idea; there would be no redundancy in your data at all. Losing one disk would destroy all your data.

The command to see your pool size is: sudo ceph osd pool get <pool> size

Assuming a default setup: ceph osd pool get rbd size returns: 3

On 20 January 2015 at 10:51, Jiri Kanicky wrote:

Hi,

I just would like to clarify whether I should expect degraded PGs with 11 OSDs in one node. I am not sure if a setup with 3 MON nodes and 1 OSD node (11 disks) allows me to have a healthy cluster.

$ sudo ceph osd pool create test 512
pool 'test' created
$ sudo ceph status
    cluster 4e77327a-118d-450d-ab69-455df6458cd4
     health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean; 512 pgs undersized
     monmap e1: 3 mons at {ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0}, election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
     osdmap e190: 11 osds: 11 up, 11 in
      pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
            53724 kB used, 9709 GB / 9720 GB avail
                 512 active+undersized+degraded

$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      9.45    root default
-2      9.45            host ceph4
0       0.45                    osd.0   up      1
1       0.9                     osd.1   up      1
2       0.9                     osd.2   up      1
3       0.9                     osd.3   up      1
4       0.9                     osd.4   up      1
5       0.9                     osd.5   up      1
6       0.9                     osd.6   up      1
7       0.9                     osd.7   up      1
8       0.9                     osd.8   up      1
9       0.9                     osd.9   up      1
10      0.9                     osd.10  up      1

Thank you, Jiri

--
Lindsay
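[Editor's sketch] For the record, this is possible: the failure domain in the replicated rule can be lowered from host to osd. A sketch of the usual round-trip edit (file names are examples):

$ ceph osd getcrushmap -o crush.bin
$ crushtool -d crush.bin -o crush.txt
# in the replicated ruleset, change:
#   step chooseleaf firstn 0 type host
# to:
#   step chooseleaf firstn 0 type osd
$ crushtool -c crush.txt -o crush.new
$ ceph osd setcrushmap -i crush.new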
Re: [ceph-users] PGs degraded with 3 MONs and 1 OSD node
Hi,

Thanks for the reply. That clarifies it. I thought that redundancy could be achieved with multiple OSDs (like multiple disks in RAID) in case you don't have more nodes. Obviously the single point of failure would be the box.

My current setting is: osd_pool_default_size = 2

Thank you, Jiri

On 20/01/2015 13:13, Lindsay Mathieson wrote:

You only have one OSD node (ceph4). The default replication requirements for your pools (size = 3) require OSDs spread over three nodes, so the data can be replicated on three different nodes. That will be why your PGs are degraded.

You need to either add more OSD nodes or reduce your size setting down to the number of OSD nodes you have. Setting your size to 1 would be a bad idea; there would be no redundancy in your data at all. Losing one disk would destroy all your data.

The command to see your pool size is: sudo ceph osd pool get <pool> size

Assuming a default setup: ceph osd pool get rbd size returns: 3

On 20 January 2015 at 10:51, Jiri Kanicky wrote:

Hi,

I just would like to clarify whether I should expect degraded PGs with 11 OSDs in one node. I am not sure if a setup with 3 MON nodes and 1 OSD node (11 disks) allows me to have a healthy cluster.

$ sudo ceph osd pool create test 512
pool 'test' created
$ sudo ceph status
    cluster 4e77327a-118d-450d-ab69-455df6458cd4
     health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean; 512 pgs undersized
     monmap e1: 3 mons at {ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0}, election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
     osdmap e190: 11 osds: 11 up, 11 in
      pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
            53724 kB used, 9709 GB / 9720 GB avail
                 512 active+undersized+degraded

$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      9.45    root default
-2      9.45            host ceph4
0       0.45                    osd.0   up      1
1       0.9                     osd.1   up      1
2       0.9                     osd.2   up      1
3       0.9                     osd.3   up      1
4       0.9                     osd.4   up      1
5       0.9                     osd.5   up      1
6       0.9                     osd.6   up      1
7       0.9                     osd.7   up      1
8       0.9                     osd.8   up      1
9       0.9                     osd.9   up      1
10      0.9                     osd.10  up      1

Thank you, Jiri

--
Lindsay
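[Editor's note] osd_pool_default_size only applies to pools created after it is set; existing pools still need to be updated by hand. For a two-copy setup on the rbd pool that would be (pool name as an example):

$ sudo ceph osd pool set rbd size 2
$ sudo ceph osd pool set rbd min_size 1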
Re: [ceph-users] PGs degraded with 3 MONs and 1 OSD node
Hi. I am just curious. This is just a lab environment and we are short on hardware :). We will have more hardware later, but right now this is all I have. The monitors are VMs. Anyway, we will have to survive with this somehow :).

Thanks, Jiri

On 20/01/2015 15:33, Lindsay Mathieson wrote:

On 20 January 2015 at 14:10, Jiri Kanicky wrote:
> Hi, BTW, is there a way to achieve redundancy over multiple OSDs in one box by changing the CRUSH map? Thank you, Jiri

I asked that same question myself a few weeks back :) The answer was yes - but fiddly, and why would you do that? It's kind of breaking the purpose of Ceph, which is large amounts of data stored redundantly over multiple nodes.

Perhaps you should re-examine your requirements. If what you want is data redundantly stored on hard disks on one node, perhaps you would be better served by creating a ZFS raid setup. With just one node it would be easier and more flexible - better performance as well.

Alternatively, could you put some OSDs on your monitor nodes? What spec are they?

On 20/01/2015 13:37, Jiri Kanicky wrote:
Hi, Thanks for the reply. That clarifies it. I thought that redundancy could be achieved with multiple OSDs (like multiple disks in RAID) in case you don't have more nodes. Obviously the single point of failure would be the box. My current setting is: osd_pool_default_size = 2. Thank you, Jiri

On 20/01/2015 13:13, Lindsay Mathieson wrote:
You only have one OSD node (ceph4). The default replication requirements for your pools (size = 3) require OSDs spread over three nodes, so the data can be replicated on three different nodes. That will be why your PGs are degraded. You need to either add more OSD nodes or reduce your size setting down to the number of OSD nodes you have. Setting your size to 1 would be a bad idea; there would be no redundancy in your data at all. Losing one disk would destroy all your data. The command to see your pool size is: sudo ceph osd pool get <pool> size. Assuming a default setup: ceph osd pool get rbd size returns: 3

On 20 January 2015 at 10:51, Jiri Kanicky wrote:
Hi, I just would like to clarify whether I should expect degraded PGs with 11 OSDs in one node. I am not sure if a setup with 3 MON nodes and 1 OSD node (11 disks) allows me to have a healthy cluster.

$ sudo ceph osd pool create test 512
pool 'test' created
$ sudo ceph status
    cluster 4e77327a-118d-450d-ab69-455df6458cd4
     health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean; 512 pgs undersized
     monmap e1: 3 mons at {ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0}, election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
     osdmap e190: 11 osds: 11 up, 11 in
      pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
            53724 kB used, 9709 GB / 9720 GB avail
                 512 active+undersized+degraded

$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      9.45    root default
-2      9.45            host ceph4
0       0.45                    osd.0   up      1
1       0.9                     osd.1   up      1
2       0.9                     osd.2   up      1
3       0.9                     osd.3   up      1
4       0.9                     osd.4   up      1
5       0.9                     osd.5   up      1
6       0.9                     osd.6   up      1
7       0.9                     osd.7   up      1
8       0.9                     osd.8   up      1
9       0.9                     osd.9   up      1
10      0.9                     osd.10  up      1

Thank you, Jiri

--
Lindsay
[ceph-users] PGs degraded with 3 MONs and 1 OSD node
Hi,

I just would like to clarify whether I should expect degraded PGs with 11 OSDs in one node. I am not sure if a setup with 3 MON nodes and 1 OSD node (11 disks) allows me to have a healthy cluster.

$ sudo ceph osd pool create test 512
pool 'test' created
$ sudo ceph status
    cluster 4e77327a-118d-450d-ab69-455df6458cd4
     health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean; 512 pgs undersized
     monmap e1: 3 mons at {ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0}, election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
     osdmap e190: 11 osds: 11 up, 11 in
      pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
            53724 kB used, 9709 GB / 9720 GB avail
                 512 active+undersized+degraded

$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      9.45    root default
-2      9.45            host ceph4
0       0.45                    osd.0   up      1
1       0.9                     osd.1   up      1
2       0.9                     osd.2   up      1
3       0.9                     osd.3   up      1
4       0.9                     osd.4   up      1
5       0.9                     osd.5   up      1
6       0.9                     osd.6   up      1
7       0.9                     osd.7   up      1
8       0.9                     osd.8   up      1
9       0.9                     osd.9   up      1
10      0.9                     osd.10  up      1

Thank you, Jiri
Re: [ceph-users] Giant on Centos 7 with custom cluster name
Hi,

I have upgraded Firefly to Giant on Debian Wheezy and it went without any problems.

Jiri

On 16/01/2015 06:49, Erik McCormick wrote:

Hello all,

I've got an existing Firefly cluster on CentOS 7 which I deployed with ceph-deploy. The latest version of ceph-deploy refuses to handle commands issued with a cluster name:

[ceph_deploy.install][ERROR ] custom cluster names are not supported on sysvinit hosts

This is a production cluster. Small, but still production. Is it safe to go through manually upgrading the packages? I'd hate to do the upgrade and find out I can no longer start the cluster because it can't be called anything other than "ceph".

Thanks, Erik
Re: [ceph-users] CEPH Expansion
Hi George,

List available disks:
$ ceph-deploy disk list {node-name [node-name]...}

Add an OSD using osd create:
$ ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]

Or you can use the manual steps to prepare and activate a disk described at http://ceph.com/docs/master/start/quick-ceph-deploy/#expanding-your-cluster

Jiri

On 15/01/2015 06:36, Georgios Dimitrakakis wrote:

Hi all!

I would like to expand our Ceph cluster and add a second OSD node. In this node I will have ten 4TB disks dedicated to Ceph. What is the proper way of adding them to the already running cluster?

I guess the first thing to do is to prepare them with ceph-deploy and mark them as "out" at preparation. I should then restart the services and add (mark as "in") one of them, wait for the rebalance to finish, and then add the second, and so on. Is this safe enough? How long do you expect the rebalancing procedure to take? I already have ten more 4TB disks at another node, and the amount of data is around 40GB with a 2x replication factor. The connection is over Gigabit.

Best, George
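[Editor's sketch] Another common way to ease new disks in is to add each OSD with a low CRUSH weight and raise it in steps, waiting for the cluster to settle in between (the OSD id and weights are examples; the final weight usually reflects the disk's size in TB):

$ ceph osd crush reweight osd.10 0.5   # start low
$ ceph -w                              # watch recovery settle to HEALTH_OK
$ ceph osd crush reweight osd.10 3.64  # then step up toward the final weight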
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
Hi Nico,

I would probably recommend upgrading to 0.87 (Giant). I have been running this version for some time now and it works very well. I also upgraded from Firefly and it was easy. The issue you are experiencing seems quite complex and would require debug logs to troubleshoot. Apologies that I could not help more.

-Jiri

On 9/01/2015 20:23, Nico Schottelius wrote:

Good morning Jiri,

sure, let me catch up on this:
- Kernel: 3.16
- ceph: 0.80.7
- fs: xfs
- os: debian (backports) (1x) / ubuntu (2x)

Cheers, Nico

Jiri Kanicky [Fri, Jan 09, 2015 at 10:44:33AM +1100]:

Hi Nico. If you are experiencing such issues it would be good if you provide more info about your deployment: ceph version, kernel versions, OS, filesystem btrfs/xfs.

Thx, Jiri

- Reply message -
From: "Nico Schottelius"
Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
Date: Wed, Dec 31, 2014 02:36

Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the rbd devices, mounting them on a host and rsync'ing away as much as possible. However, after some time rsync got completely stuck, and eventually the host which mounted the rbd-mapped devices decided to kernel panic, at which time we decided to drop the pool and go with a backup.

This story and the one from Christian make me wonder: is anyone using Ceph as a backend for qemu VM images in production? And: has anyone on the list been able to recover from a pg incomplete / stuck situation like ours? Reading about the issues on the list here gives me the impression that Ceph as a software is stuck/incomplete and has not yet become ready "clean" for production (sorry for the word joke).

Cheers, Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:

Hi Nico and all others who answered,

After some more trying to somehow get the PGs into a working state (I've tried force_create_pg, which was putting them into the creating state; but that was obviously not true, since after rebooting one of the containing OSDs they went back to incomplete), I decided to save what can be saved.

I've created a new pool, created a new image there, and mapped the old image from the old pool and the new image from the new pool to a machine, to copy the data at the POSIX level. Unfortunately, formatting the image from the new pool hangs after some time. So it seems that the new pool is suffering from the same problem as the old pool, which is totally not understandable to me.

Right now, it seems like Ceph is giving me no options to either save some of the still-intact rbd volumes, or to create a new pool alongside the old one to at least enable our clients to send data to Ceph again. To tell the truth, I guess that will result in the end of our Ceph project (running for 9 months already).

Regards, Christian

On 29.12.2014 15:59, Nico Schottelius wrote:

Hey Christian,

Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
> [incomplete PG / RBD hanging, osd lost also not helping]

that is very interesting to hear, because we had a similar situation with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg directories to allow OSDs to start after the disk filled up completely.

So I am sorry not to be able to give you a good hint, but I am very interested in seeing your problem solved, as it is a show stopper for us, too. (*)

Cheers, Nico

(*) We migrated from sheepdog to gluster to ceph, and so far sheepdog seems to run much smoother. The first one is however not supported by opennebula directly, the second one not flexible enough to host our heterogeneous infrastructure (mixed disk sizes/amounts) - so we are using ceph at the moment.

--
Christian Eichelmann
Systemadministrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721 91374-8026
christian.eichelm...@1und1.de

Amtsgericht Montabaur / HRB 6484
Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Aufsichtsratsvorsitzender: Michael Scheeren

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
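[Editor's sketch] For others who land in a similar incomplete-PG state, the usual first diagnostic steps are standard commands (the PG id is an example):

$ ceph health detail              # lists the incomplete/stuck PGs and acting OSDs
$ ceph pg dump_stuck inactive
$ ceph pg 2.17 query              # shows why the PG cannot peer

None of this fixes anything by itself, but the "query" output is what the developers will ask for when debugging a peering problem like the one in this thread.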
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
Hi Nico. If you are experiencing such issues it would be good if you provide more info about your deployment: ceph version, kernel versions, OS, filesystem btrfs/xfs.

Thx, Jiri

- Reply message -
From: "Nico Schottelius"
Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
Date: Wed, Dec 31, 2014 02:36

Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the rbd devices, mounting them on a host and rsync'ing away as much as possible. However, after some time rsync got completely stuck, and eventually the host which mounted the rbd-mapped devices decided to kernel panic, at which time we decided to drop the pool and go with a backup.

This story and the one from Christian make me wonder: is anyone using Ceph as a backend for qemu VM images in production? And: has anyone on the list been able to recover from a pg incomplete / stuck situation like ours? Reading about the issues on the list here gives me the impression that Ceph as a software is stuck/incomplete and has not yet become ready "clean" for production (sorry for the word joke).

Cheers, Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:

> Hi Nico and all others who answered,
>
> After some more trying to somehow get the PGs into a working state (I've tried force_create_pg, which was putting them into the creating state; but that was obviously not true, since after rebooting one of the containing OSDs they went back to incomplete), I decided to save what can be saved.
>
> I've created a new pool, created a new image there, and mapped the old image from the old pool and the new image from the new pool to a machine, to copy the data at the POSIX level.
>
> Unfortunately, formatting the image from the new pool hangs after some time. So it seems that the new pool is suffering from the same problem as the old pool, which is totally not understandable to me.
>
> Right now, it seems like Ceph is giving me no options to either save some of the still-intact rbd volumes, or to create a new pool alongside the old one to at least enable our clients to send data to Ceph again. To tell the truth, I guess that will result in the end of our Ceph project (running for 9 months already).
>
> Regards,
> Christian
>
> On 29.12.2014 15:59, Nico Schottelius wrote:
> > Hey Christian,
> >
> > Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
> >> [incomplete PG / RBD hanging, osd lost also not helping]
> >
> > that is very interesting to hear, because we had a similar situation with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg directories to allow OSDs to start after the disk filled up completely.
> >
> > So I am sorry not to be able to give you a good hint, but I am very interested in seeing your problem solved, as it is a show stopper for us, too. (*)
> >
> > Cheers,
> > Nico
> >
> > (*) We migrated from sheepdog to gluster to ceph, and so far sheepdog
> > seems to run much smoother. The first one is however not supported
> > by opennebula directly, the second one not flexible enough to host
> > our heterogeneous infrastructure (mixed disk sizes/amounts) - so we
> > are using ceph at the moment.
>
> --
> Christian Eichelmann
> Systemadministrator
>
> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
> Brauerstraße 48 · DE-76135 Karlsruhe
> Telefon: +49 721 91374-8026
> christian.eichelm...@1und1.de
>
> Amtsgericht Montabaur / HRB 6484
> Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
> Aufsichtsratsvorsitzender: Michael Scheeren

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
Re: [ceph-users] cephfs usable or not?
Hi Max,

Thanks for this info. I am planning to use CephFS (ceph version 0.87) at home, because it's more convenient than NFS over RBD. I don't have a large environment, about 20TB, so hopefully it will hold. I back up all important data just in case. :)

Thank you, Jiri

On 29/12/2014 21:09, Thomas Lemarchand wrote:

Hi Max,

I do use CephFS (Giant) in a production environment. It works really well, but I have backups ready to use, just in case.

As Wido said, the kernel version is not really relevant if you use ceph-fuse (which I recommend over the cephfs kernel client, for stability and ease-of-upgrade reasons). However, I found ceph-mds memory usage hard to predict, and I had some problems with that. At first it was undersized (16GB, for ~8M files / dirs, and 1M inodes cached), but it worked well until I had a server crash that did not recover (mds rejoin / rebuild) because of the lack of memory. So I gave it 24GB memory + 24GB swap; no problem anymore.
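[Editor's sketch] For reference, mounting with ceph-fuse instead of the kernel client is short once the admin keyring is in /etc/ceph (the monitor address and mount point are examples):

$ sudo apt-get install ceph-fuse
$ sudo ceph-fuse -m 192.168.30.21:6789 /mnt/cephfs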
[ceph-users] pg repair unsuccessful
Hi,

I have been experiencing issues with several PGs which remain in an inconsistent state (I use BTRFS). "ceph pg repair" is not able to repair them. The only fix I have found is to delete the corresponding file causing the issue (see logs below) from the OSDs. This, however, means loss of data. Is there any other way to fix it?

$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.17 is active+clean+inconsistent, acting [1,3]
1 scrub errors

Log output:

2015-01-07 21:43:13.396376 7f0c5ac53700 -1 log_channel(default) log [ERR] : repair 2.17 f2a47417/100f485./head//2 on disk size (4194304) does not match object info size (0) adjusted for ondisk to (0)
2015-01-07 21:43:56.771820 7f0c5ac53700 -1 log_channel(default) log [ERR] : 2.17 repair 1 errors, 0 fixed
2015-01-07 21:44:10.473870 7f0c5ac53700 -1 log_channel(default) log [ERR] : deep-scrub 2.17 f2a47417/100f485./head//2 on disk size (4194304) does not match object info size (0) adjusted for ondisk to (0)
2015-01-07 21:44:42.919425 7f0c5ac53700 -1 log_channel(default) log [ERR] : 2.17 deep-scrub 1 errors

Thx, Jiri

Ceph 0.87, Debian Wheezy, BTRFS
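[Editor's sketch] A hedged outline of the manual approach for a size-mismatch object like this, assuming the default filestore layout (the paths and the object-name fragment are examples derived from the log above):

$ ceph health detail                      # identifies pg 2.17, acting [1,3]
$ sudo find /var/lib/ceph/osd/ceph-1/current/2.17_head -name '*100f485*'
$ sudo find /var/lib/ceph/osd/ceph-3/current/2.17_head -name '*100f485*'
# compare the two copies; move the bad one aside (do not delete it yet), then:
$ ceph pg repair 2.17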
Re: [ceph-users] OSDs with btrfs are down
Hi. Do you know how to tell whether the option "filestore btrfs snap = false" is set?

Thx, Jiri

On 5/01/2015 02:25, Jiri Kanicky wrote:

Hi.

I have been experiencing the same issues on both nodes over the past 2 days (never both nodes at the same time). It seems the issue occurs after some time when copying a large number of files to CephFS from my client node (I don't use RBD yet).

These are new HP servers and the memory does not seem to have any issues in memtest. I use an SSD for the OS and normal drives for the OSDs. I think the issue is not related to the drives, as it would be too much of a coincidence to have 6 drives with bad blocks on both nodes.

I will also disable the snapshots and report back after a few days.

Thx, Jiri

On 5/01/2015 01:33, Dyweni - Ceph-Users wrote:

On 2015-01-04 08:21, Jiri Kanicky wrote:
> More googling took me to the following post: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html
> "Linux 3.14.1 is affected by serious Btrfs regression(s) that were fixed in later releases. Unfortunately even the latest Linux can crash and corrupt a Btrfs file system if OSDs are using snapshots (which is the default). Due to kernel bugs related to Btrfs snapshots I also lost some OSDs until I found that snapshotting can be disabled with "filestore btrfs snap = false"."
> I am wondering if this can be the problem.

Very interesting... I think I was just hit with that overnight. :) Yes, I would definitely recommend turning off snapshots. I'm going to do that myself now.

Have you tested the memory in your server lately? Memtest86+ on the RAM, and badblocks on the SSD swap partition?
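[Editor's note] To answer the question above: a running OSD reports its effective value over the admin socket (osd.0 and the default socket path are examples):

$ sudo ceph daemon osd.0 config get filestore_btrfs_snap
# or, equivalently:
$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep btrfs_snap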
Re: [ceph-users] OSDs with btrfs are down
Hi.

I have been experiencing the same issues on both nodes over the past 2 days (never both nodes at the same time). It seems the issue occurs after some time when copying a large number of files to CephFS from my client node (I don't use RBD yet).

These are new HP servers and the memory does not seem to have any issues in memtest. I use an SSD for the OS and normal drives for the OSDs. I think the issue is not related to the drives, as it would be too much of a coincidence to have 6 drives with bad blocks on both nodes.

I will also disable the snapshots and report back after a few days.

Thx, Jiri

On 5/01/2015 01:33, Dyweni - Ceph-Users wrote:

On 2015-01-04 08:21, Jiri Kanicky wrote:
> More googling took me to the following post: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html
> "Linux 3.14.1 is affected by serious Btrfs regression(s) that were fixed in later releases. Unfortunately even the latest Linux can crash and corrupt a Btrfs file system if OSDs are using snapshots (which is the default). Due to kernel bugs related to Btrfs snapshots I also lost some OSDs until I found that snapshotting can be disabled with "filestore btrfs snap = false"."
> I am wondering if this can be the problem.

Very interesting... I think I was just hit with that overnight. :) Yes, I would definitely recommend turning off snapshots. I'm going to do that myself now.

Have you tested the memory in your server lately? Memtest86+ on the RAM, and badblocks on the SSD swap partition?
Re: [ceph-users] OSDs with btrfs are down
Hi. Correction: my swap is 3GB, on the SSD disk. I don't use the nodes for client stuff.

Thx, Jiri

On 5/01/2015 01:21, Jiri Kanicky wrote:

Hi,

Here is my memory output. I use HP MicroServers with 2GB RAM. Swap is 500MB on the SSD disk.

cephadmin@ceph1:~$ free
             total       used       free     shared    buffers     cached
Mem:       1885720    1817860      67860          0         32     694552
-/+ buffers/cache:    1123276     762444
Swap:      3859452     633492    3225960

More googling took me to the following post: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html

"Linux 3.14.1 is affected by serious Btrfs regression(s) that were fixed in later releases. Unfortunately even the latest Linux can crash and corrupt a Btrfs file system if OSDs are using snapshots (which is the default). Due to kernel bugs related to Btrfs snapshots I also lost some OSDs until I found that snapshotting can be disabled with "filestore btrfs snap = false"."

I am wondering if this can be the problem.

-Jiri

On 5/01/2015 01:17, Dyweni - BTRFS wrote:

Hi,

BTRFS crashed because the system ran out of memory... I see these entries in your logs:

Jan 4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page allocation failure: order:1, mode:0x204020
Jan 4 17:11:06 ceph1 kernel: [756636.536112] BTRFS: error (device sdb1) in create_pending_snapshot:1334: errno=-12 Out of memory
Jan 4 17:11:06 ceph1 kernel: [756636.536135] BTRFS: error (device sdb1) in cleanup_transaction:1577: errno=-12 Out of memory

How much memory do you have in this node? Were you using Ceph (as a client) on this node? Do you have swap configured on this node?

On 2015-01-04 07:12, Jiri Kanicky wrote:

Hi,

My OSDs with btrfs are down on one node. I found the cluster in this state:

cephadmin@ceph1:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   down    0
1       2.72                    osd.1   down    0
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1

cephadmin@ceph1:~$ ceph status
    cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
     health HEALTH_ERR 645 pgs degraded; 29 pgs inconsistent; 14 pgs recovering; 645 pgs stuck degraded; 768 pgs stuck unclean; 631 pgs stuck undersized; 631 pgs undersized; recovery 397226/915548 objects degraded (43.387%); 72026/915548 objects misplaced (7.867%); 783 scrub errors
     monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 30, quorum 0,1 ceph1,ceph2
     mdsmap e30: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
     osdmap e242: 4 osds: 2 up, 2 in
      pgmap v38318: 768 pgs, 3 pools, 1572 GB data, 447 kobjects
            1811 GB used, 3764 GB / 5579 GB avail
            397226/915548 objects degraded (43.387%); 72026/915548 objects misplaced (7.867%)
                  14 active+recovering+degraded+remapped
                 122 active+remapped
                   1 active+remapped+inconsistent
                 603 active+undersized+degraded
                  28 active+undersized+degraded+inconsistent

Would you know if this is a pure BTRFS issue, or is there any setting I forgot to use?

Jan 4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page allocation failure: order:1, mode:0x204020
Jan 4 17:11:06 ceph1 kernel: [756636.535669] CPU: 0 PID: 62644 Comm: kworker/0:2 Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt2-1~bpo70+1
Jan 4 17:11:06 ceph1 kernel: [756636.535671] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 11/09/2013
Jan 4 17:11:06 ceph1 kernel: [756636.535701] Workqueue: events do_async_commit [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535704] 0001 81541f8f 00204020
Jan 4 17:11:06 ceph1 kernel: [756636.535707] 811519ed 0001 880075de0c00 0002
Jan 4 17:11:06 ceph1 kernel: [756636.535710] 0001 880075de0c08 0096
Jan 4 17:11:06 ceph1 kernel: [756636.535713] Call Trace:
Jan 4 17:11:06 ceph1 kernel: [756636.535720] [] ? dump_stack+0x41/0x51
Jan 4 17:11:06 ceph1 kernel: [756636.535725] [] ? warn_alloc_failed+0xfd/0x160
Jan 4 17:11:06 ceph1 kernel: [756636.535730] [] ? __alloc_pages_nodemask+0x91f/0xbb0
Jan 4 17:11:06 ceph1 kernel: [756636.535734] [] ? kmem_getpages+0x60/0x110
Jan 4 17:11:06 ceph1 kernel: [756636.535737] [] ? fallback_alloc+0x158/0x220
Jan 4 17:11:06 ceph1 kernel: [756636.535741] [] ? kmem_cache_alloc+0x1a4/0x1e0
Jan 4 17:11:06 ceph1 kernel: [756636.535745] [] ? ida_pre_get+0x60/0xd0
Jan 4 17:11:06 ceph1 kernel: [756636.535749] [] ? get_anon_bdev+0x21/0xe0
Jan 4 17:11:06 ceph1 kernel: [756636.535762] [] ? btrfs_init_fs_root+0xff/0x1b0 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535774] [] ? btrfs_read_fs_root+0x33/0x40 [btrfs]
Re: [ceph-users] OSDs with btrfs are down
Hi,

Here is my memory output. I use HP MicroServers with 2GB RAM. Swap is 500MB on the SSD disk.

cephadmin@ceph1:~$ free
             total       used       free     shared    buffers     cached
Mem:       1885720    1817860      67860          0         32     694552
-/+ buffers/cache:    1123276     762444
Swap:      3859452     633492    3225960

More googling took me to the following post: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html

"Linux 3.14.1 is affected by serious Btrfs regression(s) that were fixed in later releases. Unfortunately even the latest Linux can crash and corrupt a Btrfs file system if OSDs are using snapshots (which is the default). Due to kernel bugs related to Btrfs snapshots I also lost some OSDs until I found that snapshotting can be disabled with "filestore btrfs snap = false"."

I am wondering if this can be the problem.

-Jiri

On 5/01/2015 01:17, Dyweni - BTRFS wrote:

Hi,

BTRFS crashed because the system ran out of memory... I see these entries in your logs:

Jan 4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page allocation failure: order:1, mode:0x204020
Jan 4 17:11:06 ceph1 kernel: [756636.536112] BTRFS: error (device sdb1) in create_pending_snapshot:1334: errno=-12 Out of memory
Jan 4 17:11:06 ceph1 kernel: [756636.536135] BTRFS: error (device sdb1) in cleanup_transaction:1577: errno=-12 Out of memory

How much memory do you have in this node? Were you using Ceph (as a client) on this node? Do you have swap configured on this node?

On 2015-01-04 07:12, Jiri Kanicky wrote:

Hi,

My OSDs with btrfs are down on one node. I found the cluster in this state:

cephadmin@ceph1:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   down    0
1       2.72                    osd.1   down    0
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1

cephadmin@ceph1:~$ ceph status
    cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
     health HEALTH_ERR 645 pgs degraded; 29 pgs inconsistent; 14 pgs recovering; 645 pgs stuck degraded; 768 pgs stuck unclean; 631 pgs stuck undersized; 631 pgs undersized; recovery 397226/915548 objects degraded (43.387%); 72026/915548 objects misplaced (7.867%); 783 scrub errors
     monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 30, quorum 0,1 ceph1,ceph2
     mdsmap e30: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
     osdmap e242: 4 osds: 2 up, 2 in
      pgmap v38318: 768 pgs, 3 pools, 1572 GB data, 447 kobjects
            1811 GB used, 3764 GB / 5579 GB avail
            397226/915548 objects degraded (43.387%); 72026/915548 objects misplaced (7.867%)
                  14 active+recovering+degraded+remapped
                 122 active+remapped
                   1 active+remapped+inconsistent
                 603 active+undersized+degraded
                  28 active+undersized+degraded+inconsistent

Would you know if this is a pure BTRFS issue, or is there any setting I forgot to use?

Jan 4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page allocation failure: order:1, mode:0x204020
Jan 4 17:11:06 ceph1 kernel: [756636.535669] CPU: 0 PID: 62644 Comm: kworker/0:2 Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt2-1~bpo70+1
Jan 4 17:11:06 ceph1 kernel: [756636.535671] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 11/09/2013
Jan 4 17:11:06 ceph1 kernel: [756636.535701] Workqueue: events do_async_commit [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535704] 0001 81541f8f 00204020
Jan 4 17:11:06 ceph1 kernel: [756636.535707] 811519ed 0001 880075de0c00 0002
Jan 4 17:11:06 ceph1 kernel: [756636.535710] 0001 880075de0c08 0096
Jan 4 17:11:06 ceph1 kernel: [756636.535713] Call Trace:
Jan 4 17:11:06 ceph1 kernel: [756636.535720] [] ? dump_stack+0x41/0x51
Jan 4 17:11:06 ceph1 kernel: [756636.535725] [] ? warn_alloc_failed+0xfd/0x160
Jan 4 17:11:06 ceph1 kernel: [756636.535730] [] ? __alloc_pages_nodemask+0x91f/0xbb0
Jan 4 17:11:06 ceph1 kernel: [756636.535734] [] ? kmem_getpages+0x60/0x110
Jan 4 17:11:06 ceph1 kernel: [756636.535737] [] ? fallback_alloc+0x158/0x220
Jan 4 17:11:06 ceph1 kernel: [756636.535741] [] ? kmem_cache_alloc+0x1a4/0x1e0
Jan 4 17:11:06 ceph1 kernel: [756636.535745] [] ? ida_pre_get+0x60/0xd0
Jan 4 17:11:06 ceph1 kernel: [756636.535749] [] ? get_anon_bdev+0x21/0xe0
Jan 4 17:11:06 ceph1 kernel: [756636.535762] [] ? btrfs_init_fs_root+0xff/0x1b0 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535774] [] ? btrfs_read_fs_root+0x33/0x40 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535785] [] ? btrfs_get_fs_root+0xd6/0x230 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756
[ceph-users] OSDs with btrfs are down
Hi,

My OSDs with btrfs are down on one node. I found the cluster in this state:

cephadmin@ceph1:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   down    0
1       2.72                    osd.1   down    0
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1

cephadmin@ceph1:~$ ceph status
    cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
     health HEALTH_ERR 645 pgs degraded; 29 pgs inconsistent; 14 pgs recovering; 645 pgs stuck degraded; 768 pgs stuck unclean; 631 pgs stuck undersized; 631 pgs undersized; recovery 397226/915548 objects degraded (43.387%); 72026/915548 objects misplaced (7.867%); 783 scrub errors
     monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 30, quorum 0,1 ceph1,ceph2
     mdsmap e30: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
     osdmap e242: 4 osds: 2 up, 2 in
      pgmap v38318: 768 pgs, 3 pools, 1572 GB data, 447 kobjects
            1811 GB used, 3764 GB / 5579 GB avail
            397226/915548 objects degraded (43.387%); 72026/915548 objects misplaced (7.867%)
                  14 active+recovering+degraded+remapped
                 122 active+remapped
                   1 active+remapped+inconsistent
                 603 active+undersized+degraded
                  28 active+undersized+degraded+inconsistent

Would you know if this is a pure BTRFS issue, or is there any setting I forgot to use?

Jan 4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page allocation failure: order:1, mode:0x204020
Jan 4 17:11:06 ceph1 kernel: [756636.535669] CPU: 0 PID: 62644 Comm: kworker/0:2 Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt2-1~bpo70+1
Jan 4 17:11:06 ceph1 kernel: [756636.535671] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 11/09/2013
Jan 4 17:11:06 ceph1 kernel: [756636.535701] Workqueue: events do_async_commit [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535704] 0001 81541f8f 00204020
Jan 4 17:11:06 ceph1 kernel: [756636.535707] 811519ed 0001 880075de0c00 0002
Jan 4 17:11:06 ceph1 kernel: [756636.535710] 0001 880075de0c08 0096
Jan 4 17:11:06 ceph1 kernel: [756636.535713] Call Trace:
Jan 4 17:11:06 ceph1 kernel: [756636.535720] [] ? dump_stack+0x41/0x51
Jan 4 17:11:06 ceph1 kernel: [756636.535725] [] ? warn_alloc_failed+0xfd/0x160
Jan 4 17:11:06 ceph1 kernel: [756636.535730] [] ? __alloc_pages_nodemask+0x91f/0xbb0
Jan 4 17:11:06 ceph1 kernel: [756636.535734] [] ? kmem_getpages+0x60/0x110
Jan 4 17:11:06 ceph1 kernel: [756636.535737] [] ? fallback_alloc+0x158/0x220
Jan 4 17:11:06 ceph1 kernel: [756636.535741] [] ? kmem_cache_alloc+0x1a4/0x1e0
Jan 4 17:11:06 ceph1 kernel: [756636.535745] [] ? ida_pre_get+0x60/0xd0
Jan 4 17:11:06 ceph1 kernel: [756636.535749] [] ? get_anon_bdev+0x21/0xe0
Jan 4 17:11:06 ceph1 kernel: [756636.535762] [] ? btrfs_init_fs_root+0xff/0x1b0 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535774] [] ? btrfs_read_fs_root+0x33/0x40 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535785] [] ? btrfs_get_fs_root+0xd6/0x230 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535796] [] ? create_pending_snapshot+0x793/0xa00 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535807] [] ? create_pending_snapshots+0x89/0xa0 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535818] [] ? btrfs_commit_transaction+0x35a/0xa10 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535824] [] ? mod_timer+0x10e/0x220
Jan 4 17:11:06 ceph1 kernel: [756636.535834] [] ? do_async_commit+0x2a/0x40 [btrfs]
Jan 4 17:11:06 ceph1 kernel: [756636.535839] [] ? process_one_work+0x15c/0x450
Jan 4 17:11:06 ceph1 kernel: [756636.535843] [] ? worker_thread+0x112/0x540
Jan 4 17:11:06 ceph1 kernel: [756636.535847] [] ? create_and_start_worker+0x60/0x60
Jan 4 17:11:06 ceph1 kernel: [756636.535851] [] ? kthread+0xc1/0xe0
Jan 4 17:11:06 ceph1 kernel: [756636.535854] [] ? flush_kthread_worker+0xb0/0xb0
Jan 4 17:11:06 ceph1 kernel: [756636.535858] [] ? ret_from_fork+0x7c/0xb0
Jan 4 17:11:06 ceph1 kernel: [756636.535861] [] ? flush_kthread_worker+0xb0/0xb0
Jan 4 17:11:06 ceph1 kernel: [756636.535863] Mem-Info:
Jan 4 17:11:06 ceph1 kernel: [756636.535865] Node 0 DMA per-cpu:
Jan 4 17:11:06 ceph1 kernel: [756636.535867] CPU0: hi:0, btch: 1 usd: 0
Jan 4 17:11:06 ceph1 kernel: [756636.535869] CPU1: hi:0, btch: 1 usd: 0
Jan 4 17:11:06 ceph1 kernel: [756636.535870] Node 0 DMA32 per-cpu:
Jan 4 17:11:06 ceph1 kernel: [756636.535872] CPU0: hi: 186, btch: 31 usd: 216
Jan 4 17:11:06 ceph1 kernel: [756636.535874] CPU1: hi: 186, btch: 31 usd: 150
Jan 4 17:11:06 ceph1 kernel: [756636.535879] active_anon:156968 inactive_anon:52877 is
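[Editor's sketch] If the btrfs snapshot regressions discussed later in this thread are the culprit, disabling filestore snapshots is a ceph.conf change plus an OSD restart, roughly:

# /etc/ceph/ceph.conf
[osd]
filestore btrfs snap = false

Then restart the OSDs on each node and confirm the effective value, e.g. with: sudo ceph daemon osd.0 config get filestore_btrfs_snap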
Re: [ceph-users] redundancy with 2 nodes
Hi,

I noticed this message after shutting down the other node. You might be right that I need 3 monitors.

2015-01-01 15:47:35.990260 7f22858dd700 0 monclient: hunting for new mon

But what is quite unexpected is that you cannot even run "ceph status" on the running node to find out the state of the cluster.

Thx, Jiri

On 1/01/2015 15:46, Jiri Kanicky wrote:

Hi,

I have:
- 2 monitors, one on each node
- 4 OSDs, two on each node
- 2 MDS, one on each node

Yes, all pools are set with size=2 and min_size=1.

cephadmin@ceph1:~$ ceph osd dump
epoch 88
fsid bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
created 2014-12-27 23:38:00.455097
modified 2014-12-30 20:45:51.343217
flags
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 86 flags hashpspool stripe_width 0
pool 1 'media' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 60 flags hashpspool stripe_width 0
pool 2 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 63 flags hashpspool stripe_width 0
pool 3 'cephfs_test' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 71 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 4 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 69 flags hashpspool stripe_width 0
max_osd 4
osd.0 up in weight 1 up_from 55 up_thru 86 down_at 51 last_clean_interval [39,50) 192.168.30.21:6800/17319 10.1.1.21:6800/17319 10.1.1.21:6801/17319 192.168.30.21:6801/17319 exists,up 4f3172e1-adb8-4ca3-94af-6f0b8fcce35a
osd.1 up in weight 1 up_from 57 up_thru 86 down_at 53 last_clean_interval [41,52) 192.168.30.21:6803/17684 10.1.1.21:6802/17684 10.1.1.21:6804/17684 192.168.30.21:6805/17684 exists,up 1790347a-94fa-4b81-b429-1e7c7f11d3fd
osd.2 up in weight 1 up_from 79 up_thru 86 down_at 74 last_clean_interval [13,73) 192.168.30.22:6801/3178 10.1.1.22:6800/3178 10.1.1.22:6801/3178 192.168.30.22:6802/3178 exists,up 5520835f-c411-4750-974b-34e9aea2585d
osd.3 up in weight 1 up_from 81 up_thru 86 down_at 72 last_clean_interval [20,71) 192.168.30.22:6804/3414 10.1.1.22:6802/3414 10.1.1.22:6803/3414 192.168.30.22:6805/3414 exists,up 25e62059-6392-4a69-99c9-214ae335004

Thx, Jiri

On 1/01/2015 15:21, Lindsay Mathieson wrote:

On Thu, 1 Jan 2015 02:59:05 PM Jiri Kanicky wrote:
> I would expect that if I shut down one node, the system will keep running. But when I tested it, I cannot even execute the "ceph status" command on the running node.

2 OSD nodes, 3 MON nodes here; works perfectly for me. How many monitors do you have? Maybe you need a third monitor-only node for quorum?

> I set "osd_pool_default_size = 2" (min_size=1) on all pools, so I thought that each copy will reside on each node. Which means that if 1 node goes down the second one will be still operational.

does:
ceph osd pool get {pool name} size
return 2

ceph osd pool get {pool name} min_size
return 1
Re: [ceph-users] redundancy with 2 nodes
Hi,

I think you are right. I was too focused on the following line in the docs: "A cluster will run fine with a single monitor; however, *a single monitor is a single-point-of-failure*." I will try to add another monitor. Hopefully this will fix my issue.

Anyway, I think that "ceph status" or "ceph health" should report at least something in such a state. It's quite weird that everything stops...

Thank you, Jiri

On 1/01/2015 15:51, Lindsay Mathieson wrote:

On Thu, 1 Jan 2015 03:46:33 PM Jiri Kanicky wrote:
> Hi, I have:
> - 2 monitors, one on each node
> - 4 OSDs, two on each node
> - 2 MDS, one on each node

POOMA U here, but I don't think you can reach quorum with one out of two monitors; you need an odd number: http://ceph.com/docs/master/rados/configuration/mon-config-ref/#monitor-quorum

Perhaps try removing one monitor, so you only have one left, then take the node without a monitor down.
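[Editor's sketch] With ceph-deploy already in use, adding a third monitor and checking quorum is short (the hostname is an example):

$ ceph-deploy mon add ceph3
$ ceph quorum_status --format json-pretty   # should list all three mons

The underlying arithmetic: monitors need a strict majority. With two mons the majority is 2, so losing either one drops the cluster out of quorum (which is exactly why "ceph status" hung); with three, any single monitor can fail.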
Re: [ceph-users] redundancy with 2 nodes
Hi,

I have:
- 2 monitors, one on each node
- 4 OSDs, two on each node
- 2 MDS, one on each node

Yes, all pools are set with size=2 and min_size=1.

cephadmin@ceph1:~$ ceph osd dump
epoch 88
fsid bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
created 2014-12-27 23:38:00.455097
modified 2014-12-30 20:45:51.343217
flags
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 86 flags hashpspool stripe_width 0
pool 1 'media' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 60 flags hashpspool stripe_width 0
pool 2 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 63 flags hashpspool stripe_width 0
pool 3 'cephfs_test' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 71 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 4 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 69 flags hashpspool stripe_width 0
max_osd 4
osd.0 up in weight 1 up_from 55 up_thru 86 down_at 51 last_clean_interval [39,50) 192.168.30.21:6800/17319 10.1.1.21:6800/17319 10.1.1.21:6801/17319 192.168.30.21:6801/17319 exists,up 4f3172e1-adb8-4ca3-94af-6f0b8fcce35a
osd.1 up in weight 1 up_from 57 up_thru 86 down_at 53 last_clean_interval [41,52) 192.168.30.21:6803/17684 10.1.1.21:6802/17684 10.1.1.21:6804/17684 192.168.30.21:6805/17684 exists,up 1790347a-94fa-4b81-b429-1e7c7f11d3fd
osd.2 up in weight 1 up_from 79 up_thru 86 down_at 74 last_clean_interval [13,73) 192.168.30.22:6801/3178 10.1.1.22:6800/3178 10.1.1.22:6801/3178 192.168.30.22:6802/3178 exists,up 5520835f-c411-4750-974b-34e9aea2585d
osd.3 up in weight 1 up_from 81 up_thru 86 down_at 72 last_clean_interval [20,71) 192.168.30.22:6804/3414 10.1.1.22:6802/3414 10.1.1.22:6803/3414 192.168.30.22:6805/3414 exists,up 25e62059-6392-4a69-99c9-214ae335004

Thx, Jiri

On 1/01/2015 15:21, Lindsay Mathieson wrote:

On Thu, 1 Jan 2015 02:59:05 PM Jiri Kanicky wrote:
> I would expect that if I shut down one node, the system will keep running. But when I tested it, I cannot even execute the "ceph status" command on the running node.

2 OSD nodes, 3 MON nodes here; works perfectly for me. How many monitors do you have? Maybe you need a third monitor-only node for quorum?

> I set "osd_pool_default_size = 2" (min_size=1) on all pools, so I thought that each copy will reside on each node. Which means that if 1 node goes down the second one will be still operational.

does:
ceph osd pool get {pool name} size
return 2

ceph osd pool get {pool name} min_size
return 1
[ceph-users] redundancy with 2 nodes
Hi, Is it possible to achieve redundancy with 2 nodes only? cephadmin@ceph1:~$ ceph osd tree # id weight type name up/down reweight -1 10.88 root default -2 5.44 host ceph1 0 2.72 osd.0 up 1 1 2.72 osd.1 up 1 -3 5.44 host ceph2 2 2.72 osd.2 up 1 3 2.72 osd.3 up 1 cephadmin@ceph1:~$ ceph status cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788 health HEALTH_OK monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 12, quorum 0,1 ceph1,ceph2 mdsmap e7: 1/1/1 up {0=ceph1=up:active}, 1 up:standby osdmap e88: 4 osds: 4 up, 4 in pgmap v2051: 1280 pgs, 5 pools, 13184 MB data, 3328 objects 26457 MB used, 11128 GB / 11158 GB avail 1280 active+clean I would expect that if I shut down one node, the system will keep running. But when I tested it, I cannot even execute the "ceph status" command on the running node. I set "osd_pool_default_size = 2" (min_size=1) on all pools, so I thought that each copy will reside on each node. Which means that if 1 node goes down the second one will still be operational. I think my assumptions are wrong, but I could not find the explanation why. Thanks Jiri ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
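The monitor quorum rule is the explanation for the hang: Paxos needs a strict majority, floor(n/2)+1, so with 2 monitors both must be up (majority of 2 is 2), while 3 monitors tolerate one failure. A minimal sketch of how one might diagnose this from the surviving node (assuming the monitor admin socket is at its default path):

    # With quorum lost, plain "ceph status" simply hangs; bound it with a timeout
    timeout 10 ceph status || echo "no quorum - the surviving monitor cannot elect a leader"

    # Ask the local monitor directly via its admin socket, bypassing quorum
    ceph daemon mon.ceph1 mon_status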
Re: [ceph-users] cephfs kernel module reports error on mount
Hi. I have got the same message in Debian Jessie, while the CephFS mount works fine. Jiri. On 18/12/2014 01:00, John Spray wrote: Hmm, from a quick google it appears you are not the only one who has seen this symptom with mount.ceph. Our mtab code appears to have diverged a bit from the upstream util-linux repo, so it seems entirely possible we have a bug in ours somewhere. I've opened http://tracker.ceph.com/issues/10351 to track it. Cheers, John On Wed, Dec 17, 2014 at 1:31 PM, Lindsay Mathieson wrote: mount reports: "mount: error writing /etc/mtab: Invalid argument" fstab entry is: vnb.proxmox.softlog,vng.proxmox.softlog,vnt.proxmox.softlog:/ /mnt/test ceph _netdev,defaults,name=admin,secretfile=/etc/pve/priv/admin.secret 0 0 However the mount is successful and an mtab entry is made. debian wheezy, ceph 0.87 -- Lindsay ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
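A commonly suggested workaround for mtab write errors (an assumption here, not verified against this particular bug) is to make /etc/mtab a symlink to the kernel's own mount table, which newer distributions do by default; mount tools then skip writing mtab entirely:

    # Replace the writable /etc/mtab with a symlink to the kernel mount table
    ln -sf /proc/self/mounts /etc/mtab

    # The fstab mount should then proceed without the mtab write error
    mount /mnt/test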
Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hi Christian, Thank you for the valuable info. As I will use this cluster mainly at home for my data and testing (backup in place), I will continue to use BTRFS. In production, I would go with XFS as recommended. ZFS - perhaps when this becomes officially supported. BTW, I fixed the HEALTH of my cluster: 1. I set "ceph osd pool set rbd size 2" 2. I set "ceph osd pool set rbd pg_num 256" and "ceph osd pool set rbd pgp_num 256" 5 pgs remained stuck unclean (stuck unclean since forever, current state active, last acting). I fixed this by restarting the Ceph services ("service ceph -a restart"). I think the OSD restart fixed this. I guess there might be a more elegant solution, but I was not able to figure it out. Tried "pg repair" but that didn't do the trick. Anyway, it seems to be healthy now :). cephadmin@ceph1:~$ sudo ceph status cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788 health HEALTH_OK monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 10, quorum 0,1 ceph1,ceph2 osdmap e59: 4 osds: 4 up, 4 in pgmap v179: 256 pgs, 1 pools, 0 bytes data, 0 objects 16924 kB used, 11154 GB / 11158 GB avail 256 active+clean Thanks for the help! Jiri On 28/12/2014 16:59, Christian Balzer wrote: Hello Jiri, On Sun, 28 Dec 2014 16:14:04 +1100 Jiri Kanicky wrote: Hi Christian. Thank you for your comments again. Very helpful. I will try to fix the current pool and see how it goes. It's good to learn some troubleshooting skills. Indeed, knowing what to do when things break is where it's at. Regarding the BTRFS vs XFS, I am not sure if the documentation is old. My decision was based on this: http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/ It's dated for sure and a bit of wishful thinking on behalf of the Ceph developers. Who understandably didn't want to re-invent the wheel inside Ceph when the underlying file system could provide it (checksums, snapshots, etc). ZFS has all the features (and much better tested) BTRFS is aspiring to, and if kept below 80% utilization doesn't fragment itself to death. And at the end of that page they mention deduplication, which of course (as I wrote recently in the "use ZFS for OSDs" thread) is unlikely to do anything worthwhile at all. Simply put, some things _need_ to be done in Ceph to work properly and can't be delegated to the underlying FS or other storage backend. Christian Note We currently recommend XFS for production deployments. We recommend btrfs for testing, development, and any non-critical deployments. *We believe that btrfs has the correct feature set and roadmap to serve Ceph in the long-term*, but XFS and ext4 provide the necessary stability for today's deployments. btrfs development is proceeding rapidly: users should be comfortable installing the latest released upstream kernels and be able to track development activity for critical bug fixes. Thanks Jiri On 28/12/2014 16:01, Christian Balzer wrote: Hello, On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote: Hi Christian. Thank you for your suggestions. I will set the "osd pool default size" to 2 as you recommended. As mentioned, the documentation is talking about OSDs, not nodes, so that must have confused me. Note that changing this will only affect new pools of course. So to sort out your current state either start over with this value set before creating/starting anything or reduce the current size (ceph osd pool set <pool> size <n>). Have a look at the crushmap example or even better your own, current one and you will see where by default the host is the failure domain.
Which of course makes a lot of sense. Regarding BTRFS, I thought that btrfs is the better option for the future, providing more features. I know that XFS might be more stable, but again my impression was that btrfs is the focus for future development. Is that correct? I'm not a developer, but if you scour the ML archives you will find a number of threads about BTRFS (and ZFS). The biggest issues with BTRFS are not just stability but also the fact that it degrades rather quickly (fragmentation) due to its COW nature and less smarts than ZFS in that area. So development on the Ceph side is not the issue per se. IMHO BTRFS looks more and more stillborn, and with regard to Ceph, ZFS might become the better choice (in the future), with KV store backends being an alternative for some use cases (also far from production ready at this time). Regards, Christian You are right with the round up. I forgot about that. Thanks for your help. Much appreciated. Jiri - Reply message - From: "Christian Balzer" To: Cc: "Jiri Kanicky" Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; Date: Sun, Dec 28, 2014 03:29 Hello, On Sun, 28 Dec 2014 01:52:
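For completeness, a sketch of the two routes discussed above, fixing the existing pool versus fixing the default for future pools (pool name rbd taken from this thread):

    # Route 1: reduce replication on the existing pool to match 2 hosts
    ceph osd pool set rbd size 2
    ceph osd pool set rbd min_size 1

    # Route 2: make pools created from now on default to 2 copies
    # (ceph.conf, [global] section):
    #   osd_pool_default_size = 2
    #   osd_pool_default_min_size = 1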
Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hi Christian. Thank you for your comments again. Very helpful. I will try to fix the current pool and see how it goes. It's good to learn some troubleshooting skills. Regarding the BTRFS vs XFS, I am not sure if the documentation is old. My decision was based on this: http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/ Note We currently recommend XFS for production deployments. We recommend btrfs for testing, development, and any non-critical deployments. *We believe that btrfs has the correct feature set and roadmap to serve Ceph in the long-term*, but XFS and ext4 provide the necessary stability for today's deployments. btrfs development is proceeding rapidly: users should be comfortable installing the latest released upstream kernels and be able to track development activity for critical bug fixes. Thanks Jiri On 28/12/2014 16:01, Christian Balzer wrote: Hello, On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote: Hi Christian. Thank you for your suggestions. I will set the "osd pool default size" to 2 as you recommended. As mentioned, the documentation is talking about OSDs, not nodes, so that must have confused me. Note that changing this will only affect new pools of course. So to sort out your current state either start over with this value set before creating/starting anything or reduce the current size (ceph osd pool set <pool> size <n>). Have a look at the crushmap example or even better your own, current one and you will see where by default the host is the failure domain. Which of course makes a lot of sense. Regarding BTRFS, I thought that btrfs is the better option for the future, providing more features. I know that XFS might be more stable, but again my impression was that btrfs is the focus for future development. Is that correct? I'm not a developer, but if you scour the ML archives you will find a number of threads about BTRFS (and ZFS). The biggest issues with BTRFS are not just stability but also the fact that it degrades rather quickly (fragmentation) due to its COW nature and less smarts than ZFS in that area. So development on the Ceph side is not the issue per se. IMHO BTRFS looks more and more stillborn, and with regard to Ceph, ZFS might become the better choice (in the future), with KV store backends being an alternative for some use cases (also far from production ready at this time). Regards, Christian You are right with the round up. I forgot about that. Thanks for your help. Much appreciated. Jiri - Reply message - From: "Christian Balzer" To: Cc: "Jiri Kanicky" Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; Date: Sun, Dec 28, 2014 03:29 Hello, On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote: Hi, I just built my Ceph cluster but am having problems with the health of the cluster. You're not telling us the version, but it's clearly 0.87 or beyond. Here are a few details: - I followed the ceph documentation. Outdated, unfortunately. - I used the btrfs filesystem for all OSDs Big mistake number 1, do some research (google, ML archives). Though not related to your problems. - I did not set "osd pool default size = 2" as I thought that if I have 2 nodes + 4 OSDs, I can leave the default=3. I am not sure if this was right. Big mistake, assumption number 2: replication size by the default CRUSH rule is determined by hosts. So that's your main issue here. Either set it to 2 or use 3 hosts. - I noticed that the default pools "data,metadata" were not created. Only the "rbd" pool was created.
See outdated docs above. The majority of use cases are with RBD, so since Giant the cephfs pools are not created by default. - As it was complaining that the pg_num was too low, I increased the pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num 133 > pgp_num 64". Re-read the (in this case correct) documentation. It clearly states to round up to the nearest power of 2, in your case 256. Regards, Christian Would you give me a hint where I have made the mistake? (I can remove the OSDs and start over if needed.) cephadmin@ceph1:/etc/ceph$ sudo ceph health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 > pgp_num 64 cephadmin@ceph1:/etc/ceph$ sudo ceph status cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788 health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 > pgp_num 64 monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 8, quorum 0,1 ceph1,ceph2 osdmap e42: 4 osds: 4 up, 4 in pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects 11704 kB used, 11154 GB / 1115
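The arithmetic behind Christian's advice: the usual guideline is (number of OSDs x 100) / replica count, so (4 x 100) / 3 ≈ 133, which then rounds up to the next power of two, 256. A sketch of applying it (pgp_num has to be raised to match pg_num before data actually rebalances):

    # Round the PG count up to a power of 2 and keep pgp_num in step
    ceph osd pool set rbd pg_num 256
    ceph osd pool set rbd pgp_num 256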
[ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hi, I just built my Ceph cluster but am having problems with the health of the cluster. Here are a few details: - I followed the ceph documentation. - I used the btrfs filesystem for all OSDs - I did not set "osd pool default size = 2" as I thought that if I have 2 nodes + 4 OSDs, I can leave the default=3. I am not sure if this was right. - I noticed that the default pools "data,metadata" were not created. Only the "rbd" pool was created. - As it was complaining that the pg_num was too low, I increased the pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num 133 > pgp_num 64". Would you give me a hint where I have made the mistake? (I can remove the OSDs and start over if needed.) cephadmin@ceph1:/etc/ceph$ sudo ceph health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 > pgp_num 64 cephadmin@ceph1:/etc/ceph$ sudo ceph status cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788 health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 > pgp_num 64 monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 8, quorum 0,1 ceph1,ceph2 osdmap e42: 4 osds: 4 up, 4 in pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects 11704 kB used, 11154 GB / 11158 GB avail 29 active+undersized+degraded 104 active+remapped cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree # id weight type name up/down reweight -1 10.88 root default -2 5.44 host ceph1 0 2.72 osd.0 up 1 1 2.72 osd.1 up 1 -3 5.44 host ceph2 2 2.72 osd.2 up 1 3 2.72 osd.3 up 1 cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools 0 rbd, cephadmin@ceph1:/etc/ceph$ cat ceph.conf [global] fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788 public_network = 192.168.30.0/24 cluster_network = 10.1.1.0/24 mon_initial_members = ceph1, ceph2 mon_host = 192.168.30.21,192.168.30.22 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true Thank you Jiri ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
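The host-level failure domain Christian points to earlier in the thread can be confirmed by decompiling the CRUSH map; a minimal sketch (crushtool ships with Ceph):

    # Dump and decompile the CRUSH map to inspect the default rule
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # "step chooseleaf firstn 0 type host" means each replica goes to a distinct
    # host, so a size-3 pool can never place its third copy on only 2 hosts
    grep chooseleaf crushmap.txt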
[ceph-users] do I have to use sudo for CEPH install
Hi. Do I have to install sudo in Debian Wheezy to deploy Ceph successfully? I don't normally use sudo. Thank you Jiri ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
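ceph-deploy generally expects passwordless sudo on the target nodes when run as a non-root user, so yes. A minimal sketch of setting it up on Wheezy (assuming a deploy user named cephadmin, as used elsewhere in these threads):

    # As root: install sudo and grant the deploy user passwordless access
    apt-get install sudo
    echo "cephadmin ALL = (root) NOPASSWD:ALL" > /etc/sudoers.d/cephadmin
    chmod 0440 /etc/sudoers.d/cephadmin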