[ceph-users] Remove RBD Image
Hi all, I am trying to remove several rbd images from the cluster. Unfortunately, that doesn't work: $ rbd info foo rbd image 'foo': size 1024 GB in 262144 objects order 22 (4096 kB objects) block_name_prefix: rb.0.919443.238e1f29 format: 1 $ rbd rm foo 2015-07-29 10:25:01.438296 7f868d330760 -1 librbd: image has watchers - not removing Removing image: 0% complete...failed. rbd: error: image still has watchers This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout. $ rados -p rbd listwatchers foo error listing watchers rbd/foo: (2) No such file or directory Well, that is quite frustrating. The image was mapped on one host, where I was unmapping it. What do I have to do to get rid of it? We are using ceph version 0.87.2 Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Remove RBD Image
Hi Ilya, that worked for me and actually pointed out that one of my collegues currently had the rbd pool locally mounted via fuse-rbd, which obviously locks all images in this pool. Problem solved! Thanks! Regards, Christian Am 29.07.2015 um 11:48 schrieb Ilya Dryomov: On Wed, Jul 29, 2015 at 11:30 AM, Christian Eichelmann christian.eichelm...@1und1.de wrote: Hi all, I am trying to remove several rbd images from the cluster. Unfortunately, that doesn't work: $ rbd info foo rbd image 'foo': size 1024 GB in 262144 objects order 22 (4096 kB objects) block_name_prefix: rb.0.919443.238e1f29 format: 1 $ rbd rm foo 2015-07-29 10:25:01.438296 7f868d330760 -1 librbd: image has watchers - not removing Removing image: 0% complete...failed. rbd: error: image still has watchers This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout. $ rados -p rbd listwatchers foo error listing watchers rbd/foo: (2) No such file or directory For a format 1 image, you need to do $ rados -p rbd listwatchers foo.rbd rbd status command was recently introduced to abstract this, but it's not in 0.87. Well, that is quite frustrating. The image was mapped on one host, where I was unmapping it. What do I have to do to get rid of it? Did you unmap the image? What is the output of rbd showmapped on the host you had it mapped? Is there anything rbd or ceph related in dmesg on that host? Thanks, Ilya -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
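For reference, a rough sketch of the watcher checks discussed above. The image name 'foo' and the client address are placeholders, and rbd status is only available on releases newer than 0.87:

$ rados -p rbd listwatchers foo.rbd            # format 1: the header object is <name>.rbd
$ rados -p rbd get rbd_id.foo - | strings      # format 2: look up the internal image id first
$ rados -p rbd listwatchers rbd_header.<id>    # format 2: the header object is rbd_header.<id>
$ rbd status foo                               # newer releases wrap the above
# If a crashed or unreachable client still holds the watch, blacklisting its
# address (as reported by listwatchers) forces the watch to expire:
$ ceph osd blacklist add <client-addr>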
Re: [ceph-users] Scrub Error / How does ceph pg repair work?
Hi Christian, Hi Robert, thank you for your replies! I was already expecting something like this. But I am seriously worried about that! Just assume that this is happening at night. Our shift has not necessarily enough knowledge to perform all the steps in Sebasien's article. And if we always have to do that when a scrub error appears, we are putting several hours per week into fixing such problems. It is also very misleading that a command called ceph pg repair might do quite the opposit and overwrite the good data in your cluster with corrupt one. I don't know much about the interna of ceph, but if the cluster can already recognize that checksums are not the same, why can't he just build a quorum from the existing replicas if possible? And again the question: Are these placementgroups (scrub error, inconsistent) blocking on read/write requests? Because if yes, we have a serious problem here... Regards, Christian Am 12.05.2015 um 08:20 schrieb Christian Balzer: Hello, I can only nod emphatically to what Robert said, don't issue repairs unless you a) don't care about the data or b) have verified that your primary OSD is good. See this for some details on how establish which replica(s) are actually good or not: http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/ Of course if you somehow wind up with more subtle data corruption and are faced with 3 slightly differing data sets, you may have have to resort to rolling a dice after all. A word from the devs about the state of checksums and automatic repairs we can trust would be appreciated. Christian On Mon, 11 May 2015 10:19:08 -0600 Robert LeBlanc wrote: Personally I would not just run this command automatically because as you stated, it only copies the primary PGs to the replicas and if the primary is corrupt, you will corrupt your secondaries.I think the monitor log shows which OSD has the problem so if it is not your primary, then just issue the repair command. There was talk, and I believe work towards, Ceph storing a hash of the object so that it can be smarter about which replica has the correct data and automatically replicate the good data no matter where it is. I think the first part, creating the hash and storing it, has been included in Hammer. I'm not an authority on this so take it with a grain of salt. Right now our procedure is to find the PG files on the OSDs, perform a MD5 on all of them and the one that doesn't match, overwrite, either by issuing the PG repair command, or removing the bad PG files, rsyncing them with the -X argument and then instructing a deep-scrub on the PG to clear it up in Ceph. I've only tested this on an idle cluster, so I don't know how well it will work on an active cluster. Since we issue a deep-scrub, if the PGs of the replicas change during the rsync, it should come up with an error. The idea is to keep rsyncing until the deep-scrub is clean. Be warned that you may be aiming your gun at your foot with this! Robert LeBlanc GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, May 11, 2015 at 2:09 AM, Christian Eichelmann christian.eichelm...@1und1.de wrote: Hi all! We are experiencing approximately 1 scrub error / inconsistent pg every two days. As far as I know, to fix this you can issue a ceph pg repair, which works fine for us. I have a few qestions regarding the behavior of the ceph cluster in such a case: 1. After ceph detects the scrub error, the pg is marked as inconsistent. Does that mean that any IO to this pg is blocked until it is repaired? 2. 
Is this amount of scrub errors normal? We currently have only 150TB in our cluster, distributed over 720 2TB disks. 3. As far as I know, a ceph pg repair just copies the content of the primary pg to all replicas. Is this still the case? What if the primary copy is the one having errors? We have a 4x replication level and it would be cool if ceph would use one of the pg for recovery which has the same checksum as the majority of pgs. 4. Some of this errors are happening at night. Since ceph reports this as a critical error, our shift is called and wake up, just to issue a single command. Do you see any problems in triggering this command automatically via monitoring event? Is there a reason why ceph isn't resolving these errors itself when it has enought replicas to do so? Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr
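For what it's worth, a condensed sketch of the manual check described above, along the lines of Sebastien Han's article. The PG id, OSD ids and object name are examples, and the paths assume default filestore locations under /var/lib/ceph:

$ ceph health detail | grep inconsistent          # find the affected PG, e.g. 17.1c1
$ ceph pg map 17.1c1                              # shows the acting set, e.g. [24,3,61,110]
$ grep 17.1c1 /var/log/ceph/ceph-osd.24.log       # the primary's log names the broken object
# on each OSD host in the acting set, checksum the object's file and compare:
$ find /var/lib/ceph/osd/ceph-24/current/17.1c1_head/ -name '*<objectname>*' -exec md5sum {} \;
# only once the primary's copy is known to be good (or the bad copy was moved away):
$ ceph pg repair 17.1c1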
[ceph-users] Scrub Error / How does ceph pg repair work?
Hi all! We are experiencing approximately 1 scrub error / inconsistent pg every two days. As far as I know, to fix this you can issue a ceph pg repair, which works fine for us. I have a few questions regarding the behavior of the ceph cluster in such a case: 1. After ceph detects the scrub error, the pg is marked as inconsistent. Does that mean that any IO to this pg is blocked until it is repaired? 2. Is this amount of scrub errors normal? We currently have only 150TB in our cluster, distributed over 720 2TB disks. 3. As far as I know, a ceph pg repair just copies the content of the primary pg to all replicas. Is this still the case? What if the primary copy is the one having errors? We have a 4x replication level and it would be cool if ceph would use one of the copies for recovery which has the same checksum as the majority of copies. 4. Some of these errors are happening at night. Since ceph reports this as a critical error, our shift is called and woken up, just to issue a single command. Do you see any problems in triggering this command automatically via a monitoring event? Is there a reason why ceph isn't resolving these errors itself when it has enough replicas to do so? Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC
Hi Dan, we are alreay back on the kernel module since the same problems were happening with fuse. I had no special ulimit settings for the fuse-process, so that could have been an issue there. I was pasting you the kernel messages during such incidents here: http://pastebin.com/X5JRe1v3 I was never debugging the kernel client. Can you give me a short hint how to increase the debug level and where the logs will be written to? Regards, Christian Am 20.04.2015 um 15:50 schrieb Dan van der Ster: Hi, This is similar to what you would observe if you hit the ulimit on open files/sockets in a Ceph client. Though that normally only affects clients in user mode, not the kernel. What are the ulimits of your rbd-fuse client? Also, you could increase the client logging debug levels to see why the client is hanging. When the kernel rbd client was hanging, was there anything printed to dmesg ? Cheers, Dan On Mon, Apr 20, 2015 at 9:29 AM, Christian Eichelmann christian.eichelm...@1und1.de wrote: Hi Ceph-Users! We currently have a problem where I am not sure if the it has it's cause in Ceph or something else. First, some information about our ceph-setup: * ceph version 0.87.1 * 5 MON * 12 OSD with 60x2TB each * 2 RSYNC Gateways with 2x10G Ethernet (Kernel: 3.16.3-2~bpo70+1, Debian Wheezy) Our cluster is mainly used to store Log-Files from numerous servers via RSync and make them available via RSync as well. Since about two weeks we have a very strange behaviour and our RSync Gateways (they just map several rbd devices and export them via rsyncd): The IO Wait on the systems are increasing untill some of the cores getting stuck with an IO Wait of 100%. RSync processes become zombies (defunct) and/or can not be killed even with SIGKILL. After the system has reached a load of about 1400, it becomes totally unresponsive and the only way to fix the problem is to reboot the system. I was trying to manually reproduce the problem by simultainously reading and writing from several machine, but the problem didn't appear. I have no idea where the error can be. I was doing a ceph tell osd.* bench during the problem and all osds where having normal benchmark results. Has anyone an idea how this can happen? If you need any more informations, please let me know. Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
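Regarding the ulimit question raised above: the effective limits and open-descriptor count of a running rbd-fuse process can be checked like this (the PID is whatever pidof returns; the last line is only a sketch of raising the limit for a fresh instance, adjust to your own invocation):

$ pidof rbd-fuse
$ grep 'open files' /proc/<pid>/limits     # effective soft/hard nofile limits
$ ls /proc/<pid>/fd | wc -l                # descriptors currently open
$ ulimit -n 65536 && rbd-fuse -p rbd /mnt/rbdfuse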
Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC
Hi Onur, actual 50, ideal 330128, fragmentation factor 0.97% so fragmentation is not an issue here. Regards, Christian Am 20.04.2015 um 16:41 schrieb Onur BEKTAS: Hi, Check xfs fregmentation factor for rbd disks i.e. xfs_db -c frag -r /dev/sdX if it is high, try defrag xfs_fsr /dev/sdX Regards, Onur. On 4/20/2015 4:41 PM, Nick Fisk wrote: If possible, it might be worth trying an EXT4 formatted RBD. I've had problems with XFS hanging in the past on simple LVM volumes and never really got to the bottom of it, whereas the same volumes formatted with EXT4 has been running for years without a problem. -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Eichelmann Sent: 20 April 2015 14:41 To: Nick Fisk; ceph-users@lists.ceph.com Subject: Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC I'm using xfs on the rbd disks. They are between 1 and 10TB in size. Am 20.04.2015 um 14:32 schrieb Nick Fisk: Ah ok, good point What FS are you using on the RBD? -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Eichelmann Sent: 20 April 2015 13:16 To: Nick Fisk; ceph-users@lists.ceph.com Subject: Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC Hi Nick, I forgot to mention that I was also trying a workaround using the userland (rbd-fuse). The behaviour was exactly the same (worked fine for several hours, testing parallel reading and writing, then IO Wait and system load increased). This is why I don't think it is an issue with the rbd kernel module. Regards, Christian Am 20.04.2015 um 11:37 schrieb Nick Fisk: Hi Christian, A very non-technical answer but as the problem seems related to the RBD client it might be worth trying the latest Kernel if possible. The RBD client is Kernel based and so there may be a fix which might stop this from happening. Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Eichelmann Sent: 20 April 2015 08:29 To: ceph-users@lists.ceph.com Subject: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC Hi Ceph-Users! We currently have a problem where I am not sure if the it has it's cause in Ceph or something else. First, some information about our ceph-setup: * ceph version 0.87.1 * 5 MON * 12 OSD with 60x2TB each * 2 RSYNC Gateways with 2x10G Ethernet (Kernel: 3.16.3-2~bpo70+1, Debian Wheezy) Our cluster is mainly used to store Log-Files from numerous servers via RSync and make them available via RSync as well. Since about two weeks we have a very strange behaviour and our RSync Gateways (they just map several rbd devices and export them via rsyncd): The IO Wait on the systems are increasing untill some of the cores getting stuck with an IO Wait of 100%. RSync processes become zombies (defunct) and/or can not be killed even with SIGKILL. After the system has reached a load of about 1400, it becomes totally unresponsive and the only way to fix the problem is to reboot the system. I was trying to manually reproduce the problem by simultainously reading and writing from several machine, but the problem didn't appear. I have no idea where the error can be. I was doing a ceph tell osd.* bench during the problem and all osds where having normal benchmark results. Has anyone an idea how this can happen? If you need any more informations, please let me know. 
Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list
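If anyone wants to try Nick's EXT4 suggestion, a sketch on a scratch image; the image name, size and device number are examples only (size is given in MB on 0.87):

$ rbd create rbd/ext4test --size 102400    # 100 GB scratch image
$ rbd map rbd/ext4test
$ rbd showmapped                           # find the /dev/rbdX device just mapped
$ mkfs.ext4 /dev/rbd2
$ mount /dev/rbd2 /mnt/ext4test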
Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC
Hi Dan, nope, we have no iptables rules on those hosts and the gateway is on the same subnet as the ceph cluster. I will see if I can find some informations on how to debug the rbd kernel module (any suggestions are appreciated :)) Regards, Christian Am 21.04.2015 um 10:20 schrieb Dan van der Ster: Hi Christian, I've never debugged the kernel client either, so I don't know how to increase debugging. (I don't see any useful parms on the kernel modules). Your log looks like the client just stops communicating with the ceph cluster. Is iptables getting in the way ? Cheers, Dan On Tue, Apr 21, 2015 at 9:13 AM, Christian Eichelmann christian.eichelm...@1und1.de wrote: Hi Dan, we are alreay back on the kernel module since the same problems were happening with fuse. I had no special ulimit settings for the fuse-process, so that could have been an issue there. I was pasting you the kernel messages during such incidents here: http://pastebin.com/X5JRe1v3 I was never debugging the kernel client. Can you give me a short hint how to increase the debug level and where the logs will be written to? Regards, Christian Am 20.04.2015 um 15:50 schrieb Dan van der Ster: Hi, This is similar to what you would observe if you hit the ulimit on open files/sockets in a Ceph client. Though that normally only affects clients in user mode, not the kernel. What are the ulimits of your rbd-fuse client? Also, you could increase the client logging debug levels to see why the client is hanging. When the kernel rbd client was hanging, was there anything printed to dmesg ? Cheers, Dan On Mon, Apr 20, 2015 at 9:29 AM, Christian Eichelmann christian.eichelm...@1und1.de wrote: Hi Ceph-Users! We currently have a problem where I am not sure if the it has it's cause in Ceph or something else. First, some information about our ceph-setup: * ceph version 0.87.1 * 5 MON * 12 OSD with 60x2TB each * 2 RSYNC Gateways with 2x10G Ethernet (Kernel: 3.16.3-2~bpo70+1, Debian Wheezy) Our cluster is mainly used to store Log-Files from numerous servers via RSync and make them available via RSync as well. Since about two weeks we have a very strange behaviour and our RSync Gateways (they just map several rbd devices and export them via rsyncd): The IO Wait on the systems are increasing untill some of the cores getting stuck with an IO Wait of 100%. RSync processes become zombies (defunct) and/or can not be killed even with SIGKILL. After the system has reached a load of about 1400, it becomes totally unresponsive and the only way to fix the problem is to reboot the system. I was trying to manually reproduce the problem by simultainously reading and writing from several machine, but the problem didn't appear. I have no idea where the error can be. I was doing a ceph tell osd.* bench during the problem and all osds where having normal benchmark results. Has anyone an idea how this can happen? If you need any more informations, please let me know. Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. 
Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
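In case it helps others hitting the same hang, a sketch of the kernel-side debugging possible here, assuming the kernel was built with dynamic debug support; all output lands in dmesg / the kernel log and the debug output is very verbose:

# dump stack traces of all blocked (D-state) tasks:
$ echo 1 > /proc/sys/kernel/sysrq
$ echo w > /proc/sysrq-trigger
# enable debug output of the rbd/libceph kernel modules:
$ mount -t debugfs none /sys/kernel/debug    # if not already mounted
$ echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
$ echo 'module rbd +p' > /sys/kernel/debug/dynamic_debug/control
# switch it off again with '-p' once done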
Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC
I'm using xfs on the rbd disks. They are between 1 and 10TB in size. Am 20.04.2015 um 14:32 schrieb Nick Fisk: Ah ok, good point What FS are you using on the RBD? -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Eichelmann Sent: 20 April 2015 13:16 To: Nick Fisk; ceph-users@lists.ceph.com Subject: Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC Hi Nick, I forgot to mention that I was also trying a workaround using the userland (rbd-fuse). The behaviour was exactly the same (worked fine for several hours, testing parallel reading and writing, then IO Wait and system load increased). This is why I don't think it is an issue with the rbd kernel module. Regards, Christian Am 20.04.2015 um 11:37 schrieb Nick Fisk: Hi Christian, A very non-technical answer but as the problem seems related to the RBD client it might be worth trying the latest Kernel if possible. The RBD client is Kernel based and so there may be a fix which might stop this from happening. Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Eichelmann Sent: 20 April 2015 08:29 To: ceph-users@lists.ceph.com Subject: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC Hi Ceph-Users! We currently have a problem where I am not sure if the it has it's cause in Ceph or something else. First, some information about our ceph-setup: * ceph version 0.87.1 * 5 MON * 12 OSD with 60x2TB each * 2 RSYNC Gateways with 2x10G Ethernet (Kernel: 3.16.3-2~bpo70+1, Debian Wheezy) Our cluster is mainly used to store Log-Files from numerous servers via RSync and make them available via RSync as well. Since about two weeks we have a very strange behaviour and our RSync Gateways (they just map several rbd devices and export them via rsyncd): The IO Wait on the systems are increasing untill some of the cores getting stuck with an IO Wait of 100%. RSync processes become zombies (defunct) and/or can not be killed even with SIGKILL. After the system has reached a load of about 1400, it becomes totally unresponsive and the only way to fix the problem is to reboot the system. I was trying to manually reproduce the problem by simultainously reading and writing from several machine, but the problem didn't appear. I have no idea where the error can be. I was doing a ceph tell osd.* bench during the problem and all osds where having normal benchmark results. Has anyone an idea how this can happen? If you need any more informations, please let me know. Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. 
Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC
Hi Nick, I forgot to mention that I was also trying a workaround using the userland (rbd-fuse). The behaviour was exactly the same (worked fine for several hours, testing parallel reading and writing, then IO Wait and system load increased). This is why I don't think it is an issue with the rbd kernel module. Regards, Christian Am 20.04.2015 um 11:37 schrieb Nick Fisk: Hi Christian, A very non-technical answer but as the problem seems related to the RBD client it might be worth trying the latest Kernel if possible. The RBD client is Kernel based and so there may be a fix which might stop this from happening. Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Eichelmann Sent: 20 April 2015 08:29 To: ceph-users@lists.ceph.com Subject: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC Hi Ceph-Users! We currently have a problem where I am not sure if the it has it's cause in Ceph or something else. First, some information about our ceph-setup: * ceph version 0.87.1 * 5 MON * 12 OSD with 60x2TB each * 2 RSYNC Gateways with 2x10G Ethernet (Kernel: 3.16.3-2~bpo70+1, Debian Wheezy) Our cluster is mainly used to store Log-Files from numerous servers via RSync and make them available via RSync as well. Since about two weeks we have a very strange behaviour and our RSync Gateways (they just map several rbd devices and export them via rsyncd): The IO Wait on the systems are increasing untill some of the cores getting stuck with an IO Wait of 100%. RSync processes become zombies (defunct) and/or can not be killed even with SIGKILL. After the system has reached a load of about 1400, it becomes totally unresponsive and the only way to fix the problem is to reboot the system. I was trying to manually reproduce the problem by simultainously reading and writing from several machine, but the problem didn't appear. I have no idea where the error can be. I was doing a ceph tell osd.* bench during the problem and all osds where having normal benchmark results. Has anyone an idea how this can happen? If you need any more informations, please let me know. Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] 100% IO Wait with CEPH RBD and RSYNC
Hi Ceph-Users! We currently have a problem where I am not sure whether its cause lies in Ceph or somewhere else. First, some information about our ceph setup: * ceph version 0.87.1 * 5 MON * 12 OSD servers with 60x2TB disks each * 2 RSYNC gateways with 2x10G Ethernet (Kernel: 3.16.3-2~bpo70+1, Debian Wheezy) Our cluster is mainly used to store log files from numerous servers via RSync and make them available via RSync as well. For about two weeks we have seen very strange behaviour on our RSync gateways (they just map several rbd devices and export them via rsyncd): The IO wait on the systems increases until some of the cores get stuck with an IO wait of 100%. RSync processes become zombies (defunct) and/or cannot be killed even with SIGKILL. After the system has reached a load of about 1400, it becomes totally unresponsive and the only way to fix the problem is to reboot the system. I was trying to manually reproduce the problem by simultaneously reading and writing from several machines, but the problem didn't appear. I have no idea where the error can be. I was doing a ceph tell osd.* bench during the problem and all OSDs were showing normal benchmark results. Does anyone have an idea how this can happen? If you need any more information, please let me know. Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread
Hi Sage, we hit this problem a few monthes ago as well and it took us quite a while to figure out what's wrong. As a Systemadministrator I don't like the idea that daemons or even init scripts are changing system wide configuration parameters, so I wouldn't like to see the OSDs do it themself. I've noticed that building ceph on high density hardware is a totally different thing with totally different problems and solutions than with common hardware. I would like to see a special section in the documentation regarding problems with that kind of hardware and ceph clusters at a larger scale. So I vote for the documentation. Sysctls are something I want to set for myself. The idea with the warning is on one hand a good hint, on the other hand it also may confuse people, since changing this setting is not required for common hardware. Regards, Christian On 03/09/2015 08:01 PM, Sage Weil wrote: On Mon, 9 Mar 2015, Karan Singh wrote: Thanks Guys kernel.pid_max=4194303 did the trick. Great to hear! Sorry we missed that you only had it at 65536. This is a really common problem that people hit when their clusters start to grow. Is there somewhere in the docs we can put this to catch more users? Or maybe a warning issued by the osds themselves or something if they see limits that are low? sage - Karan - On 09 Mar 2015, at 14:48, Christian Eichelmann christian.eichelm...@1und1.de wrote: Hi Karan, as you are actually writing in your own book, the problem is the sysctl setting kernel.pid_max. I've seen in your bug report that you were setting it to 65536, which is still to low for high density hardware. In our cluster, one OSD server has in an idle situation about 66.000 Threads (60 OSDs per Server). The number of threads increases when you increase the number of placement groups in the cluster, which I think has triggered your problem. Set the kernel.pid_max setting to 4194303 (the maximum) like Azad Aliyar suggested, and the problem should be gone. Regards, Christian Am 09.03.2015 11:41, schrieb Karan Singh: Hello Community need help to fix a long going Ceph problem. Cluster is unhealthy , Multiple OSDs are DOWN. When i am trying to restart OSD?s i am getting this error /2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc http://Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970/ /common/Thread.cc http://Thread.cc: 129: FAILED assert(ret == 0)/ *Environment *: 4 Nodes , OSD+Monitor , Firefly latest , CentOS6.5 , 3.17.2-1.el6.elrepo.x86_64 Tried upgrading from 0.80.7 to 0.80.8 but no Luck Tried centOS stock kernel 2.6.32 but no Luck Memory is not a problem more then 150+GB is free Did any one every faced this problem ?? 
*Cluster status * * * / cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33/ / health HEALTH_WARN 7334 pgs degraded; 1185 pgs down; 1 pgs incomplete; 1735 pgs peering; 8938 pgs stale; 1/ /736 pgs stuck inactive; 8938 pgs stuck stale; 10320 pgs stuck unclean; recovery 6061/31080 objects degraded (19/ /.501%); 111/196 in osds are down; clock skew detected on mon.pouta-s02, mon.pouta-s03/ / monmap e3: 3 mons at {pouta-s01=10.XXX.50.1:6789/0,pouta-s02=10.XXX.50.2:6789/0,pouta-s03=10.XXX .50.3:6789/ //0}, election epoch 1312, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03/ / * osdmap e26633: 239 osds: 85 up, 196 in*/ / pgmap v60389: 17408 pgs, 13 pools, 42345 MB data, 10360 objects/ /4699 GB used, 707 TB / 711 TB avail/ /6061/31080 objects degraded (19.501%)/ / 14 down+remapped+peering/ / 39 active/ /3289 active+clean/ / 547 peering/ / 663 stale+down+peering/ / 705 stale+active+remapped/ / 1 active+degraded+remapped/ / 1 stale+down+incomplete/ / 484 down+peering/ / 455 active+remapped/ /3696 stale+active+degraded/ / 4 remapped+peering/ / 23 stale+down+remapped+peering/ / 51 stale+active/ /3637 active+degraded/ /3799 stale+active+clean/ *OSD : Logs * /2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc http://Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970/ /common/Thread.cc http://Thread.cc: 129: FAILED assert(ret == 0)/ / / / ceph version 0.80.8
Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread
Hi Karan, as you are actually writing in your own book, the problem is the sysctl setting kernel.pid_max. I've seen in your bug report that you were setting it to 65536, which is still to low for high density hardware. In our cluster, one OSD server has in an idle situation about 66.000 Threads (60 OSDs per Server). The number of threads increases when you increase the number of placement groups in the cluster, which I think has triggered your problem. Set the kernel.pid_max setting to 4194303 (the maximum) like Azad Aliyar suggested, and the problem should be gone. Regards, Christian Am 09.03.2015 11:41, schrieb Karan Singh: Hello Community need help to fix a long going Ceph problem. Cluster is unhealthy , Multiple OSDs are DOWN. When i am trying to restart OSD’s i am getting this error /2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc http://Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970/ /common/Thread.cc http://Thread.cc: 129: FAILED assert(ret == 0)/ *Environment *: 4 Nodes , OSD+Monitor , Firefly latest , CentOS6.5 , 3.17.2-1.el6.elrepo.x86_64 Tried upgrading from 0.80.7 to 0.80.8 but no Luck Tried centOS stock kernel 2.6.32 but no Luck Memory is not a problem more then 150+GB is free Did any one every faced this problem ?? *Cluster status * * * / cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33/ / health HEALTH_WARN 7334 pgs degraded; 1185 pgs down; 1 pgs incomplete; 1735 pgs peering; 8938 pgs stale; 1/ /736 pgs stuck inactive; 8938 pgs stuck stale; 10320 pgs stuck unclean; recovery 6061/31080 objects degraded (19/ /.501%); 111/196 in osds are down; clock skew detected on mon.pouta-s02, mon.pouta-s03/ / monmap e3: 3 mons at {pouta-s01=10.XXX.50.1:6789/0,pouta-s02=10.XXX.50.2:6789/0,pouta-s03=10.XXX.50.3:6789/ //0}, election epoch 1312, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03/ / * osdmap e26633: 239 osds: 85 up, 196 in*/ / pgmap v60389: 17408 pgs, 13 pools, 42345 MB data, 10360 objects/ /4699 GB used, 707 TB / 711 TB avail/ /6061/31080 objects degraded (19.501%)/ / 14 down+remapped+peering/ / 39 active/ /3289 active+clean/ / 547 peering/ / 663 stale+down+peering/ / 705 stale+active+remapped/ / 1 active+degraded+remapped/ / 1 stale+down+incomplete/ / 484 down+peering/ / 455 active+remapped/ /3696 stale+active+degraded/ / 4 remapped+peering/ / 23 stale+down+remapped+peering/ / 51 stale+active/ /3637 active+degraded/ /3799 stale+active+clean/ *OSD : Logs * /2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc http://Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970/ /common/Thread.cc http://Thread.cc: 129: FAILED assert(ret == 0)/ / / / ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)/ / 1: (Thread::create(unsigned long)+0x8a) [0xaf41da]/ / 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]/ / 3: (Accepter::entry()+0x265) [0xb5c635]/ / 4: /lib64/libpthread.so.0() [0x3c8a6079d1]/ / 5: (clone()+0x6d) [0x3c8a2e89dd]/ / NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this./ *More information at Ceph Tracker Issue : *http://tracker.ceph.com/issues/10988#change-49018 Karan Singh Systems Specialist , Storage Platforms CSC - IT Center for Science, Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland mobile: +358 503 812758 tel. 
+358 9 4572001 fax +358 9 4572302 http://www.csc.fi/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
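For completeness, the fix as a short sketch; the value follows the advice above, and the sysctl.d file name is just an example:

$ sysctl -w kernel.pid_max=4194303                                        # takes effect immediately
$ echo 'kernel.pid_max = 4194303' > /etc/sysctl.d/90-ceph-pid-max.conf    # persists across reboots
$ ps -eLf | wc -l                                                         # rough count of threads currently on the host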
Re: [ceph-users] Monitor Restart triggers half of our OSDs marked down
Am 05.02.2015 10:10, schrieb Dan van der Ster: But then when I restarted the (peon) monitor: 2015-01-29 11:29:18.250750 mon.0 128.142.35.220:6789/0 10570 : [INF] pgmap v35847068: 24608 pgs: 1 active+clean+scrubbing+deep, 24602 active+clean, 5 active+clean+scrubbing; 125 T B data, 377 TB used, 2021 TB / 2399 TB avail; 193 MB/s rd, 238 MB/s wr, 7410 op/s 2015-01-29 11:29:28.844678 mon.3 128.142.39.77:6789/0 1 : [INF] mon.2 calling new monitor election 2015-01-29 11:29:33.846946 mon.2 128.142.36.229:6789/0 9 : [INF] mon.4 calling new monitor election 2015-01-29 11:29:33.847022 mon.4 128.142.39.144:6789/0 7 : [INF] mon.3 calling new monitor election 2015-01-29 11:29:33.847085 mon.1 128.142.36.227:6789/0 24 : [INF] mon.1 calling new monitor election 2015-01-29 11:29:33.853498 mon.3 128.142.39.77:6789/0 2 : [INF] mon.2 calling new monitor election 2015-01-29 11:29:33.895660 mon.0 128.142.35.220:6789/0 10860 : [INF] mon.0 calling new monitor election 2015-01-29 11:29:33.901335 mon.0 128.142.35.220:6789/0 10861 : [INF] mon.0@0 won leader election with quorum 0,1,2,3,4 2015-01-29 11:29:34.004028 mon.0 128.142.35.220:6789/0 10862 : [INF] monmap e5: 5 mons at {0=128.142.35.220:6789/0,1=128.142.36.227:6789/0,2=128.142.39.77:6789/0,3=128.142.39.144:6789/0,4=128.142.36.229:6789/0} 2015-01-29 11:29:34.005808 mon.0 128.142.35.220:6789/0 10863 : [INF] pgmap v35847069: 24608 pgs: 1 active+clean+scrubbing+deep, 24602 active+clean, 5 active+clean+scrubbing; 125 TB data, 377 TB used, 2021 TB / 2399 TB avail; 54507 kB/s rd, 85412 kB/s wr, 1967 op/s 2015-01-29 11:29:34.006111 mon.0 128.142.35.220:6789/0 10864 : [INF] mdsmap e157: 1/1/1 up {0=0=up:active} 2015-01-29 11:29:34.007165 mon.0 128.142.35.220:6789/0 10865 : [INF] osdmap e132055: 880 osds: 880 up, 880 in 2015-01-29 11:29:34.037367 mon.0 128.142.35.220:6789/0 11055 : [INF] osd.1202 128.142.23.104:6801/98353 failed (4 reports from 3 peers after 29.673699 = grace 28.948726) 2015-01-29 11:29:34.050478 mon.0 128.142.35.220:6789/0 11139 : [INF] osd.1164 128.142.23.102:6850/22486 failed (3 reports from 2 peers after 30.685537 = grace 28.946983) and then just after: 2015-01-29 11:29:35.210184 osd.1202 128.142.23.104:6801/98353 59 : [WRN] map e132056 wrongly marked me down 2015-01-29 11:29:35.441922 osd.1164 128.142.23.102:6850/22486 25 : [WRN] map e132056 wrongly marked me down The behaviour is exactly the same on our system, to it looks like the same issue. We are current running Giant by the way (0.87) plus many other OSDs like that. -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Monitor Restart triggers half of our OSDs marked down
Hi all, during some failover and configuration tests we are currently seeing a strange phenomenon: Restarting one of our monitors (5 in total) triggers about 300 of the following events: osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after 22.005858 >= grace 20.00) The OSDs come back up shortly after they have been marked down. What I don't understand is: how can a restart of a single monitor keep the OSDs from talking to each other, so that they mark each other down? FYI: We are currently using the following settings: mon osd adjust heartbeat grace = false mon osd min down reporters = 20 mon osd adjust down out interval = false Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
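For reference, the settings from the post above as they would appear in ceph.conf, plus the runtime equivalent (injectargs changes are lost again when the monitor restarts):

[mon]
    mon osd adjust heartbeat grace = false
    mon osd min down reporters = 20
    mon osd adjust down out interval = false

$ ceph tell mon.* injectargs '--mon_osd_min_down_reporters 20'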
Re: [ceph-users] Behaviour of Ceph while OSDs are down
Hi Samuel, Hi Gregory, we are using Giant (0.87). Sure, I was checking on this PGs. The strange thing was, that they reported a bad state (state: inactive), but looking at the recovery state, everything seems to be fine. That would point to the mentioned bug. Do you have a link to this bug, so I can have a look at it to confirm that we are having the same issues? Here is a pg_query (slightly older and with only 3x replication, so don't be confused): http://pastebin.com/fyC8Qepv Regards, Christian On 01/20/2015 10:57 PM, Samuel Just wrote: Version? -Sam On Tue, Jan 20, 2015 at 9:45 AM, Gregory Farnum g...@gregs42.com wrote: On Tue, Jan 20, 2015 at 2:40 AM, Christian Eichelmann christian.eichelm...@1und1.de wrote: Hi all, I want to understand what Ceph does if several OSDs are down. First of our, some words to our Setup: We have 5 Monitors and 12 OSD Server, each has 60x2TB Disks. These Servers are spread across 4 racks in our datacenter. Every rack holds 3 OSD Server. We have a replication factor of 4 and a crush rule applied that says step chooseleaf firstn 0 type rack. So, in my oppinion, every rack should hold a copy of all the data in our ceph cluster. Is that more or less correct? So, our cluster is in state health OK and I am rebooting one of our OSD servers. That means 60 of 720 OSDs are going down. Since this hardware takes quite some time to boot up, we are using mon osd down out subtree limit = host to avoid rebalancing when a whole server goes down. Ceph show this output of ceph -s while the OSDs are down: health HEALTH_WARN 7 pgs degraded; 1 pgs peering; 7 pgs stuck degraded; 1 pgs stuck inactive; 8 pgs stuck unclean; 7 pgs stuck und ersized; 7 pgs undersized; recovery 623/7420 objects degraded (8.396%); 60/720 in osds are down monmap e5: 5 mons at {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=1 0.76.28.9:6789/0}, election epoch 228, quorum 0,1,2,3,4 mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03 osdmap e60390: 720 osds: 660 up, 720 in pgmap v15427437: 67584 pgs, 2 pools, 7253 MB data, 1855 objects 3948 GB used, 1304 TB / 1308 TB avail 623/7420 objects degraded (8.396%) 45356 active+clean 1 peering 7 active+undersized+degraded The pgs that are degraded and undersized are not a problem, since this behaviour is expected. I am worried about the peering pg (it stays in this state until all osds are up again) since this would cause I/O to hang if I am not mistaken. After the host is back up and all OSDs are up and running again, I see this: health HEALTH_WARN 2 pgs stuck unclean monmap e5: 5 mons at {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0}, election epoch 228, quorum 0,1,2,3,4 mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03 osdmap e60461: 720 osds: 720 up, 720 in pgmap v15427555: 67584 pgs, 2 pools, 7253 MB data, 1855 objects 3972 GB used, 1304 TB / 1308 TB avail 2 inactive 67582 active+clean Without any interaction, it will stay in this state. I guess these two inactive pgs will also cause I/O to hang? 
Some more information: ceph health detail HEALTH_WARN 2 pgs stuck unclean pg 9.f765 is stuck unclean for 858.298811, current state inactive, last acting [91,362,484,553] pg 9.ea0f is stuck unclean for 963.441117, current state inactive, last acting [91,233,485,524] I was trying to give osd.91 a kick with ceph osd down 91 After the osd is back in the cluster: health HEALTH_WARN 3 pgs peering; 54 pgs stuck inactive; 57 pgs stuck unclean So even worse. I decided to take the osd out. The cluster goes back to HEALTH_OK. Bringing the OSD back in, the cluster does some rebalancing, ending with the cluster in an OK state again. That actually happens everytime when there are some OSDs going down. I don't understand why the cluster is not able to get back to a healthy state without admin interaction. In a setup with several hundred OSDs it is normal business that some of the go down from time to time. Are there any ideas why this is happening? Right now, we do not have many data in our cluster, so I can do some tests. Any suggestions would be appreciated. Have you done any digging into the state of the PGs reported as peering or inactive or whatever when this pops up? Running pg_query, looking at their calculated and acting sets, etc. I suspect it's more likely you're exposing a reporting bug with stale data, rather than actually stuck PGs, but it would take more information to check that out. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users
[ceph-users] Behaviour of Ceph while OSDs are down
Hi all, I want to understand what Ceph does if several OSDs are down. First of our, some words to our Setup: We have 5 Monitors and 12 OSD Server, each has 60x2TB Disks. These Servers are spread across 4 racks in our datacenter. Every rack holds 3 OSD Server. We have a replication factor of 4 and a crush rule applied that says step chooseleaf firstn 0 type rack. So, in my oppinion, every rack should hold a copy of all the data in our ceph cluster. Is that more or less correct? So, our cluster is in state health OK and I am rebooting one of our OSD servers. That means 60 of 720 OSDs are going down. Since this hardware takes quite some time to boot up, we are using mon osd down out subtree limit = host to avoid rebalancing when a whole server goes down. Ceph show this output of ceph -s while the OSDs are down: health HEALTH_WARN 7 pgs degraded; 1 pgs peering; 7 pgs stuck degraded; 1 pgs stuck inactive; 8 pgs stuck unclean; 7 pgs stuck und ersized; 7 pgs undersized; recovery 623/7420 objects degraded (8.396%); 60/720 in osds are down monmap e5: 5 mons at {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=1 0.76.28.9:6789/0}, election epoch 228, quorum 0,1,2,3,4 mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03 osdmap e60390: 720 osds: 660 up, 720 in pgmap v15427437: 67584 pgs, 2 pools, 7253 MB data, 1855 objects 3948 GB used, 1304 TB / 1308 TB avail 623/7420 objects degraded (8.396%) 45356 active+clean 1 peering 7 active+undersized+degraded The pgs that are degraded and undersized are not a problem, since this behaviour is expected. I am worried about the peering pg (it stays in this state until all osds are up again) since this would cause I/O to hang if I am not mistaken. After the host is back up and all OSDs are up and running again, I see this: health HEALTH_WARN 2 pgs stuck unclean monmap e5: 5 mons at {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0}, election epoch 228, quorum 0,1,2,3,4 mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03 osdmap e60461: 720 osds: 720 up, 720 in pgmap v15427555: 67584 pgs, 2 pools, 7253 MB data, 1855 objects 3972 GB used, 1304 TB / 1308 TB avail 2 inactive 67582 active+clean Without any interaction, it will stay in this state. I guess these two inactive pgs will also cause I/O to hang? Some more information: ceph health detail HEALTH_WARN 2 pgs stuck unclean pg 9.f765 is stuck unclean for 858.298811, current state inactive, last acting [91,362,484,553] pg 9.ea0f is stuck unclean for 963.441117, current state inactive, last acting [91,233,485,524] I was trying to give osd.91 a kick with ceph osd down 91 After the osd is back in the cluster: health HEALTH_WARN 3 pgs peering; 54 pgs stuck inactive; 57 pgs stuck unclean So even worse. I decided to take the osd out. The cluster goes back to HEALTH_OK. Bringing the OSD back in, the cluster does some rebalancing, ending with the cluster in an OK state again. That actually happens everytime when there are some OSDs going down. I don't understand why the cluster is not able to get back to a healthy state without admin interaction. In a setup with several hundred OSDs it is normal business that some of the go down from time to time. Are there any ideas why this is happening? Right now, we do not have many data in our cluster, so I can do some tests. Any suggestions would be appreciated. 
Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
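For context, a rack-level CRUSH rule of the kind described ("step chooseleaf firstn 0 type rack") looks roughly like this in a decompiled crushmap; the rule name and numbers here are illustrative, not taken from the cluster above:

$ ceph osd getcrushmap -o crushmap.bin && crushtool -d crushmap.bin -o crushmap.txt

rule replicated_rack {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}

# and the option mentioned above that keeps a rebooting host from being marked out:
[mon]
    mon osd down out subtree limit = host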
[ceph-users] Placementgroups stuck peering
Hi all, after our cluster problems with incomplete placement groups, we've decided to remove our pools and create new ones. This was going fine in the beginning. After adding an additional OSD server, we now have 2 PGs that are stuck in the peering state: HEALTH_WARN 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean pg 9.2e41 is stuck inactive for 52540.202628, current state peering, last acting [91,240,273] pg 9.bad5 is stuck inactive for 52540.077013, current state peering, last acting [335,64,273] pg 9.2e41 is stuck unclean for 65683.195508, current state peering, last acting [91,240,273] pg 9.bad5 is stuck unclean for 65683.218581, current state peering, last acting [335,64,273] pg 9.bad5 is peering, acting [335,64,273] pg 9.2e41 is peering, acting [91,240,273] I was checking the placement groups with ceph pg query, but I found no reason why peering cannot complete. The output of ceph pg 9.2e41 query: http://pastebin.com/fyC8Qepv Any ideas? Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
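A few commands that may help narrow down why the two PGs stay in peering; the PG and OSD ids are taken from the health output above, and the init command assumes sysvinit as used on Debian Wheezy:

$ ceph pg dump_stuck inactive
$ ceph pg 9.2e41 query | less        # check the recovery_state / peering_blocked_by sections
$ ceph pg map 9.2e41                 # acting set; the first entry (here osd.91) is the primary
# restarting the primary of a stuck PG re-triggers peering:
$ /etc/init.d/ceph restart osd.91    # on the host carrying osd.91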
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
Hi Lionel, we have a ceph cluster with about 1PB in total, 12 OSD servers with 60 disks each, divided into 4 racks in 2 rooms, all connected with a dedicated 10G cluster network. Of course with a replication level of 3. We did about 9 months of intensive testing. Just like you, we had never experienced that kind of problem before. An incomplete PG was recovering as soon as at least one OSD holding a copy of it came back up. We still don't know what caused this specific error, but at no point were more than two hosts down at the same time. Our pool has a min_size of 1. And after everything was up again, we had completely LOST 2 of 3 pg copies (the directories on the OSDs were empty) and the third copy was obviously broken, because even manually injecting this pg into the other OSDs didn't change anything. My main problem here is that with even one incomplete PG your pool is rendered unusable. And there is currently no way to make ceph forget about the data of this pg and create it as an empty one. So the only way to make this pool usable again is to lose all your data in there. Which for me is just not acceptable. Regards, Christian On 07.01.2015 21:10, Lionel Bouton wrote: On 12/30/14 16:36, Nico Schottelius wrote: Good evening, we also tried to rescue data *from* our old / broken pool by map'ing the rbd devices, mounting them on a host and rsync'ing away as much as possible. However, after some time rsync got completely stuck and eventually the host which mounted the rbd mapped devices decided to kernel panic, at which time we decided to drop the pool and go with a backup. This story and the one of Christian makes me wonder: Is anyone using ceph as a backend for qemu VM images in production? Yes, with Ceph 0.80.5 since September, after extensive testing over several months (including an earlier version IIRC) and some hardware failure simulations. We plan to upgrade one storage host and one monitor to 0.80.7 to validate this version over several months too before migrating the others. And: Has anyone on the list been able to recover from a pg incomplete / stuck situation like ours? Only by adding back an OSD with the data needed to reach min_size for said pg, which is expected behavior. Even with some experimentation with isolated unstable OSDs I've not yet witnessed a case where Ceph lost multiple replicas simultaneously (we lost one OSD to disk failure and another to a BTRFS bug, but without trying to recover the filesystem, so we might have been able to recover this OSD). If your setup is susceptible to situations where you can lose all replicas you will lose data, but there's not much that can be done about that. Ceph actually begins to generate new replicas to replace the missing ones after mon osd down out interval, so the actual loss should not happen unless you lose (and can't recover) size OSDs on separate hosts (with default crush map) simultaneously. Before going into production you should know how long Ceph will take to fully recover from a disk or host failure by testing it with load. Your setup might not be robust if it hasn't the available disk space or the speed needed to recover quickly from such a failure.
Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
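Since min_size comes up above: with min_size 1 a PG stays active, and accepts writes, with only a single surviving copy. Checking and raising it is a one-liner; "rbd" is just an example pool name and 2 an example value:

$ ceph osd pool get rbd size
$ ceph osd pool get rbd min_size
$ ceph osd pool set rbd min_size 2    # block I/O when fewer than 2 copies are available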
[ceph-users] Documentation of ceph pg num query
Hi all, as mentioned last year, our ceph cluster is still broken and unusable. We are still investigating what has happened and I am taking a deeper look into the output of ceph pg <pgnum> query. The problem is that I can find some information about what some of the sections mean, but mostly I can only guess. Is there any kind of documentation where I can find explanations of what is stated there? Because without that the output is barely useful. Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Nico and all others who answered, After some more attempts to somehow get the pgs into a working state (I've tried force_create_pg, which put them into the creating state. But that was obviously not true, since after rebooting one of the OSDs holding them they went back to incomplete), I decided to save what can be saved. I've created a new pool, created a new image there, and mapped the old image from the old pool and the new image from the new pool to a machine, to copy the data at the POSIX level. Unfortunately, formatting the image from the new pool hangs after some time. So it seems that the new pool is suffering from the same problem as the old pool, which is totally incomprehensible to me. Right now, it seems like Ceph is giving me no options to either save some of the still intact rbd volumes, or to create a new pool alongside the old one to at least enable our clients to send data to ceph again. To tell the truth, I guess that will result in the end of our ceph project (already running for 9 months). Regards, Christian On 29.12.2014 15:59, Nico Schottelius wrote: Hey Christian, Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]: [incomplete PG / RBD hanging, osd lost also not helping] that is very interesting to hear, because we had a similar situation with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg directories to allow OSDs to start after the disk filled up completely. So I am sorry for not being able to give you a good hint, but I am very interested in seeing your problem solved, as it is a show stopper for us, too. (*) Cheers, Nico (*) We migrated from sheepdog to gluster to ceph and so far sheepdog seems to run much smoother. The first one is however not supported by opennebula directly, the second one not flexible enough to host our heterogeneous infrastructure (mixed disk sizes/amounts) - so we are using ceph at the moment. -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
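The rescue path described above, written out as a sketch with placeholder names (pool "rescue", image "foo"); the PG counts, sizes and device numbers are examples only:

$ ceph osd pool create rescue 4096 4096
$ rbd create rescue/foo-copy --size 1048576        # same size as the old image, in MB
$ rbd map rbd/foo && rbd map rescue/foo-copy
$ rbd showmapped                                   # find the two /dev/rbdX devices
$ mkfs.xfs /dev/rbd1                               # this is the step that hung here
$ mount /dev/rbd0 /mnt/old && mount /dev/rbd1 /mnt/new
$ rsync -aHAX /mnt/old/ /mnt/new/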
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Eneko, I tried an rbd cp before, but that was hanging as well, and I couldn't find out whether the source image or the destination image was causing the hang. That's why I decided to try a POSIX copy. Our cluster is still nearly empty (12TB / 867TB). But as far as I understand (if not, somebody please correct me), placement groups are generally not shared between pools at all (the quick check below illustrates this). Regards, Christian Am 30.12.2014 12:23, schrieb Eneko Lacunza: Hi Christian, Have you tried to migrate the disk from the old storage (pool) to the new one? I think it should show the same problem, but I think it'd be a much easier path to recover than the posix copy. How full is your storage? Maybe you can customize the crushmap, so that some OSDs are left in the bad (default) pool, and other OSDs and set for the new pool. It think (I'm yet learning ceph) that this will make different pgs for each pool, also different OSDs, may be this way you can overcome the issue. Cheers Eneko On 30/12/14 12:17, Christian Eichelmann wrote: Hi Nico and all others who answered, After some more trying to somehow get the pgs in a working state (I've tried force_create_pg, which was putting then in creating state. But that was obviously not true, since after rebooting one of the containing osd's it went back to incomplete), I decided to save what can be saved. I've created a new pool, created a new image there, mapped the old image from the old pool and the new image from the new pool to a machine, to copy data on posix level. Unfortunately, formatting the image from the new pool hangs after some time. So it seems that the new pool is suffering from the same problem as the old pool. Which is totaly not understandable for me. Right now, it seems like Ceph is giving me no options to either save some of the still intact rbd volumes, or to create a new pool along the old one to at least enable our clients to send data to ceph again. To tell the truth, I guess that will result in the end of our ceph project (running for already 9 Monthes). Regards, Christian Am 29.12.2014 15:59, schrieb Nico Schottelius: Hey Christian, Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]: [incomplete PG / RBD hanging, osd lost also not helping] that is very interesting to hear, because we had a similar situation with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg directories to allow OSDs to start after the disk filled up completly. So I am sorry not to being able to give you a good hint, but I am very interested in seeing your problem solved, as it is a show stopper for us, too. (*) Cheers, Nico (*) We migrated from sheepdog to gluster to ceph and so far sheepdog seems to run much smoother. The first one is however not supported by opennebula directly, the second one not flexible enough to host our heterogeneous infrastructure (mixed disk sizes/amounts) - so we are using ceph at the moment. -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
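That placement groups are never shared between pools can also be seen from the PG ids themselves: the number before the dot is the pool id, so the same object name maps to a different PG (and usually different OSDs) in every pool. A quick check with made-up pool and object names:
$ ceph osd map rbd foo.rbd        # -> something like pg 3.xxxx on the acting set of pool 3
$ ceph osd map rbd_new foo.rbd    # -> pg 4.xxxx, a completely separate PG in the new pool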
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Eneko, nope, new pool has all pgs active+clean, not errors during image creation. The format command just hangs, without error. Am 30.12.2014 12:33, schrieb Eneko Lacunza: Hi Christian, New pool's pgs also show as incomplete? Did you notice something remarkable in ceph logs in the new pools image format? On 30/12/14 12:31, Christian Eichelmann wrote: Hi Eneko, I was trying a rbd cp before, but that was haning as well. But I couldn't find out if the source image was causing the hang or the destination image. That's why I decided to try a posix copy. Our cluster is sill nearly empty (12TB / 867TB). But as far as I understood (If not, somebody please correct me) placement groups are in genereally not shared between pools at all. Regards, Christian Am 30.12.2014 12:23, schrieb Eneko Lacunza: Hi Christian, Have you tried to migrate the disk from the old storage (pool) to the new one? I think it should show the same problem, but I think it'd be a much easier path to recover than the posix copy. How full is your storage? Maybe you can customize the crushmap, so that some OSDs are left in the bad (default) pool, and other OSDs and set for the new pool. It think (I'm yet learning ceph) that this will make different pgs for each pool, also different OSDs, may be this way you can overcome the issue. Cheers Eneko On 30/12/14 12:17, Christian Eichelmann wrote: Hi Nico and all others who answered, After some more trying to somehow get the pgs in a working state (I've tried force_create_pg, which was putting then in creating state. But that was obviously not true, since after rebooting one of the containing osd's it went back to incomplete), I decided to save what can be saved. I've created a new pool, created a new image there, mapped the old image from the old pool and the new image from the new pool to a machine, to copy data on posix level. Unfortunately, formatting the image from the new pool hangs after some time. So it seems that the new pool is suffering from the same problem as the old pool. Which is totaly not understandable for me. Right now, it seems like Ceph is giving me no options to either save some of the still intact rbd volumes, or to create a new pool along the old one to at least enable our clients to send data to ceph again. To tell the truth, I guess that will result in the end of our ceph project (running for already 9 Monthes). Regards, Christian Am 29.12.2014 15:59, schrieb Nico Schottelius: Hey Christian, Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]: [incomplete PG / RBD hanging, osd lost also not helping] that is very interesting to hear, because we had a similar situation with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg directories to allow OSDs to start after the disk filled up completly. So I am sorry not to being able to give you a good hint, but I am very interested in seeing your problem solved, as it is a show stopper for us, too. (*) Cheers, Nico (*) We migrated from sheepdog to gluster to ceph and so far sheepdog seems to run much smoother. The first one is however not supported by opennebula directly, the second one not flexible enough to host our heterogeneous infrastructure (mixed disk sizes/amounts) - so we are using ceph at the moment. 
-- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
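When a map or mkfs just hangs without any error, the usual places to look are the client's kernel log and the cluster's blocked-request warnings; a rough checklist (the osd id is only a placeholder):
$ dmesg | tail -n 50                       # on the client: hung task, libceph or rbd messages
$ ceph health detail                       # on the cluster: stuck PGs and slow/blocked request warnings
$ ceph --admin-daemon /var/run/ceph/ceph-osd.17.asok dump_ops_in_flight   # what a suspect OSD is currently waiting for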
[ceph-users] Ceph PG Incomplete = Cluster unusable
Hi all, we have a ceph cluster with currently 360 OSDs in 11 systems. Last week we replaced one OSD system with a new one. During that, we had a lot of problems with OSDs crashing on all of our systems, but that is not our current problem. After we got everything up and running again, we still have 3 PGs in the state incomplete. I checked one of them directly on the systems (replication factor is 3): on two machines the PG directory was there but empty, on the third one I found some content. Using ceph_objectstore_tool I exported this PG and imported it on the other nodes (roughly as sketched below); nothing changed. We only use ceph for providing rbd images. Right now, two of them are unusable, because ceph hangs when someone tries to access content in these PGs. As if that weren't bad enough, when I create a new rbd image ceph still uses the incomplete PGs, so it is pure gambling whether a new volume will be usable or not. That, for now, makes our 900TB ceph cluster unusable because of 3 bad PGs. And at this point it seems I can't do anything: instructing the ceph cluster to scrub, deep-scrub or repair the PG does nothing, even after several days. Checking which rbd images are affected is also not possible, because rados -p poolname ls hangs forever when it comes to one of the incomplete PGs. ceph osd lost also does effectively nothing. So right now I am fine with losing the content of these three PGs. How can I get the cluster back to life without deleting the whole pool, which is not up for discussion? Regards, Christian P.S. We are using Giant ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
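For the record, the export/import of the surviving PG copy was done roughly like this (OSD ids, paths and the pg id are placeholders; the OSDs involved have to be stopped while the tool runs):
$ service ceph stop osd.123
$ ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-123 --journal-path /var/lib/ceph/osd/ceph-123/journal --op export --pgid 3.1a7 --file /tmp/pg.3.1a7.export
$ service ceph stop osd.45      # on a node that holds one of the empty copies
$ ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-45 --journal-path /var/lib/ceph/osd/ceph-45/journal --op import --file /tmp/pg.3.1a7.export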
Re: [ceph-users] OSDs are crashing with Cannot fork or cannot create thread but plenty of memory is left
Hi Nathan, that was indeed the problem! I increased the kernel.pid_max value to 65535 and the problem is gone (the persistent version of that change is sketched below) - thank you! It was a bit misleading that there is also /proc/sys/kernel/threads-max, which has a much higher value. And since I was only seeing around 400 processes and wasn't aware that threads also consume pids, it was hard to find the root cause of this issue. Now that this problem is solved, I'm wondering whether it is a good idea to run about 40,000 threads (in an idle cluster) on one machine. The system has a load of around 6-7 without any traffic, maybe just because of the intense context switching. Anyway, that's another topic. Thank you for your help! Regards, Christian Am 23.09.2014 03:21, schrieb Nathan O'Sullivan: Hi Christian, Your problem is probably that your kernel.pid_max (the maximum threads+processes across the entire system) needs to be increased - the default is 32768, which is too low for even a medium density deployment. You can test this easily enough with $ ps axms | wc -l If you get a number around the 30,000 mark then you are going to be affected. There's an issue here http://tracker.ceph.com/issues/6142 , although it doesn't seem to have gotten much traction in terms of informing users. Regards Nathan On 15/09/2014 7:13 PM, Christian Eichelmann wrote: Hi all, I have no idea why running out of filehandles should produce a out of memory error, but well. I've increased the ulimit as you told me, and nothing changed. I've noticed that the osd init script sets the max open file handles explicitly, so I was setting the corresponding option in my ceph conf. Now the limits of an OSD process look like this: Limit Soft Limit Hard Limit Units Max cpu time unlimitedunlimited seconds Max file size unlimitedunlimited bytes Max data size unlimitedunlimited bytes Max stack size8388608 unlimited bytes Max core file sizeunlimitedunlimited bytes Max resident set unlimitedunlimited bytes Max processes 2067478 2067478 processes Max open files6553665536 files Max locked memory 6553665536 bytes Max address space unlimitedunlimited bytes Max file locksunlimitedunlimited locks Max pending signals 2067478 2067478 signals Max msgqueue size 819200 819200 bytes Max nice priority 00 Max realtime priority 00 Max realtime timeout unlimitedunlimitedus Anyways, the exact same behavior as before. I was also finding a mailing on this list from someone who had the exact same problem: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040059.html Unfortunately, there was also no real solution for this problem. So again: this is *NOT* a ulimit issue. We were running emperor and dumpling on the same hardware without any issues. They first started after our upgrade to firefly. Regards, Christian Am 12.09.2014 18:26, schrieb Christian Balzer: On Fri, 12 Sep 2014 12:05:06 -0400 Brian Rak wrote: That's not how ulimit works. Check the `ulimit -a` output. Indeed. And to forestall the next questions, see man initscript, mine looks like this: --- ulimit -Hn 131072 ulimit -Sn 65536 # Execute the program. eval exec $4 --- And also a /etc/security/limits.d/tuning.conf (debian) like this: --- rootsoftnofile 65536 roothardnofile 131072 * softnofile 16384 * hardnofile 65536 --- Adjusted to your actual needs. There might be other limits you're hitting, but that is the most likely one Also 45 OSDs with 12 (24 with HT, bleah) CPU cores is pretty ballsy. I personally would rather do 4 RAID6 (10 disks, with OSD SSD journals) with that kind of case and enjoy the fact that my OSDs never fail.
^o^ Christian (another one) On 9/12/2014 10:15 AM, Christian Eichelmann wrote: Hi, I am running all commands as root, so there are no limits for the processes. Regards, Christian ___ Von: Mariusz Gronczewski [mariusz.gronczew...@efigence.com] Gesendet: Freitag, 12. September 2014 15:33 An: Christian Eichelmann Cc: ceph-users@lists.ceph.com Betreff: Re: [ceph-users] OSDs are crashing with Cannot fork or cannot create thread but plenty of memory is left do cat /proc/pid/limits probably you hit max processes limit or max FD limit Hi Ceph-Users, I have absolutely no idea what is going on on my systems... Hardware: 45 x 4TB Harddisks 2 x 6 Core CPUs 256GB Memory When initializing all disks and join them to the cluster, after approximately 30 OSDs
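For reference, the change that fixed it, made persistent across reboots (65535 is simply the value used here; 64-bit kernels allow much higher values):
$ sysctl -w kernel.pid_max=65535                                  # takes effect immediately
$ echo 'kernel.pid_max = 65535' > /etc/sysctl.d/90-pid-max.conf   # survives a reboot
$ ps axms | wc -l                                                 # threads + processes, to compare against the limit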
Re: [ceph-users] OSDs are crashing with Cannot fork or cannot create thread but plenty of memory is left
Hi all, I have no idea why running out of filehandles should produce a out of memory error, but well. I've increased the ulimit as you told me, and nothing changed. I've noticed that the osd init script sets the max open file handles explicitly, so I was setting the corresponding option in my ceph conf. Now the limits of an OSD process look like this: Limit Soft Limit Hard Limit Units Max cpu time unlimitedunlimited seconds Max file size unlimitedunlimited bytes Max data size unlimitedunlimited bytes Max stack size8388608 unlimited bytes Max core file sizeunlimitedunlimited bytes Max resident set unlimitedunlimited bytes Max processes 2067478 2067478 processes Max open files6553665536 files Max locked memory 6553665536 bytes Max address space unlimitedunlimited bytes Max file locksunlimitedunlimited locks Max pending signals 2067478 2067478 signals Max msgqueue size 819200 819200 bytes Max nice priority 00 Max realtime priority 00 Max realtime timeout unlimitedunlimitedus Anyways, the exact same behavior as before. I was also finding a mailing on this list from someone who had the exact same problem: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040059.html Unfortunately, there was also no real solution for this problem. So again: this is *NOT* a ulimit issue. We were running emperor and dumpling on the same hardware without any issues. They first started after our upgrade to firefly. Regards, Christian Am 12.09.2014 18:26, schrieb Christian Balzer: On Fri, 12 Sep 2014 12:05:06 -0400 Brian Rak wrote: That's not how ulimit works. Check the `ulimit -a` output. Indeed. And to forestall the next questions, see man initscript, mine looks like this: --- ulimit -Hn 131072 ulimit -Sn 65536 # Execute the program. eval exec $4 --- And also a /etc/security/limits.d/tuning.conf (debian) like this: --- rootsoftnofile 65536 roothardnofile 131072 * softnofile 16384 * hardnofile 65536 --- Adjusted to your actual needs. There might be other limits you're hitting, but that is the most likely one Also 45 OSDs with 12 (24 with HT, bleah) CPU cores is pretty ballsy. I personally would rather do 4 RAID6 (10 disks, with OSD SSD journals) with that kind of case and enjoy the fact that my OSDs never fail. ^o^ Christian (another one) On 9/12/2014 10:15 AM, Christian Eichelmann wrote: Hi, I am running all commands as root, so there are no limits for the processes. Regards, Christian ___ Von: Mariusz Gronczewski [mariusz.gronczew...@efigence.com] Gesendet: Freitag, 12. September 2014 15:33 An: Christian Eichelmann Cc: ceph-users@lists.ceph.com Betreff: Re: [ceph-users] OSDs are crashing with Cannot fork or cannot create thread but plenty of memory is left do cat /proc/pid/limits probably you hit max processes limit or max FD limit Hi Ceph-Users, I have absolutely no idea what is going on on my systems... Hardware: 45 x 4TB Harddisks 2 x 6 Core CPUs 256GB Memory When initializing all disks and join them to the cluster, after approximately 30 OSDs, other osds are crashing. When I try to start them again I see different kinds of errors. 
For example: Starting Ceph osd.316 on ceph-osd-bs04...already running === osd.317 === Traceback (most recent call last): File /usr/bin/ceph, line 830, in module sys.exit(main()) File /usr/bin/ceph, line 773, in main sigdict, inbuf, verbose) File /usr/bin/ceph, line 420, in new_style_command inbuf=inbuf) File /usr/lib/python2.7/dist-packages/ceph_argparse.py, line 1112, in json_command raise RuntimeError('{0}: exception {1}'.format(cmd, e)) NameError: global name 'cmd' is not defined Exception thread.error: error(can't start new thread,) in bound method Rados.__del__ of rados.Rados object at 0x29ee410 ignored or: /etc/init.d/ceph: 190: /etc/init.d/ceph: Cannot fork /etc/init.d/ceph: 191: /etc/init.d/ceph: Cannot fork /etc/init.d/ceph: 192: /etc/init.d/ceph: Cannot fork or: /usr/bin/ceph-crush-location: 72: /usr/bin/ceph-crush-location: Cannot fork /usr/bin/ceph-crush-location: 79: /usr/bin/ceph-crush-location: Cannot fork Thread::try_create(): pthread_create failed with error 11common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fcf768c9760 time 2014-09-12 15:00:28.284735 common/Thread.cc: 110: FAILED assert(ret == 0) ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6
[ceph-users] OSDs are crashing with Cannot fork or cannot create thread but plenty of memory is left
Hi Ceph-Users, I have absolutely no idea what is going on on my systems... Hardware: 45 x 4TB Harddisks 2 x 6 Core CPUs 256GB Memory When initializing all disks and join them to the cluster, after approximately 30 OSDs, other osds are crashing. When I try to start them again I see different kinds of errors. For example: Starting Ceph osd.316 on ceph-osd-bs04...already running === osd.317 === Traceback (most recent call last): File /usr/bin/ceph, line 830, in module sys.exit(main()) File /usr/bin/ceph, line 773, in main sigdict, inbuf, verbose) File /usr/bin/ceph, line 420, in new_style_command inbuf=inbuf) File /usr/lib/python2.7/dist-packages/ceph_argparse.py, line 1112, in json_command raise RuntimeError('{0}: exception {1}'.format(cmd, e)) NameError: global name 'cmd' is not defined Exception thread.error: error(can't start new thread,) in bound method Rados.__del__ of rados.Rados object at 0x29ee410 ignored or: /etc/init.d/ceph: 190: /etc/init.d/ceph: Cannot fork /etc/init.d/ceph: 191: /etc/init.d/ceph: Cannot fork /etc/init.d/ceph: 192: /etc/init.d/ceph: Cannot fork or: /usr/bin/ceph-crush-location: 72: /usr/bin/ceph-crush-location: Cannot fork /usr/bin/ceph-crush-location: 79: /usr/bin/ceph-crush-location: Cannot fork Thread::try_create(): pthread_create failed with error 11common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fcf768c9760 time 2014-09-12 15:00:28.284735 common/Thread.cc: 110: FAILED assert(ret == 0) ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6) 1: /usr/bin/ceph-conf() [0x51de8f] 2: (CephContext::CephContext(unsigned int)+0xb1) [0x520fe1] 3: (common_preinit(CephInitParameters const, code_environment_t, int)+0x48) [0x52eb78] 4: (global_pre_init(std::vectorchar const*, std::allocatorchar const* *, std::vectorchar const*, std::allocatorchar const* , unsigned int, code_environment_t, int)+0x8d) [0x518d0d] 5: (main()+0x17a) [0x514f6a] 6: (__libc_start_main()+0xfd) [0x7fcf7522ceed] 7: /usr/bin/ceph-conf() [0x5168d1] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. terminate called after throwing an instance of 'ceph::FailedAssertion' Aborted (core dumped) /etc/init.d/ceph: 340: /etc/init.d/ceph: Cannot fork /etc/init.d/ceph: 1: /etc/init.d/ceph: Cannot fork Traceback (most recent call last): File /usr/bin/ceph, line 830, in module sys.exit(main()) File /usr/bin/ceph, line 590, in main conffile=conffile) File /usr/lib/python2.7/dist-packages/rados.py, line 198, in __init__ librados_path = find_library('rados') File /usr/lib/python2.7/ctypes/util.py, line 224, in find_library return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name)) File /usr/lib/python2.7/ctypes/util.py, line 213, in _findSoname_ldconfig f = os.popen('/sbin/ldconfig -p 2/dev/null') OSError: [Errno 12] Cannot allocate memory But anyways, when I look at the memory consumption of the system: # free -m total used free sharedbuffers cached Mem:258450 25841 232609 0 18 15506 -/+ buffers/cache: 10315 248135 Swap: 3811 0 3811 There are more then 230GB of memory available! What is going on there? System: Linux ceph-osd-bs04 3.14-0.bpo.1-amd64 #1 SMP Debian 3.14.12-1~bpo70+1 (2014-07-13) x86_64 GNU/Linux Since this is happening on other Hardware as well, I don't think it's Hardware related. I have no Idea if this is an OS issue (which would be seriously strange) or a ceph issue. Since this is happening only AFTER we upgraded to firefly, I guess it has something to do with ceph. 
ANY idea on what is going on here would be very appreciated! Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
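When fork() or pthread_create() fails although free shows plenty of memory, it is almost always a task or thread limit rather than RAM. A quick way to check (the pid is a placeholder for one ceph-osd process):
$ cat /proc/sys/kernel/pid_max                          # system-wide limit on pids, which counts threads as well as processes
$ ps axms | wc -l                                       # how many threads the box is currently running
$ grep -i 'processes\|open files' /proc/12345/limits    # per-process limits of a single OSD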
Re: [ceph-users] OSDs are crashing with Cannot fork or cannot create thread but plenty of memory is left
Hi, I am running all commands as root, so there are no limits for the processes. Regards, Christian ___ Von: Mariusz Gronczewski [mariusz.gronczew...@efigence.com] Gesendet: Freitag, 12. September 2014 15:33 An: Christian Eichelmann Cc: ceph-users@lists.ceph.com Betreff: Re: [ceph-users] OSDs are crashing with Cannot fork or cannot create thread but plenty of memory is left do cat /proc/pid/limits probably you hit max processes limit or max FD limit Hi Ceph-Users, I have absolutely no idea what is going on on my systems... Hardware: 45 x 4TB Harddisks 2 x 6 Core CPUs 256GB Memory When initializing all disks and join them to the cluster, after approximately 30 OSDs, other osds are crashing. When I try to start them again I see different kinds of errors. For example: Starting Ceph osd.316 on ceph-osd-bs04...already running === osd.317 === Traceback (most recent call last): File /usr/bin/ceph, line 830, in module sys.exit(main()) File /usr/bin/ceph, line 773, in main sigdict, inbuf, verbose) File /usr/bin/ceph, line 420, in new_style_command inbuf=inbuf) File /usr/lib/python2.7/dist-packages/ceph_argparse.py, line 1112, in json_command raise RuntimeError('{0}: exception {1}'.format(cmd, e)) NameError: global name 'cmd' is not defined Exception thread.error: error(can't start new thread,) in bound method Rados.__del__ of rados.Rados object at 0x29ee410 ignored or: /etc/init.d/ceph: 190: /etc/init.d/ceph: Cannot fork /etc/init.d/ceph: 191: /etc/init.d/ceph: Cannot fork /etc/init.d/ceph: 192: /etc/init.d/ceph: Cannot fork or: /usr/bin/ceph-crush-location: 72: /usr/bin/ceph-crush-location: Cannot fork /usr/bin/ceph-crush-location: 79: /usr/bin/ceph-crush-location: Cannot fork Thread::try_create(): pthread_create failed with error 11common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fcf768c9760 time 2014-09-12 15:00:28.284735 common/Thread.cc: 110: FAILED assert(ret == 0) ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6) 1: /usr/bin/ceph-conf() [0x51de8f] 2: (CephContext::CephContext(unsigned int)+0xb1) [0x520fe1] 3: (common_preinit(CephInitParameters const, code_environment_t, int)+0x48) [0x52eb78] 4: (global_pre_init(std::vectorchar const*, std::allocatorchar const* *, std::vectorchar const*, std::allocatorchar const* , unsigned int, code_environment_t, int)+0x8d) [0x518d0d] 5: (main()+0x17a) [0x514f6a] 6: (__libc_start_main()+0xfd) [0x7fcf7522ceed] 7: /usr/bin/ceph-conf() [0x5168d1] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. terminate called after throwing an instance of 'ceph::FailedAssertion' Aborted (core dumped) /etc/init.d/ceph: 340: /etc/init.d/ceph: Cannot fork /etc/init.d/ceph: 1: /etc/init.d/ceph: Cannot fork Traceback (most recent call last): File /usr/bin/ceph, line 830, in module sys.exit(main()) File /usr/bin/ceph, line 590, in main conffile=conffile) File /usr/lib/python2.7/dist-packages/rados.py, line 198, in __init__ librados_path = find_library('rados') File /usr/lib/python2.7/ctypes/util.py, line 224, in find_library return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name)) File /usr/lib/python2.7/ctypes/util.py, line 213, in _findSoname_ldconfig f = os.popen('/sbin/ldconfig -p 2/dev/null') OSError: [Errno 12] Cannot allocate memory But anyways, when I look at the memory consumption of the system: # free -m total used free sharedbuffers cached Mem:258450 25841 232609 0 18 15506 -/+ buffers/cache: 10315 248135 Swap: 3811 0 3811 There are more then 230GB of memory available! 
What is going on there? System: Linux ceph-osd-bs04 3.14-0.bpo.1-amd64 #1 SMP Debian 3.14.12-1~bpo70+1 (2014-07-13) x86_64 GNU/Linux Since this is happening on other Hardware as well, I don't think it's Hardware related. I have no Idea if this is an OS issue (which would be seriously strange) or a ceph issue. Since this is happening only AFTER we upgraded to firefly, I guess it has something to do with ceph. ANY idea on what is going on here would be very appreciated! Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Mariusz Gronczewski, Administrator Efigence S. A. ul. Wołoska 9a, 02-583 Warszawa T: [+48] 22 380 13 13 F: [+48] 22 380 13 14 E: mariusz.gronczew...@efigence.com mailto:mariusz.gronczew...@efigence.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] scrub error on firefly
I can also confirm that after upgrading to firefly, both of our clusters (test and live) went from 0 scrub errors each over about 6 months to about 9-12 per week... This also makes me kind of nervous, since as far as I know all ceph pg repair does is copy the primary object to all replicas, no matter which object is the correct one. Of course the described method of manual checking works (for pools with more than 2 replicas), but doing this in a large cluster nearly every week is horribly time-consuming and error-prone (a short sketch of the manual check follows below). It would be great to get an explanation for the increased number of scrub errors since firefly. Were they just not detected correctly in previous versions? Or is there maybe something wrong with the new code? Actually, our company is currently keeping our projects from moving to Ceph because of this problem. Regards, Christian Von: ceph-users [ceph-users-boun...@lists.ceph.com] im Auftrag von Travis Rhoden [trho...@gmail.com] Gesendet: Donnerstag, 10. Juli 2014 16:24 An: Gregory Farnum Cc: ceph-users@lists.ceph.com Betreff: Re: [ceph-users] scrub error on firefly And actually just to follow-up, it does seem like there are some additional smarts beyond just using the primary to overwrite the secondaries... Since I captured md5 sums before and after the repair, I can say that in this particular instance, the secondary copy was used to overwrite the primary. So, I'm just trusting Ceph to the right thing, and so far it seems to, but the comments here about needing to determine the correct object and place it on the primary PG make me wonder if I've been missing something. - Travis On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden trho...@gmail.com wrote: I can also say that after a recent upgrade to Firefly, I have experienced massive uptick in scrub errors. The cluster was on cuttlefish for about a year, and had maybe one or two scrub errors. After upgrading to Firefly, we've probably seen 3 to 4 dozen in the last month or so (was getting 2-3 a day for a few weeks until the whole cluster was rescrubbed, it seemed). What I cannot determine, however, is how to know which object is busted? For example, just today I ran into a scrub error. The object has two copies and is an 8MB piece of an RBD, and has identical timestamps, identical xattrs names and values. But it definitely has a different MD5 sum. How to know which one is correct? I've been just kicking off pg repair each time, which seems to just use the primary copy to overwrite the others. Haven't run into any issues with that so far, but it does make me nervous. - Travis On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum g...@inktank.com wrote: It's not very intuitive or easy to look at right now (there are plans from the recent developer summit to improve things), but the central log should have output about exactly what objects are busted. You'll then want to compare the copies manually to determine which ones are good or bad, get the good copy on the primary (make sure you preserve xattrs), and run repair.
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith rbsm...@adams.edu wrote: Greetings, I upgraded to firefly last week and I suddenly received this error: health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors ceph health detail shows the following: HEALTH_ERR 1 pgs inconsistent; 1 scrub errors pg 3.c6 is active+clean+inconsistent, acting [2,5] 1 scrub errors The docs say that I can run `ceph pg repair 3.c6` to fix this. What I want to know is what are the risks of data loss if I run that command in this state and how can I mitigate them? -- Randall Smith Computing Services Adams State University http://www.adams.edu/ 719-587-7741 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
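The manual comparison Greg describes boils down to locating the object's file in the PG directory on every OSD of the acting set and checksumming it there. A rough sketch, using the pg id from the health output above and a made-up object name fragment (best done while the PG sees no writes):
$ ceph pg map 3.c6                              # shows the up/acting set, i.e. which OSD hosts to log in to
$ find /var/lib/ceph/osd/ceph-2/current/3.c6_head -name '*rb.0.1234*' -exec md5sum {} +   # repeat on every replica's host
$ getfattr -d /var/lib/ceph/osd/ceph-2/current/3.c6_head/<objectfile>   # compare the xattrs as well before deciding which copy is good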
Re: [ceph-users] external monitoring tools for ceph
Hi all, if it should be nagios/icinga and not Zabbix, there is a remote check from me that can be found here: https://github.com/Crapworks/check_ceph_dash This one uses ceph-dash to monitor the overall cluster status via http: https://github.com/Crapworks/ceph-dash But it can be easily adopted to work together with ceph-rest-api since the outout is nearly the same. Regards, Christian Am 01.07.2014 10:24, schrieb Pierre BLONDEAU: Hi, May be you can use that : https://github.com/thelan/ceph-zabbix, but i am interested to view Craig's script and template. Regards Le 01/07/2014 10:16, Georgios Dimitrakakis a écrit : Hi Craig, I am also interested at the Zabbix templates and scripts if you can publish them. Regards, G. On Mon, 30 Jun 2014 18:15:12 -0700, Craig Lewis wrote: You should check out Calamari (https://github.com/ceph/calamari [3]), Inktanks monitoring and administration tool. I started before Calamari was announced, so I rolled my own using using Zabbix. It handles all the monitoring, graphing, and alerting in one tool. Its kind of a pain to setup, but works ok now that its going. I dont know how to handle the cluster view though. Im monitoring individual machines. Whenever something happens, like an OSD stops responding, I get an alert from every monitor. Otherwise its not a big deal. Im in the middle of re-factoring the data gathering from poll to push. If youre interested, I can publish my templates and scripts when Im done. On Sun, Jun 29, 2014 at 1:17 AM, pragya jain wrote: Hello all, I am working on ceph storage cluster with rados gateway for object storage. I am looking for external monitoring tools that can be used to monitor ceph storage cluster and rados gateway interface. I find various monitoring tools, such as nagios, collectd, ganglia, diamond, sensu, logstash. but i dont get details of anyone about what features do these monitoring tools monitor in ceph. Has somebody implemented anyone of these tools? Can somebody help me in identifying the features provided by these tools? Is there any other tool which can also be used to monitor ceph specially for object storage? Regards Pragya Jain ___ ceph-users mailing list ceph-users@lists.ceph.com [1] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [2] Links: -- [1] mailto:ceph-users@lists.ceph.com [2] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [3] https://github.com/ceph/calamari [4] mailto:prag_2...@yahoo.co.in ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
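A minimal sketch of what such a remote check does underneath (the URL is a placeholder, and it assumes the ceph-dash instance hands out the cluster status as JSON when asked for it instead of HTML):
$ curl -s -H 'Accept: application/json' http://cephdash.example.com:5000/ | python -m json.tool | grep overall_status
Anything other than HEALTH_OK in that field is what a Nagios check would normally map to WARNING or CRITICAL.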
[ceph-users] Behaviour of ceph pg repair on different replication levels
Hi ceph users, since our cluster has had a few inconsistent PGs lately, I was wondering what ceph pg repair does depending on the replication level. So I just wanted to check whether my assumptions are correct: Replication 2x: since the cluster cannot decide which version is the correct one, it would just copy the primary copy (the active one) over the secondary copy, which is a 50/50 chance of keeping the correct version. Replication 3x or more: now the cluster has a quorum, and ceph pg repair will replace the corrupt replica with one of the correct ones; no manual intervention needed. Am I on the right track? (The commands below show how to check which level a pool runs at.) Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
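To see which of the two cases applies, the replication count can be checked (and raised) per pool; 'rbd' is just the example pool name here:
$ ceph osd pool get rbd size        # number of replicas kept for each object
$ ceph osd pool get rbd min_size    # how many replicas must be up to serve IO
$ ceph osd pool set rbd size 3      # go from 2x to 3x replication so a majority can exist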
[ceph-users] PG Scrub Error / active+clean+inconsistent
Hi all, after coming back from a long weekend I found my production cluster in an error state, reporting 6 scrub errors and 6 PGs in active+clean+inconsistent state. Strangely, my pre-live cluster, running on different hardware, is also showing 1 scrub error and 1 inconsistent PG... pg dump shows that 6 different OSDs are affected. I will check again for hardware errors, but since the hardware is quite new and none of our monitoring checks found disk errors, I'm not sure about that. What can be the cause of such a problem? And, just as interesting, how do I recover from it? :) (A quick way to find the affected PGs and objects is sketched below.) Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
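A quick way to see which PGs and OSDs are affected and what exactly failed (the pg id is a placeholder taken from the health output):
$ ceph health detail | grep inconsistent    # lists the inconsistent pg ids
$ ceph pg map 2.3f                          # up/acting OSDs of one of the reported PGs
$ grep ERR /var/log/ceph/ceph.log           # on a monitor host: the deep-scrub lines naming the broken object and copy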
Re: [ceph-users] PG Scrub Error / active+clean+inconsistent
Hi again, I just found the ceph pg repair command :) Now both clusters are OK again. Anyway, I'm really interested in the cause of the problem. Regards, Christian Am 10.06.2014 10:28, schrieb Christian Eichelmann: Hi all, after coming back from a long weekend, I found my production cluster in an error state, mentioning 6 scrub errors and 6 pg's in active+clean+inconsistent state. Strange is, that my Prelive-Cluster, running on different Hardware, are also showing 1 scrub error and 1 inconsisten pg... pg dump shows that 6 different OSD's are affected. I will check again for some Hardware Errors, but since the hardware is quite new, and none of our monitoring checks found disk errors, I'm not sure about it. What can be the cause of such a problem? And, what is also interesting, how to recover from it? :) Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Eichelmann Systemadministrator 11 Internet AG - IT Operations Mail Media Advertising Targeting Brauerstraße 48 · DE-76135 Karlsruhe Telefon: +49 721 91374-8026 christian.eichelm...@1und1.de Amtsgericht Montabaur / HRB 6484 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen Aufsichtsratsvorsitzender: Michael Scheeren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Nagios Check for Ceph-Dash
Hi folks! For those of you who are using ceph-dash (https://github.com/Crapworks/ceph-dash), I've created a Nagios plugin that uses the JSON endpoint to monitor your cluster remotely: * https://github.com/Crapworks/check_ceph_dash I think this can easily be adapted to use ceph-rest-api as well. Since ceph-dash is completely read-only, there are fewer security concerns about exposing this API to your monitoring system. Any feedback is welcome! Regards, Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] visualizing a ceph cluster automatically
I have written a small and lightweight GUI, which can also act as a JSON REST API (for non-interactive monitoring): https://github.com/Crapworks/ceph-dash Maybe that's what you are searching for. Regards, Christian Von: ceph-users [ceph-users-boun...@lists.ceph.com] im Auftrag von Drew Weaver [drew.wea...@thenap.com] Gesendet: Freitag, 16. Mai 2014 14:01 An: 'ceph-users@lists.ceph.com' Betreff: [ceph-users] visualizing a ceph cluster automatically Does anyone know of any tools that help you visually monitor a ceph cluster automatically? Something that is host, osd, mon aware and shows various status of components, etc? Thanks, -Drew ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com