Re: [ceph-users] NFS interaction with RBD
Christian Schnidrig writes:
> Well that’s strange. I wonder why our systems behave so differently.

One point about our cluster (I work with Christian, who's still on vacation, and with Jens-Christian) is that it has 124 OSDs and 2048 PGs (I think) in the pool used for these RBD volumes. As a result, each connected RBD volume can result in up to 124 connections from the RBD client inside Qemu/KVM to the OSDs that store data from that volume.

I don't know the details of librbd's connection management. I assume that these librbd-to-OSD connections are only created once the client actually tries to access data on a given OSD. But when the VM actually touches a lot of data on its RBD volumes (which ours do), most of these connections will indeed be created. And librbd apparently doesn't handle the situation very gracefully when its process runs into the open file descriptor limit.

George only has 20 OSDs, so that is an upper bound on the number of TCP connections that librbd will open per RBD volume. He should be safe up to about 50 volumes per VM, assuming the default nofile limit of 1024.

The nasty thing is when everything has been running fine for ages, then you add a bunch of OSDs, run a few benchmarks, see that everything should run much BETTER (as promised :-), and then suddenly some VMs with lots of mounted volumes mysteriously start hanging.

> Maybe the number of placement groups plays a major role as well. Jens-Christian may be able to give you the specifics of our ceph cluster.

Me too, see above.

> I’m about to leave on vacation and don’t have time to look that up anymore.

Enjoy your well-earned vacation!!
--
Simon.
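A quick way to tell whether a particular VM is getting close to this limit is to compare the qemu process's current file descriptor count with its nofile limit on the hypervisor. A rough sketch (run as root; the instance name below is only a placeholder):

    # find the qemu process for the instance in question (name is an example)
    PID=$(pgrep -f 'qemu.*instance-00000042' | head -n 1)

    # how many file descriptors it has open right now, and how many are sockets
    ls /proc/$PID/fd | wc -l
    ls -l /proc/$PID/fd | grep -c socket

    # the limit the process is actually running with
    grep 'Max open files' /proc/$PID/limits

If the first number is anywhere near the soft limit, each additional volume (or each additional OSD the client has to talk to) can push it over.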
Re: [ceph-users] NFS interaction with RBD
Trent Lloyd writes:
> Jens-Christian Fischer writes:
>> I think we (i.e. Christian) found the problem:
>> We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit.
>>
>> So no deep scrubbing etc, but simply too many connections…
>
> Have seen mention of similar from CERN in their presentations, found this post on a quick google.. might help?
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html

Yes, that's exactly the problem that we had. We solved it by setting max_files to 8191 in /etc/libvirt/qemu.conf on all compute hosts. Once that was applied, we live-migrated the running instances so that they too could pick up the increased limit.
--
Simon.
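For reference, this is roughly what the change looks like; a sketch only, and the restart/migration commands assume an Ubuntu compute host running OpenStack Nova (the service is called libvirtd on other distributions):

    # /etc/libvirt/qemu.conf on every compute host
    max_files = 8191

    # restart libvirt so newly started qemu processes inherit the new limit
    service libvirt-bin restart

    # instances that are already running keep the old 1024 limit until they are
    # restarted or live-migrated, e.g. with Nova:
    nova live-migration <instance-id>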
Re: [ceph-users] NFS interaction with RBD
Hi George

In order to experience the error it was enough to simply run mkfs.xfs on all the volumes.

In the meantime it became clear what the problem was:

~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files
...

This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files.

Regards
Christian

On 27 May 2015, at 16:23, Jens-Christian Fischer wrote:

> George,
>
> I will let Christian provide you the details. As far as I know, it was enough to just do a ‘ls’ on all of the attached drives.
>
> we are using Qemu 2.0:
>
> $ dpkg -l | grep qemu
> ii ipxe-qemu           1.0.0+git-2013.c3d1e78-2ubuntu1   all     PXE boot firmware - ROM images for qemu
> ii qemu-keymaps        2.0.0+dfsg-2ubuntu1.11            all     QEMU keyboard maps
> ii qemu-system         2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries
> ii qemu-system-arm     2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (arm)
> ii qemu-system-common  2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (common files)
> ii qemu-system-mips    2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (mips)
> ii qemu-system-misc    2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (miscelaneous)
> ii qemu-system-ppc     2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (ppc)
> ii qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (sparc)
> ii qemu-system-x86     2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (x86)
> ii qemu-utils          2.0.0+dfsg-2ubuntu1.11            amd64   QEMU utilities
>
> cheers
> jc
>
> --
> SWITCH
> Jens-Christian Fischer, Peta Solutions
> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
> phone +41 44 268 15 15, direct +41 44 268 15 71
> jens-christian.fisc...@switch.ch
> http://www.switch.ch
> http://www.switch.ch/stories
>
> On 26.05.2015, at 19:12, Georgios Dimitrakakis wrote:
>
>> Jens-Christian,
>>
>> how did you test that? Did you just tried to write to them simultaneously? Any other tests that one can perform to verify that?
>>
>> In our installation we have a VM with 30 RBD volumes mounted which are all exported via NFS to other VMs.
>> No one has complaint for the moment but the load/usage is very minimal.
>> If this problem really exists then very soon that the trial phase will be over we will have millions of complaints :-(
>>
>> What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm
>>
>> Best regards,
>>
>> George
>>
>>> I think we (i.e. Christian) found the problem:
>>>
>>> We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit.
>>>
>>> So no deep scrubbing etc, but simply to many connections…
>>>
>>> cheers
>>> jc
>>>
>>> --
>>> SWITCH
>>> Jens-Christian Fischer, Peta Solutions
>>> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
>>> phone +41 44 268 15 15, direct +41 44 268 15 71
>>> jens-christian.fisc...@switch.ch [3]
>>> http://www.switch.ch
>>> http://www.switch.ch/stories
>>>
>>> On 25.05.2015, at 06:02, Christian Balzer wrote:

Hello, lets compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). 
The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing. Ceph logs will only pipe up when things have been blocked for more than 30 seconds, NFS might take offense to lower values (or the accumulation of several distributed delays). You added 23 OSDs, tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, are these on new storage nodes (so could there be something different with those nodes?), how busy is your network, CPU. Running something like collectd to gather all ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify if something is going wrong and what it is in particular. Otherwise run atop on your
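If you need to lift the limit on a qemu process that is already running (and cannot live-migrate or restart it right away), newer util-linux versions can also raise RLIMIT_NOFILE in place. A sketch, using the qemu PID from Christian's example above (183016); note that prlimit is not available on older distributions such as CentOS 6 and must be run as root:

    # show the current nofile limit of the running qemu process
    prlimit --pid 183016 --nofile

    # raise the soft and hard limit to 8192 for that process only
    prlimit --pid 183016 --nofile=8192:8192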
Re: [ceph-users] NFS interaction with RBD
Hi George Well that’s strange. I wonder why our systems behave so differently. We’ve got: Hypervisors running on Ubuntu 14.04. VMs with 9 ceph volumes: 2TB each. XFS instead of your ext4 Maybe the number of placement groups plays a major role as well. Jens-Christian may be able to give you the specifics of our ceph cluster. I’m about to leave on vacation and don’t have time to look that up anymore. Best regards Christian On 29 May 2015, at 14:42, Georgios Dimitrakakis wrote: > All, > > I 've tried to recreate the issue without success! > > My configuration is the following: > > OS (Hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64) > QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 > Ceph: ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 20x4TB > OSDs equally distributed on two disk nodes, 3xMonitors > > > OpenStack Cinder has been configured to provide RBD Volumes from Ceph. > > I have created 10x 500GB Volumes which were then all attached at a single > Virtual Machine. > > All volumes were formatted two times for comparison reasons, one using > "mkfs.xfs" and one using "mkfs.ext4". > I did try to issue the commands all at the same time (or as possible to that). > > In both tests I didn't notice any interruption. It may took longer than just > doing one at a time but the system was continuously up and everything was > responding without the problem. > > At the time of these processes the open connections were 100 with one of the > OSD node and 111 with the other one. > > So I guess I am not experiencing the issue due to the low number of OSDs I am > having. Is my assumption correct? > > > Best regards, > > George > > > >> Thanks a million for the feedback Christian! >> >> I 've tried to recreate the issue with 10RBD Volumes mounted on a >> single server without success! >> >> I 've issued the "mkfs.xfs" command simultaneously (or at least as >> fast I could do it in different terminals) without noticing any >> problems. Can you please tell me what was the size of each one of the >> RBD Volumes cause I have a feeling that mine were two small, and if so >> I have to test it on our bigger cluster. >> >> I 've also thought that besides QEMU version it might also be >> important the underlying OS, so what was your testbed? >> >> >> All the best, >> >> George >> >>> Hi George >>> >>> In order to experience the error it was enough to simply run mkfs.xfs >>> on all the volumes. >>> >>> >>> In the meantime it became clear what the problem was: >>> >>> ~ ; cat /proc/183016/limits >>> ... >>> Max open files1024 4096 files >>> .. >>> >>> This can be changed by setting a decent value in >>> /etc/libvirt/qemu.conf for max_files. >>> >>> Regards >>> Christian >>> >>> >>> >>> On 27 May 2015, at 16:23, Jens-Christian Fischer >>> wrote: >>> George, I will let Christian provide you the details. As far as I know, it was enough to just do a ‘ls’ on all of the attached drives. 
we are using Qemu 2.0: $ dpkg -l | grep qemu ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu ii qemu-keymaps2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (arm) ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (common files) ii qemu-system-mips2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (mips) ii qemu-system-misc2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (miscelaneous) ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (ppc) ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (sparc) ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (x86) ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU utilities cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories On 26.05.2015, at 19:12, Georgios Dimitrakakis wrote: > Jens-Christian, > > how did you
Re: [ceph-users] NFS interaction with RBD
In the end this came down to one slow OSD. There were no hardware issues, so I have to assume something gummed up during rebalancing and peering.

I restarted the osd process after setting the cluster to noout. After the osd was restarted, the rebalance completed and the cluster returned to health ok. As soon as the osd restarted, all previously hanging operations returned to normal.

I'm surprised by a single slow OSD impacting access to the entire cluster. I understand now that only the primary osd is used for reads, and that writes must go to the primary and then to the secondaries, but I would have expected the impact to be more contained.

We currently build XFS file systems directly on RBD images. I'm wondering if there would be any value in using an LVM abstraction on top to spread access across other osds for read and failure scenarios.

Any thoughts on the above appreciated.

~jpr

On 05/28/2015 03:18 PM, John-Paul Robinson wrote:
> To follow up on the original post,
>
> Further digging indicates this is a problem with RBD image access and is not related to NFS-RBD interaction as initially suspected. The nfsd is simply hanging as a result of a hung request to the XFS file system mounted on our RBD-NFS gateway. This hung XFS call is caused by a problem with the RBD module interacting with our Ceph pool.
>
> I've found a reliable way to trigger a hang directly on an rbd image mapped into our RBD-NFS gateway box. The image contains an XFS file system. When I try to list the contents of a particular directory, the request hangs indefinitely.
>
> Two weeks ago our ceph status was:
>
> jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
>    health HEALTH_WARN 1 near full osd(s)
>    monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
>    osdmap e5978: 66 osds: 66 up, 66 in
>    pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
>    mdsmap e1: 0/0/1 up
>
> The near full osd was number 53 and we updated our crush map to reweight the osd. All of the OSDs had a weight of 1 based on the assumption that all osds were 2.0TB. Apparently one of our servers had the OSDs sized to 2.8TB and this caused the OSD imbalance even though we are only at 50% utilization. We reweighted the near full osd to .8 and that initiated a rebalance that has since relieved the 95% full condition on that OSD.
>
> However, since that time the repeering has not completed and we suspect this is causing problems with our access of RBD images. 
Our > current ceph status is: > > jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status >health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs > stuck unclean; recovery 9/23842120 degraded (0.000%) >monmap e1: 3 mons at > > {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, > election epoch 350, quorum 0,1,2 > da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0 >osdmap e6036: 66 osds: 66 up, 66 in > pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 > active+clean+scrubbing, 1 remapped+peering, 3 > active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 > GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%) >mdsmap e1: 0/0/1 up > > > Here are further details on our stuck pgs: > > jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg > dump_stuck inactive > ok > pg_stat objects mip degrunf bytes log disklog > state state_stamp v reportedup acting > last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp > 3.3af 11600 0 0 0 47941791744 153812 > 153812 remapped+peering2015-05-15 12:47:17.223786 > 5979'293066 6000'1248735 [48,62] [53,48,62] > 5979'293056 2015-05-15 07:40:36.275563 5979'293056 > 2015-05-15 07:40:36.275563 > > jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg > dump_stuck unclean > ok > pg_stat objects mip degrunf bytes log disklog > state state_stamp v reportedup acting > last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp > 3.106 11870 0 9 0 49010106368 163991 > 163991 active 2015-05-15 12:47:19.761469 6035'356332 > 5968'1358516 [62,53] [62,53] 5979'356242 2015-05-14 > 22:22:12.966150 5979'351351 2015-05-12 18:04:41.838686 > 5.104 0 0 0 0 0 0 0
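For anyone hitting the same symptom, the noout-plus-restart sequence John-Paul describes above is roughly the following; the osd id and the restart command are examples and depend on how your OSDs are managed (sysvinit vs. upstart etc.):

    # prevent the cluster from marking OSDs out while working on them
    ceph osd set noout

    # restart the suspect OSD daemon (osd.53 is just the example from this thread)
    sudo service ceph restart osd.53      # or: sudo /etc/init.d/ceph restart osd.53

    # once the cluster is back to HEALTH_OK, re-enable normal behaviour
    ceph osd unset noout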
Re: [ceph-users] NFS interaction with RBD
All, I 've tried to recreate the issue without success! My configuration is the following: OS (Hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64) QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 Ceph: ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 20x4TB OSDs equally distributed on two disk nodes, 3xMonitors OpenStack Cinder has been configured to provide RBD Volumes from Ceph. I have created 10x 500GB Volumes which were then all attached at a single Virtual Machine. All volumes were formatted two times for comparison reasons, one using "mkfs.xfs" and one using "mkfs.ext4". I did try to issue the commands all at the same time (or as possible to that). In both tests I didn't notice any interruption. It may took longer than just doing one at a time but the system was continuously up and everything was responding without the problem. At the time of these processes the open connections were 100 with one of the OSD node and 111 with the other one. So I guess I am not experiencing the issue due to the low number of OSDs I am having. Is my assumption correct? Best regards, George Thanks a million for the feedback Christian! I 've tried to recreate the issue with 10RBD Volumes mounted on a single server without success! I 've issued the "mkfs.xfs" command simultaneously (or at least as fast I could do it in different terminals) without noticing any problems. Can you please tell me what was the size of each one of the RBD Volumes cause I have a feeling that mine were two small, and if so I have to test it on our bigger cluster. I 've also thought that besides QEMU version it might also be important the underlying OS, so what was your testbed? All the best, George Hi George In order to experience the error it was enough to simply run mkfs.xfs on all the volumes. In the meantime it became clear what the problem was: ~ ; cat /proc/183016/limits ... Max open files1024 4096 files .. This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files. Regards Christian On 27 May 2015, at 16:23, Jens-Christian Fischer wrote: George, I will let Christian provide you the details. As far as I know, it was enough to just do a ‘ls’ on all of the attached drives. we are using Qemu 2.0: $ dpkg -l | grep qemu ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu ii qemu-keymaps2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (arm) ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (common files) ii qemu-system-mips2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (mips) ii qemu-system-misc2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (miscelaneous) ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (ppc) ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (sparc) ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (x86) ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64QEMU utilities cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories On 26.05.2015, at 19:12, Georgios Dimitrakakis wrote: Jens-Christian, how did you test that? 
Did you just tried to write to them simultaneously? Any other tests that one can perform to verify that? In our installation we have a VM with 30 RBD volumes mounted which are all exported via NFS to other VMs. No one has complaint for the moment but the load/usage is very minimal. If this problem really exists then very soon that the trial phase will be over we will have millions of complaints :-( What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm Best regards, George I think we (i.e. Christian) found the problem: We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc, but simply to
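The connection counts George mentions above (100 and 111 towards the two OSD nodes) can be reproduced on the hypervisor with something like the following; a rough sketch, run as root so that the qemu process name is visible (OSDs listen on ports in the 6800-7300 range by default):

    # established TCP connections from qemu processes, grouped by remote host
    netstat -tnp | awk '$6 == "ESTABLISHED" && /qemu/ {split($5, a, ":"); print a[1]}' | sort | uniq -c

Comparing those totals against the qemu process's nofile limit shows how much headroom is left before the 1024 FD ceiling is reached.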
Re: [ceph-users] NFS interaction with RBD
Thanks a million for the feedback Christian! I 've tried to recreate the issue with 10RBD Volumes mounted on a single server without success! I 've issued the "mkfs.xfs" command simultaneously (or at least as fast I could do it in different terminals) without noticing any problems. Can you please tell me what was the size of each one of the RBD Volumes cause I have a feeling that mine were two small, and if so I have to test it on our bigger cluster. I 've also thought that besides QEMU version it might also be important the underlying OS, so what was your testbed? All the best, George Hi George In order to experience the error it was enough to simply run mkfs.xfs on all the volumes. In the meantime it became clear what the problem was: ~ ; cat /proc/183016/limits ... Max open files1024 4096 files .. This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files. Regards Christian On 27 May 2015, at 16:23, Jens-Christian Fischer wrote: George, I will let Christian provide you the details. As far as I know, it was enough to just do a ‘ls’ on all of the attached drives. we are using Qemu 2.0: $ dpkg -l | grep qemu ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu ii qemu-keymaps2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (arm) ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (common files) ii qemu-system-mips2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (mips) ii qemu-system-misc2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (miscelaneous) ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (ppc) ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (sparc) ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (x86) ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64QEMU utilities cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories On 26.05.2015, at 19:12, Georgios Dimitrakakis wrote: Jens-Christian, how did you test that? Did you just tried to write to them simultaneously? Any other tests that one can perform to verify that? In our installation we have a VM with 30 RBD volumes mounted which are all exported via NFS to other VMs. No one has complaint for the moment but the load/usage is very minimal. If this problem really exists then very soon that the trial phase will be over we will have millions of complaints :-( What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm Best regards, George I think we (i.e. Christian) found the problem: We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc, but simply to many connections… cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. 
Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch [3] http://www.switch.ch http://www.switch.ch/stories On 25.05.2015, at 06:02, Christian Balzer wrote: Hello, lets compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing. Ceph logs will only pipe up when things have been blocked for more than 30 seconds, NFS might take offense to lower values (or the accumulation of several distributed delays). You added 23 OSDs, tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, are these on new storage nodes (so could there be something different with those nodes?), how busy is your network, CPU. Running something like collectd to
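To hit all volumes at (almost) exactly the same time, rather than switching between terminals as George describes above, the mkfs runs can simply be backgrounded; a sketch, where the device names are placeholders for however the RBD volumes appear inside the VM:

    # format all attached volumes in parallel (this destroys any data on them)
    for dev in /dev/vd{b..k}; do
        mkfs.xfs -f "$dev" &
    done
    wait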
Re: [ceph-users] NFS interaction with RBD
To follow up on the original post, Further digging indicates this is a problem with RBD image access and is not related to NFS-RBD interaction as initially suspected. The nfsd is simply hanging as a result of a hung request to the XFS file system mounted on our RBD-NFS gateway.This hung XFS call is caused by a problem with the RBD module interacting with our Ceph pool. I've found a reliable way to trigger a hang directly on an rbd image mapped into our RBD-NFS gateway box. The image contains an XFS file system. When I try to list the contents of a particular directory, the request hangs indefinitely. Two weeks ago our ceph status was: jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status health HEALTH_WARN 1 near full osd(s) monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0 osdmap e5978: 66 osds: 66 up, 66 in pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s mdsmap e1: 0/0/1 up The near full osd was number 53 and we updated our crush map to rewieght the osd. All of the OSDs had a weight of 1 based on the assumption that all osds were 2.0TB. Apparently one of our severs had the OSDs Sized to 2.8TB and this caused the OSD imbalance eventhough we are only at 50% utilization. We reweighted the near full osd to .8 and that initiated a rebalance that has since relieved the 95% full condition on that OSD. However, since that time the repeering has not completed and we suspect this is causing problems with our access of RBD images. Our current ceph status is: jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%) monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0 osdmap e6036: 66 osds: 66 up, 66 in pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 active+clean+scrubbing, 1 remapped+peering, 3 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%) mdsmap e1: 0/0/1 up Here are further details on our stuck pgs: jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck inactive ok pg_stat objects mip degrunf bytes log disklog state state_stamp v reportedup acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck unclean ok pg_stat objects mip degrunf bytes log disklog state state_stamp v reportedup acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 3.106 11870 0 9 0 49010106368 163991 163991 active 2015-05-15 12:47:19.761469 6035'356332 5968'1358516 [62,53] [62,53] 5979'356242 2015-05-14 22:22:12.966150 5979'351351 2015-05-12 18:04:41.838686 5.104 0 0 0 0 0 0 0 active 2015-05-15 12:47:19.800676 0'0 5968'1615 [62,53] [62,53] 0'0 2015-05-14 18:43:22.425105 0'0 2015-05-08 10:19:54.938934 4.105 0 0 0 0 0 0 0 
active 2015-05-15 12:47:19.801028 0'0 5968'1615 [62,53] [62,53] 0'0 2015-05-14 18:43:04.434826 0'0 2015-05-14 18:43:04.434826 3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563 The servers in the pool are not overloaded. On the ceph server that originally had the nearly full osd, (osd 53), I'm seeing entries like this in the osd log: 2015-05-28 06:25:02.900129 7f2ea8a4f700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 1096430.805069 secs 2015-05-28 06:25:02.900145 7f2ea8a4f700 0 log [WRN] : slow request 1096
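For a pg that stays in remapped+peering like 3.3af above, the usual next step is to ask the cluster why it is stuck; a sketch using standard commands (the pg id and osd id are the ones from this thread):

    # overall detail, including stuck/unclean pgs and slow request warnings
    ceph health detail

    # ask the pg itself what it is waiting for (peering state, blocking OSDs)
    ceph pg 3.3af query

    # on newer releases, the admin socket on the OSD's host can also show in-flight ops
    # (run on the node hosting osd.53)
    ceph daemon osd.53 dump_ops_in_flight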
Re: [ceph-users] NFS interaction with RBD
Jens-Christian Fischer writes:
> I think we (i.e. Christian) found the problem:
> We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit.
>
> So no deep scrubbing etc, but simply too many connections…

Have seen mention of similar from CERN in their presentations, found this post on a quick google.. might help?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html

Cheers,
Trent
Re: [ceph-users] NFS interaction with RBD
George, I will let Christian provide you the details. As far as I know, it was enough to just do a ‘ls’ on all of the attached drives. we are using Qemu 2.0: $ dpkg -l | grep qemu ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu ii qemu-keymaps2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (arm) ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (common files) ii qemu-system-mips2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (mips) ii qemu-system-misc2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (miscelaneous) ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (ppc) ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (sparc) ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (x86) ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64QEMU utilities cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories On 26.05.2015, at 19:12, Georgios Dimitrakakis wrote: > Jens-Christian, > > how did you test that? Did you just tried to write to them simultaneously? > Any other tests that one can perform to verify that? > > In our installation we have a VM with 30 RBD volumes mounted which are all > exported via NFS to other VMs. > No one has complaint for the moment but the load/usage is very minimal. > If this problem really exists then very soon that the trial phase will be > over we will have millions of complaints :-( > > What version of QEMU are you using? We are using the one provided by Ceph in > qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm > > Best regards, > > George > >> I think we (i.e. Christian) found the problem: >> >> We created a test VM with 9 mounted RBD volumes (no NFS server). As >> soon as he hit all disks, we started to experience these 120 second >> timeouts. We realized that the QEMU process on the hypervisor is >> opening a TCP connection to every OSD for every mounted volume - >> exceeding the 1024 FD limit. >> >> So no deep scrubbing etc, but simply to many connections… >> >> cheers >> jc >> >> -- >> SWITCH >> Jens-Christian Fischer, Peta Solutions >> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland >> phone +41 44 268 15 15, direct +41 44 268 15 71 >> jens-christian.fisc...@switch.ch [3] >> http://www.switch.ch >> >> http://www.switch.ch/stories >> >> On 25.05.2015, at 06:02, Christian Balzer wrote: >> >>> Hello, >>> >>> lets compare your case with John-Paul's. >>> >>> Different OS and Ceph versions (thus we can assume different NFS >>> versions >>> as well). >>> The only common thing is that both of you added OSDs and are likely >>> suffering from delays stemming from Ceph re-balancing or >>> deep-scrubbing. >>> >>> Ceph logs will only pipe up when things have been blocked for more >>> than 30 >>> seconds, NFS might take offense to lower values (or the accumulation >>> of >>> several distributed delays). >>> >>> You added 23 OSDs, tell us more about your cluster, HW, network. 
>>> Were these added to the existing 16 nodes, are these on new storage >>> nodes >>> (so could there be something different with those nodes?), how busy >>> is your >>> network, CPU. >>> Running something like collectd to gather all ceph perf data and >>> other >>> data from the storage nodes and then feeding it to graphite (or >>> similar) >>> can be VERY helpful to identify if something is going wrong and what >>> it is >>> in particular. >>> Otherwise run atop on your storage nodes to identify if CPU, >>> network, >>> specific HDDs/OSDs are bottlenecks. >>> >>> Deep scrubbing can be _very_ taxing, do your problems persist if >>> inject >>> into your running cluster an "osd_scrub_sleep" value of "0.5" (lower >>> that >>> until it hurts again) or if you turn of deep scrubs altogether for >>> the >>> moment? >>> >>> Christian >>> >>> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote: >>> We see something very similar on our Ceph cluster, starting as of today. We use a 16 node, 102 OSD Ceph installation as the basis for an >>>
Re: [ceph-users] NFS interaction with RBD
Jens-Christian, how did you test that? Did you just tried to write to them simultaneously? Any other tests that one can perform to verify that? In our installation we have a VM with 30 RBD volumes mounted which are all exported via NFS to other VMs. No one has complaint for the moment but the load/usage is very minimal. If this problem really exists then very soon that the trial phase will be over we will have millions of complaints :-( What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm Best regards, George I think we (i.e. Christian) found the problem: We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc, but simply to many connections… cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch [3] http://www.switch.ch http://www.switch.ch/stories On 25.05.2015, at 06:02, Christian Balzer wrote: Hello, lets compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing. Ceph logs will only pipe up when things have been blocked for more than 30 seconds, NFS might take offense to lower values (or the accumulation of several distributed delays). You added 23 OSDs, tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, are these on new storage nodes (so could there be something different with those nodes?), how busy is your network, CPU. Running something like collectd to gather all ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify if something is going wrong and what it is in particular. Otherwise run atop on your storage nodes to identify if CPU, network, specific HDDs/OSDs are bottlenecks. Deep scrubbing can be _very_ taxing, do your problems persist if inject into your running cluster an "osd_scrub_sleep" value of "0.5" (lower that until it hurts again) or if you turn of deep scrubs altogether for the moment? Christian On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote: We see something very similar on our Ceph cluster, starting as of today. We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc) On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting 6 2TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server with a 100TB volume, but that has become unwieldy). All of the servers (hypervisors, ceph storage nodes and VMs) are using Ubuntu 14.04 Yesterday evening we added 23 ODSs to the cluster bringing it up to 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark). 
The rebalancing process ended this morning (after around 12 hours) The cluster has been clean since then: cluster b1f3f4c8-x health HEALTH_OK monmap e2: 3 mons at {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319 active+clean 17 active+clean+scrubbing+deep client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s At midnight, we run a script that creates an RBD snapshot of all RBD volumes that are attached to the NFS servers (for backup purposes). Looking at our monitoring, around that time, one of the NFS servers became unresponsive and took down the complete ownCloud installation (load on the web server was > 200 and they had lost some of the NFS mounts) Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long after having run between 10 to 90 minutes. We initially suspected a corrupt rbd volume (as it seemed that we could trigger the kernel crash by just “ls -l” one of the volumes, but subsequent “xfs_repair -n” checks on those RBD volumes showed no problems. We migrated the NFS server off of its hypervisor, suspecting a problem with RBD kernel modules, rebooted the hypervisor but the problem persisted (both on the new hypervisor, and on the old one when we migrated it back) We changed the /etc/default/nfs-kernel-server to start up 256 servers (even though the defaults had been working fine for over a year) O
Re: [ceph-users] NFS interaction with RBD
I think we (i.e. Christian) found the problem: We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc, but simply to many connections… cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories On 25.05.2015, at 06:02, Christian Balzer wrote: > > Hello, > > lets compare your case with John-Paul's. > > Different OS and Ceph versions (thus we can assume different NFS versions > as well). > The only common thing is that both of you added OSDs and are likely > suffering from delays stemming from Ceph re-balancing or deep-scrubbing. > > Ceph logs will only pipe up when things have been blocked for more than 30 > seconds, NFS might take offense to lower values (or the accumulation of > several distributed delays). > > You added 23 OSDs, tell us more about your cluster, HW, network. > Were these added to the existing 16 nodes, are these on new storage nodes > (so could there be something different with those nodes?), how busy is your > network, CPU. > Running something like collectd to gather all ceph perf data and other > data from the storage nodes and then feeding it to graphite (or similar) > can be VERY helpful to identify if something is going wrong and what it is > in particular. > Otherwise run atop on your storage nodes to identify if CPU, network, > specific HDDs/OSDs are bottlenecks. > > Deep scrubbing can be _very_ taxing, do your problems persist if inject > into your running cluster an "osd_scrub_sleep" value of "0.5" (lower that > until it hurts again) or if you turn of deep scrubs altogether for the > moment? > > Christian > > On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote: > >> We see something very similar on our Ceph cluster, starting as of today. >> >> We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse >> OpenStack cluster (we applied the RBD patches for live migration etc) >> >> On this cluster we have a big ownCloud installation (Sync & Share) that >> stores its files on three NFS servers, each mounting 6 2TB RBD volumes >> and exposing them to around 10 web server VMs (we originally started >> with one NFS server with a 100TB volume, but that has become unwieldy). >> All of the servers (hypervisors, ceph storage nodes and VMs) are using >> Ubuntu 14.04 >> >> Yesterday evening we added 23 ODSs to the cluster bringing it up to 125 >> OSDs (because we had 4 OSDs that were nearing the 90% full mark). 
The >> rebalancing process ended this morning (after around 12 hours) The >> cluster has been clean since then: >> >>cluster b1f3f4c8-x >> health HEALTH_OK >> monmap e2: 3 mons at >> {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0}, >> election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap >> e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools, >> 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319 >> active+clean 17 active+clean+scrubbing+deep >> client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s >> >> At midnight, we run a script that creates an RBD snapshot of all RBD >> volumes that are attached to the NFS servers (for backup purposes). >> Looking at our monitoring, around that time, one of the NFS servers >> became unresponsive and took down the complete ownCloud installation >> (load on the web server was > 200 and they had lost some of the NFS >> mounts) >> >> Rebooting the NFS server solved that problem, but the NFS kernel server >> kept crashing all day long after having run between 10 to 90 minutes. >> >> We initially suspected a corrupt rbd volume (as it seemed that we could >> trigger the kernel crash by just “ls -l” one of the volumes, but >> subsequent “xfs_repair -n” checks on those RBD volumes showed no >> problems. >> >> We migrated the NFS server off of its hypervisor, suspecting a problem >> with RBD kernel modules, rebooted the hypervisor but the problem >> persisted (both on the new hypervisor, and on the old one when we >> migrated it back) >> >> We changed the /etc/default/nfs-kernel-server to start up 256 servers >> (even though the defaults had been working fine for over a year) >> >> Only one of our 3 NFS servers crashes (see below for syslog information) >> - the other 2 have been fine >> >> May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: >> Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory May >> 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second >> grace period (net 81cdab00) May 23 21:44:23 dri
Re: [ceph-users] NFS interaction with RBD
Hello, lets compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing. Ceph logs will only pipe up when things have been blocked for more than 30 seconds, NFS might take offense to lower values (or the accumulation of several distributed delays). You added 23 OSDs, tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, are these on new storage nodes (so could there be something different with those nodes?), how busy is your network, CPU. Running something like collectd to gather all ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify if something is going wrong and what it is in particular. Otherwise run atop on your storage nodes to identify if CPU, network, specific HDDs/OSDs are bottlenecks. Deep scrubbing can be _very_ taxing, do your problems persist if inject into your running cluster an "osd_scrub_sleep" value of "0.5" (lower that until it hurts again) or if you turn of deep scrubs altogether for the moment? Christian On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote: > We see something very similar on our Ceph cluster, starting as of today. > > We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse > OpenStack cluster (we applied the RBD patches for live migration etc) > > On this cluster we have a big ownCloud installation (Sync & Share) that > stores its files on three NFS servers, each mounting 6 2TB RBD volumes > and exposing them to around 10 web server VMs (we originally started > with one NFS server with a 100TB volume, but that has become unwieldy). > All of the servers (hypervisors, ceph storage nodes and VMs) are using > Ubuntu 14.04 > > Yesterday evening we added 23 ODSs to the cluster bringing it up to 125 > OSDs (because we had 4 OSDs that were nearing the 90% full mark). The > rebalancing process ended this morning (after around 12 hours) The > cluster has been clean since then: > > cluster b1f3f4c8-x > health HEALTH_OK > monmap e2: 3 mons at > {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0}, > election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap > e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools, > 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319 > active+clean 17 active+clean+scrubbing+deep > client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s > > At midnight, we run a script that creates an RBD snapshot of all RBD > volumes that are attached to the NFS servers (for backup purposes). > Looking at our monitoring, around that time, one of the NFS servers > became unresponsive and took down the complete ownCloud installation > (load on the web server was > 200 and they had lost some of the NFS > mounts) > > Rebooting the NFS server solved that problem, but the NFS kernel server > kept crashing all day long after having run between 10 to 90 minutes. > > We initially suspected a corrupt rbd volume (as it seemed that we could > trigger the kernel crash by just “ls -l” one of the volumes, but > subsequent “xfs_repair -n” checks on those RBD volumes showed no > problems. 
> > We migrated the NFS server off of its hypervisor, suspecting a problem > with RBD kernel modules, rebooted the hypervisor but the problem > persisted (both on the new hypervisor, and on the old one when we > migrated it back) > > We changed the /etc/default/nfs-kernel-server to start up 256 servers > (even though the defaults had been working fine for over a year) > > Only one of our 3 NFS servers crashes (see below for syslog information) > - the other 2 have been fine > > May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: > Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory May > 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second > grace period (net 81cdab00) May 23 21:44:23 drive-nfs1 > rpc.mountd[1724]: Version 1.2.8 starting May 23 21:44:28 drive-nfs1 > kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team May > 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 > (16384 buckets, 65536 max) May 23 21:44:28 drive-nfs1 kernel: > [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team May 23 > 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > > /dev/null && debian-sa1 1 1) May 23 21:45:17 drive-nfs1 > > collectd[1872]: python: Plugin loaded but not configured. May 23 > > 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering > > read-loop. May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: > > plymouth-upstart-bridge main process ended, respawning May 23 21:51:26 > > drive-nfs1 kernel: [ 600.776177] INFO:
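Christian's suggestion above about throttling or pausing scrubs can be tried at runtime without restarting anything; a sketch (the noscrub/nodeep-scrub flags are available on reasonably recent releases):

    # throttle scrubbing cluster-wide (seconds of sleep between scrub chunks)
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'

    # or temporarily stop (deep) scrubbing altogether while debugging
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # re-enable when done
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub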
Re: [ceph-users] NFS interaction with RBD
We see something very similar on our Ceph cluster, starting as of today. We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc) On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting 6 2TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server with a 100TB volume, but that has become unwieldy). All of the servers (hypervisors, ceph storage nodes and VMs) are using Ubuntu 14.04 Yesterday evening we added 23 ODSs to the cluster bringing it up to 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The rebalancing process ended this morning (after around 12 hours) The cluster has been clean since then: cluster b1f3f4c8-x health HEALTH_OK monmap e2: 3 mons at {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319 active+clean 17 active+clean+scrubbing+deep client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s At midnight, we run a script that creates an RBD snapshot of all RBD volumes that are attached to the NFS servers (for backup purposes). Looking at our monitoring, around that time, one of the NFS servers became unresponsive and took down the complete ownCloud installation (load on the web server was > 200 and they had lost some of the NFS mounts) Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long after having run between 10 to 90 minutes. We initially suspected a corrupt rbd volume (as it seemed that we could trigger the kernel crash by just “ls -l” one of the volumes, but subsequent “xfs_repair -n” checks on those RBD volumes showed no problems. We migrated the NFS server off of its hypervisor, suspecting a problem with RBD kernel modules, rebooted the hypervisor but the problem persisted (both on the new hypervisor, and on the old one when we migrated it back) We changed the /etc/default/nfs-kernel-server to start up 256 servers (even though the defaults had been working fine for over a year) Only one of our 3 NFS servers crashes (see below for syslog information) - the other 2 have been fine May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net 81cdab00) May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max) May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured. May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop. May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds. 
May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsdD 88013fd93180 0 1696 2 0x May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] 8800b2391c50 0046 8800b22f9800 8800b2391fd8 May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 00013180 00013180 8800b22f9800 880035f48240 May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] 880035f48244 8800b22f9800 880035f48248 May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace: May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [] schedule_preempt_disabled+0x29/0x70 May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [] __mutex_lock_slowpath+0x135/0x1b0 May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [] mutex_lock+0x1f/0x2f May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [] nfsd_lookup_dentry+0xa1/0x490 [nfsd] May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [] ? fh_verify+0x14b/0x5e0 [nfsd] May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [] nfsd_lookup+0x69/0x130 [nfsd] May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [] nfsd
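The 120 second figure in these traces is the kernel's hung-task timeout; when the NFS gateway locks up like this, it can help to capture the full set of blocked tasks at that moment. A sketch, run as root inside the affected machine (the sysrq dump requires kernel.sysrq to permit it):

    # the hung-task timeout currently in effect (120s by default)
    cat /proc/sys/kernel/hung_task_timeout_secs

    # dump stacks of all blocked (D-state) tasks into the kernel log, then read it
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 100

    # list processes currently stuck in uninterruptible sleep and what they wait on
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'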