Re: [ceph-users] NFS interaction with RBD
Christian Schnidrig writes:
> Well that’s strange. I wonder why our systems behave so differently.

One point about our cluster (I work with Christian, who's still on vacation, and with Jens-Christian) is that it has 124 OSDs and 2048 PGs (I think) in the pool used for these RBD volumes. As a result, each connected RBD volume can result in up to 124 connections from the RBD client inside Qemu/KVM to the OSDs that store data from that volume.

I don't know the details of librbd's connection management. I assume that these librbd-to-OSD connections are only created once the client actually tries to access data on a given OSD. But when the VM actually touches a lot of data on its RBD volumes (which ours do), most of these connections will indeed be created. And librbd apparently doesn't handle the situation very gracefully when its process runs into the open file descriptor limit.

George only has 20 OSDs, so that is an upper bound on the number of TCP connections that librbd will open per RBD volume. He should be safe up to about 50 volumes per VM, assuming the default nofile limit of 1024.

The nasty thing is when everything has been running fine for ages, then you add a bunch of OSDs, run a few benchmarks, see that everything should run much BETTER (as promised :-), and then suddenly some VMs with lots of mounted volumes mysteriously start hanging.

> Maybe the number of placement groups plays a major role as well. Jens-Christian may be able to give you the specifics of our ceph cluster.

Me too, see above.

> I’m about to leave on vacation and don’t have time to look that up anymore.

Enjoy your well-earned vacation!!
--
Simon.
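A quick way to tell whether a particular VM is getting close to this limit is to compare the qemu process's current file descriptor count with its nofile limit on the hypervisor. A rough sketch (run as root; the instance name below is only a placeholder):

    # find the qemu process for the instance in question (name is an example)
    PID=$(pgrep -f 'qemu.*instance-00000042' | head -n 1)

    # how many file descriptors it has open right now, and how many are sockets
    ls /proc/$PID/fd | wc -l
    ls -l /proc/$PID/fd | grep -c socket

    # the limit the process is actually running with
    grep 'Max open files' /proc/$PID/limits

If the first number is anywhere near the soft limit, each additional volume (or each additional OSD the client has to talk to) can push it over.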
Re: [ceph-users] NFS interaction with RBD
Trent Lloyd writes:
> Jens-Christian Fischer writes:
>> I think we (i.e. Christian) found the problem:
>> We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit.
>>
>> So no deep scrubbing etc, but simply too many connections…
>
> Have seen mention of similar from CERN in their presentations, found this post on a quick google.. might help?
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html

Yes, that's exactly the problem that we had. We solved it by setting max_files to 8191 in /etc/libvirt/qemu.conf on all compute hosts. Once that was applied, we live-migrated the running instances so that they too could pick up the increased limit.
--
Simon.
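For reference, this is roughly what the change looks like; a sketch only, and the restart/migration commands assume an Ubuntu compute host running OpenStack Nova (the service is called libvirtd on other distributions):

    # /etc/libvirt/qemu.conf on every compute host
    max_files = 8191

    # restart libvirt so newly started qemu processes inherit the new limit
    service libvirt-bin restart

    # instances that are already running keep the old 1024 limit until they are
    # restarted or live-migrated, e.g. with Nova:
    nova live-migration <instance-id>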
Re: [ceph-users] NFS interaction with RBD
Hi George

In order to experience the error it was enough to simply run mkfs.xfs on all the volumes.

In the meantime it became clear what the problem was:

~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files
...

This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files.

Regards
Christian

On 27 May 2015, at 16:23, Jens-Christian Fischer wrote:

> George,
>
> I will let Christian provide you the details. As far as I know, it was enough to just do a ‘ls’ on all of the attached drives.
>
> we are using Qemu 2.0:
>
> $ dpkg -l | grep qemu
> ii ipxe-qemu           1.0.0+git-2013.c3d1e78-2ubuntu1   all     PXE boot firmware - ROM images for qemu
> ii qemu-keymaps        2.0.0+dfsg-2ubuntu1.11            all     QEMU keyboard maps
> ii qemu-system         2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries
> ii qemu-system-arm     2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (arm)
> ii qemu-system-common  2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (common files)
> ii qemu-system-mips    2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (mips)
> ii qemu-system-misc    2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (miscelaneous)
> ii qemu-system-ppc     2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (ppc)
> ii qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (sparc)
> ii qemu-system-x86     2.0.0+dfsg-2ubuntu1.11            amd64   QEMU full system emulation binaries (x86)
> ii qemu-utils          2.0.0+dfsg-2ubuntu1.11            amd64   QEMU utilities
>
> cheers
> jc
>
> --
> SWITCH
> Jens-Christian Fischer, Peta Solutions
> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
> phone +41 44 268 15 15, direct +41 44 268 15 71
> jens-christian.fisc...@switch.ch
> http://www.switch.ch
> http://www.switch.ch/stories
>
> On 26.05.2015, at 19:12, Georgios Dimitrakakis wrote:
>
>> Jens-Christian,
>>
>> how did you test that? Did you just tried to write to them simultaneously? Any other tests that one can perform to verify that?
>>
>> In our installation we have a VM with 30 RBD volumes mounted which are all exported via NFS to other VMs.
>> No one has complaint for the moment but the load/usage is very minimal.
>> If this problem really exists then very soon that the trial phase will be over we will have millions of complaints :-(
>>
>> What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm
>>
>> Best regards,
>>
>> George
>>
>>> I think we (i.e. Christian) found the problem:
>>>
>>> We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit.
>>>
>>> So no deep scrubbing etc, but simply to many connections…
>>>
>>> cheers
>>> jc
>>>
>>> --
>>> SWITCH
>>> Jens-Christian Fischer, Peta Solutions
>>> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
>>> phone +41 44 268 15 15, direct +41 44 268 15 71
>>> jens-christian.fisc...@switch.ch [3]
>>> http://www.switch.ch
>>> http://www.switch.ch/stories
>>>
>>> On 25.05.2015, at 06:02, Christian Balzer wrote:

Hello, lets compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). 
The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing. Ceph logs will only pipe up when things have been blocked for more than 30 seconds, NFS might take offense to lower values (or the accumulation of several distributed delays). You added 23 OSDs, tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, are these on new storage nodes (so could there be something different with those nodes?), how busy is your network, CPU. Running something like collectd to gather all ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify if something is going wrong and what it is in particular. Otherwise run atop on your
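If you need to lift the limit on a qemu process that is already running (and cannot live-migrate or restart it right away), newer util-linux versions can also raise RLIMIT_NOFILE in place. A sketch, using the qemu PID from Christian's example above (183016); note that prlimit is not available on older distributions such as CentOS 6 and must be run as root:

    # show the current nofile limit of the running qemu process
    prlimit --pid 183016 --nofile

    # raise the soft and hard limit to 8192 for that process only
    prlimit --pid 183016 --nofile=8192:8192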
Re: [ceph-users] NFS interaction with RBD
Hi George Well that’s strange. I wonder why our systems behave so differently. We’ve got: Hypervisors running on Ubuntu 14.04. VMs with 9 ceph volumes: 2TB each. XFS instead of your ext4 Maybe the number of placement groups plays a major role as well. Jens-Christian may be able to give you the specifics of our ceph cluster. I’m about to leave on vacation and don’t have time to look that up anymore. Best regards Christian On 29 May 2015, at 14:42, Georgios Dimitrakakis wrote: > All, > > I 've tried to recreate the issue without success! > > My configuration is the following: > > OS (Hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64) > QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 > Ceph: ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 20x4TB > OSDs equally distributed on two disk nodes, 3xMonitors > > > OpenStack Cinder has been configured to provide RBD Volumes from Ceph. > > I have created 10x 500GB Volumes which were then all attached at a single > Virtual Machine. > > All volumes were formatted two times for comparison reasons, one using > "mkfs.xfs" and one using "mkfs.ext4". > I did try to issue the commands all at the same time (or as possible to that). > > In both tests I didn't notice any interruption. It may took longer than just > doing one at a time but the system was continuously up and everything was > responding without the problem. > > At the time of these processes the open connections were 100 with one of the > OSD node and 111 with the other one. > > So I guess I am not experiencing the issue due to the low number of OSDs I am > having. Is my assumption correct? > > > Best regards, > > George > > > >> Thanks a million for the feedback Christian! >> >> I 've tried to recreate the issue with 10RBD Volumes mounted on a >> single server without success! >> >> I 've issued the "mkfs.xfs" command simultaneously (or at least as >> fast I could do it in different terminals) without noticing any >> problems. Can you please tell me what was the size of each one of the >> RBD Volumes cause I have a feeling that mine were two small, and if so >> I have to test it on our bigger cluster. >> >> I 've also thought that besides QEMU version it might also be >> important the underlying OS, so what was your testbed? >> >> >> All the best, >> >> George >> >>> Hi George >>> >>> In order to experience the error it was enough to simply run mkfs.xfs >>> on all the volumes. >>> >>> >>> In the meantime it became clear what the problem was: >>> >>> ~ ; cat /proc/183016/limits >>> ... >>> Max open files1024 4096 files >>> .. >>> >>> This can be changed by setting a decent value in >>> /etc/libvirt/qemu.conf for max_files. >>> >>> Regards >>> Christian >>> >>> >>> >>> On 27 May 2015, at 16:23, Jens-Christian Fischer >>> wrote: >>> George, I will let Christian provide you the details. As far as I know, it was enough to just do a ‘ls’ on all of the attached drives. 
we are using Qemu 2.0: $ dpkg -l | grep qemu ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu ii qemu-keymaps2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (arm) ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (common files) ii qemu-system-mips2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (mips) ii qemu-system-misc2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (miscelaneous) ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (ppc) ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (sparc) ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (x86) ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU utilities cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories On 26.05.2015, at 19:12, Georgios Dimitrakakis wrote: > Jens-Christian, > > how did you
Re: [ceph-users] NFS interaction with RBD
In the end this came down to one slow OSD. There were no hardware issues, so I have to assume something gummed up during rebalancing and peering.

I restarted the osd process after setting the cluster to noout. After the osd was restarted, the rebalance completed and the cluster returned to health ok. As soon as the osd restarted, all previously hanging operations returned to normal.

I'm surprised by a single slow OSD impacting access to the entire cluster. I understand now that only the primary osd is used for reads, and that writes must go to the primary and then to the secondaries, but I would have expected the impact to be more contained.

We currently build XFS file systems directly on RBD images. I'm wondering if there would be any value in using an LVM abstraction on top to spread access across other osds for read and failure scenarios.

Any thoughts on the above appreciated.

~jpr

On 05/28/2015 03:18 PM, John-Paul Robinson wrote:
> To follow up on the original post,
>
> Further digging indicates this is a problem with RBD image access and is not related to NFS-RBD interaction as initially suspected. The nfsd is simply hanging as a result of a hung request to the XFS file system mounted on our RBD-NFS gateway. This hung XFS call is caused by a problem with the RBD module interacting with our Ceph pool.
>
> I've found a reliable way to trigger a hang directly on an rbd image mapped into our RBD-NFS gateway box. The image contains an XFS file system. When I try to list the contents of a particular directory, the request hangs indefinitely.
>
> Two weeks ago our ceph status was:
>
> jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
>    health HEALTH_WARN 1 near full osd(s)
>    monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
>    osdmap e5978: 66 osds: 66 up, 66 in
>    pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
>    mdsmap e1: 0/0/1 up
>
> The near full osd was number 53 and we updated our crush map to reweight the osd. All of the OSDs had a weight of 1 based on the assumption that all osds were 2.0TB. Apparently one of our servers had the OSDs sized to 2.8TB and this caused the OSD imbalance even though we are only at 50% utilization. We reweighted the near full osd to .8 and that initiated a rebalance that has since relieved the 95% full condition on that OSD.
>
> However, since that time the repeering has not completed and we suspect this is causing problems with our access of RBD images. 
Our > current ceph status is: > > jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status >health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs > stuck unclean; recovery 9/23842120 degraded (0.000%) >monmap e1: 3 mons at > > {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, > election epoch 350, quorum 0,1,2 > da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0 >osdmap e6036: 66 osds: 66 up, 66 in > pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 > active+clean+scrubbing, 1 remapped+peering, 3 > active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 > GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%) >mdsmap e1: 0/0/1 up > > > Here are further details on our stuck pgs: > > jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg > dump_stuck inactive > ok > pg_stat objects mip degrunf bytes log disklog > state state_stamp v reportedup acting > last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp > 3.3af 11600 0 0 0 47941791744 153812 > 153812 remapped+peering2015-05-15 12:47:17.223786 > 5979'293066 6000'1248735 [48,62] [53,48,62] > 5979'293056 2015-05-15 07:40:36.275563 5979'293056 > 2015-05-15 07:40:36.275563 > > jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg > dump_stuck unclean > ok > pg_stat objects mip degrunf bytes log disklog > state state_stamp v reportedup acting > last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp > 3.106 11870 0 9 0 49010106368 163991 > 163991 active 2015-05-15 12:47:19.761469 6035'356332 > 5968'1358516 [62,53] [62,53] 5979'356242 2015-05-14 > 22:22:12.966150 5979'351351 2015-05-12 18:04:41.838686 > 5.104 0 0 0 0 0 0 0
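For anyone hitting the same symptom, the noout-plus-restart sequence John-Paul describes above is roughly the following; the osd id and the restart command are examples and depend on how your OSDs are managed (sysvinit vs. upstart etc.):

    # prevent the cluster from marking OSDs out while working on them
    ceph osd set noout

    # restart the suspect OSD daemon (osd.53 is just the example from this thread)
    sudo service ceph restart osd.53      # or: sudo /etc/init.d/ceph restart osd.53

    # once the cluster is back to HEALTH_OK, re-enable normal behaviour
    ceph osd unset noout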
Re: [ceph-users] NFS interaction with RBD
All, I 've tried to recreate the issue without success! My configuration is the following: OS (Hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64) QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64 Ceph: ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 20x4TB OSDs equally distributed on two disk nodes, 3xMonitors OpenStack Cinder has been configured to provide RBD Volumes from Ceph. I have created 10x 500GB Volumes which were then all attached at a single Virtual Machine. All volumes were formatted two times for comparison reasons, one using "mkfs.xfs" and one using "mkfs.ext4". I did try to issue the commands all at the same time (or as possible to that). In both tests I didn't notice any interruption. It may took longer than just doing one at a time but the system was continuously up and everything was responding without the problem. At the time of these processes the open connections were 100 with one of the OSD node and 111 with the other one. So I guess I am not experiencing the issue due to the low number of OSDs I am having. Is my assumption correct? Best regards, George Thanks a million for the feedback Christian! I 've tried to recreate the issue with 10RBD Volumes mounted on a single server without success! I 've issued the "mkfs.xfs" command simultaneously (or at least as fast I could do it in different terminals) without noticing any problems. Can you please tell me what was the size of each one of the RBD Volumes cause I have a feeling that mine were two small, and if so I have to test it on our bigger cluster. I 've also thought that besides QEMU version it might also be important the underlying OS, so what was your testbed? All the best, George Hi George In order to experience the error it was enough to simply run mkfs.xfs on all the volumes. In the meantime it became clear what the problem was: ~ ; cat /proc/183016/limits ... Max open files1024 4096 files .. This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files. Regards Christian On 27 May 2015, at 16:23, Jens-Christian Fischer wrote: George, I will let Christian provide you the details. As far as I know, it was enough to just do a ‘ls’ on all of the attached drives. we are using Qemu 2.0: $ dpkg -l | grep qemu ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu ii qemu-keymaps2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (arm) ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (common files) ii qemu-system-mips2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (mips) ii qemu-system-misc2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (miscelaneous) ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (ppc) ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (sparc) ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (x86) ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64QEMU utilities cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories On 26.05.2015, at 19:12, Georgios Dimitrakakis wrote: Jens-Christian, how did you test that? 
Did you just tried to write to them simultaneously? Any other tests that one can perform to verify that? In our installation we have a VM with 30 RBD volumes mounted which are all exported via NFS to other VMs. No one has complaint for the moment but the load/usage is very minimal. If this problem really exists then very soon that the trial phase will be over we will have millions of complaints :-( What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm Best regards, George I think we (i.e. Christian) found the problem: We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc, but simply to
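The connection counts George mentions above (100 and 111 towards the two OSD nodes) can be reproduced on the hypervisor with something like the following; a rough sketch, run as root so that the qemu process name is visible (OSDs listen on ports in the 6800-7300 range by default):

    # established TCP connections from qemu processes, grouped by remote host
    netstat -tnp | awk '$6 == "ESTABLISHED" && /qemu/ {split($5, a, ":"); print a[1]}' | sort | uniq -c

Comparing those totals against the qemu process's nofile limit shows how much headroom is left before the 1024 FD ceiling is reached.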
Re: [ceph-users] NFS interaction with RBD
Thanks a million for the feedback Christian! I 've tried to recreate the issue with 10RBD Volumes mounted on a single server without success! I 've issued the "mkfs.xfs" command simultaneously (or at least as fast I could do it in different terminals) without noticing any problems. Can you please tell me what was the size of each one of the RBD Volumes cause I have a feeling that mine were two small, and if so I have to test it on our bigger cluster. I 've also thought that besides QEMU version it might also be important the underlying OS, so what was your testbed? All the best, George Hi George In order to experience the error it was enough to simply run mkfs.xfs on all the volumes. In the meantime it became clear what the problem was: ~ ; cat /proc/183016/limits ... Max open files1024 4096 files .. This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files. Regards Christian On 27 May 2015, at 16:23, Jens-Christian Fischer wrote: George, I will let Christian provide you the details. As far as I know, it was enough to just do a ‘ls’ on all of the attached drives. we are using Qemu 2.0: $ dpkg -l | grep qemu ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu ii qemu-keymaps2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (arm) ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (common files) ii qemu-system-mips2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (mips) ii qemu-system-misc2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (miscelaneous) ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (ppc) ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (sparc) ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (x86) ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64QEMU utilities cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories On 26.05.2015, at 19:12, Georgios Dimitrakakis wrote: Jens-Christian, how did you test that? Did you just tried to write to them simultaneously? Any other tests that one can perform to verify that? In our installation we have a VM with 30 RBD volumes mounted which are all exported via NFS to other VMs. No one has complaint for the moment but the load/usage is very minimal. If this problem really exists then very soon that the trial phase will be over we will have millions of complaints :-( What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm Best regards, George I think we (i.e. Christian) found the problem: We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc, but simply to many connections… cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. 
Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch [3] http://www.switch.ch http://www.switch.ch/stories On 25.05.2015, at 06:02, Christian Balzer wrote: Hello, lets compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing. Ceph logs will only pipe up when things have been blocked for more than 30 seconds, NFS might take offense to lower values (or the accumulation of several distributed delays). You added 23 OSDs, tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, are these on new storage nodes (so could there be something different with those nodes?), how busy is your network, CPU. Running something like collectd to
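To hit all volumes at (almost) exactly the same time, rather than switching between terminals as George describes above, the mkfs runs can simply be backgrounded; a sketch, where the device names are placeholders for however the RBD volumes appear inside the VM:

    # format all attached volumes in parallel (this destroys any data on them)
    for dev in /dev/vd{b..k}; do
        mkfs.xfs -f "$dev" &
    done
    wait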
Re: [ceph-users] NFS interaction with RBD
To follow up on the original post, Further digging indicates this is a problem with RBD image access and is not related to NFS-RBD interaction as initially suspected. The nfsd is simply hanging as a result of a hung request to the XFS file system mounted on our RBD-NFS gateway.This hung XFS call is caused by a problem with the RBD module interacting with our Ceph pool. I've found a reliable way to trigger a hang directly on an rbd image mapped into our RBD-NFS gateway box. The image contains an XFS file system. When I try to list the contents of a particular directory, the request hangs indefinitely. Two weeks ago our ceph status was: jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status health HEALTH_WARN 1 near full osd(s) monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0 osdmap e5978: 66 osds: 66 up, 66 in pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s mdsmap e1: 0/0/1 up The near full osd was number 53 and we updated our crush map to rewieght the osd. All of the OSDs had a weight of 1 based on the assumption that all osds were 2.0TB. Apparently one of our severs had the OSDs Sized to 2.8TB and this caused the OSD imbalance eventhough we are only at 50% utilization. We reweighted the near full osd to .8 and that initiated a rebalance that has since relieved the 95% full condition on that OSD. However, since that time the repeering has not completed and we suspect this is causing problems with our access of RBD images. Our current ceph status is: jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%) monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0 osdmap e6036: 66 osds: 66 up, 66 in pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 active+clean+scrubbing, 1 remapped+peering, 3 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%) mdsmap e1: 0/0/1 up Here are further details on our stuck pgs: jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck inactive ok pg_stat objects mip degrunf bytes log disklog state state_stamp v reportedup acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck unclean ok pg_stat objects mip degrunf bytes log disklog state state_stamp v reportedup acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 3.106 11870 0 9 0 49010106368 163991 163991 active 2015-05-15 12:47:19.761469 6035'356332 5968'1358516 [62,53] [62,53] 5979'356242 2015-05-14 22:22:12.966150 5979'351351 2015-05-12 18:04:41.838686 5.104 0 0 0 0 0 0 0 active 2015-05-15 12:47:19.800676 0'0 5968'1615 [62,53] [62,53] 0'0 2015-05-14 18:43:22.425105 0'0 2015-05-08 10:19:54.938934 4.105 0 0 0 0 0 0 0 
active 2015-05-15 12:47:19.801028 0'0 5968'1615 [62,53] [62,53] 0'0 2015-05-14 18:43:04.434826 0'0 2015-05-14 18:43:04.434826 3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563 The servers in the pool are not overloaded. On the ceph server that originally had the nearly full osd, (osd 53), I'm seeing entries like this in the osd log: 2015-05-28 06:25:02.900129 7f2ea8a4f700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 1096430.805069 secs 2015-05-28 06:25:02.900145 7f2ea8a4f700 0 log [WRN] : slow request 1096
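For a pg that stays in remapped+peering like 3.3af above, the usual next step is to ask the cluster why it is stuck; a sketch using standard commands (the pg id and osd id are the ones from this thread):

    # overall detail, including stuck/unclean pgs and slow request warnings
    ceph health detail

    # ask the pg itself what it is waiting for (peering state, blocking OSDs)
    ceph pg 3.3af query

    # on newer releases, the admin socket on the OSD's host can also show in-flight ops
    # (run on the node hosting osd.53)
    ceph daemon osd.53 dump_ops_in_flight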
Re: [ceph-users] NFS interaction with RBD
Jens-Christian Fischer writes:
> I think we (i.e. Christian) found the problem:
> We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit.
>
> So no deep scrubbing etc, but simply too many connections…

Have seen mention of similar from CERN in their presentations, found this post on a quick google.. might help?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html

Cheers,
Trent
Re: [ceph-users] NFS interaction with RBD
George, I will let Christian provide you the details. As far as I know, it was enough to just do a ‘ls’ on all of the attached drives. we are using Qemu 2.0: $ dpkg -l | grep qemu ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu ii qemu-keymaps2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (arm) ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (common files) ii qemu-system-mips2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (mips) ii qemu-system-misc2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (miscelaneous) ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (ppc) ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (sparc) ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64QEMU full system emulation binaries (x86) ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64QEMU utilities cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories On 26.05.2015, at 19:12, Georgios Dimitrakakis wrote: > Jens-Christian, > > how did you test that? Did you just tried to write to them simultaneously? > Any other tests that one can perform to verify that? > > In our installation we have a VM with 30 RBD volumes mounted which are all > exported via NFS to other VMs. > No one has complaint for the moment but the load/usage is very minimal. > If this problem really exists then very soon that the trial phase will be > over we will have millions of complaints :-( > > What version of QEMU are you using? We are using the one provided by Ceph in > qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm > > Best regards, > > George > >> I think we (i.e. Christian) found the problem: >> >> We created a test VM with 9 mounted RBD volumes (no NFS server). As >> soon as he hit all disks, we started to experience these 120 second >> timeouts. We realized that the QEMU process on the hypervisor is >> opening a TCP connection to every OSD for every mounted volume - >> exceeding the 1024 FD limit. >> >> So no deep scrubbing etc, but simply to many connections… >> >> cheers >> jc >> >> -- >> SWITCH >> Jens-Christian Fischer, Peta Solutions >> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland >> phone +41 44 268 15 15, direct +41 44 268 15 71 >> jens-christian.fisc...@switch.ch [3] >> http://www.switch.ch >> >> http://www.switch.ch/stories >> >> On 25.05.2015, at 06:02, Christian Balzer wrote: >> >>> Hello, >>> >>> lets compare your case with John-Paul's. >>> >>> Different OS and Ceph versions (thus we can assume different NFS >>> versions >>> as well). >>> The only common thing is that both of you added OSDs and are likely >>> suffering from delays stemming from Ceph re-balancing or >>> deep-scrubbing. >>> >>> Ceph logs will only pipe up when things have been blocked for more >>> than 30 >>> seconds, NFS might take offense to lower values (or the accumulation >>> of >>> several distributed delays). >>> >>> You added 23 OSDs, tell us more about your cluster, HW, network. 
>>> Were these added to the existing 16 nodes, are these on new storage >>> nodes >>> (so could there be something different with those nodes?), how busy >>> is your >>> network, CPU. >>> Running something like collectd to gather all ceph perf data and >>> other >>> data from the storage nodes and then feeding it to graphite (or >>> similar) >>> can be VERY helpful to identify if something is going wrong and what >>> it is >>> in particular. >>> Otherwise run atop on your storage nodes to identify if CPU, >>> network, >>> specific HDDs/OSDs are bottlenecks. >>> >>> Deep scrubbing can be _very_ taxing, do your problems persist if >>> inject >>> into your running cluster an "osd_scrub_sleep" value of "0.5" (lower >>> that >>> until it hurts again) or if you turn of deep scrubs altogether for >>> the >>> moment? >>> >>> Christian >>> >>> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote: >>> We see something very similar on our Ceph cluster, starting as of today. We use a 16 node, 102 OSD Ceph installation as the basis for an >>>
Re: [ceph-users] NFS interaction with RBD
Jens-Christian, how did you test that? Did you just tried to write to them simultaneously? Any other tests that one can perform to verify that? In our installation we have a VM with 30 RBD volumes mounted which are all exported via NFS to other VMs. No one has complaint for the moment but the load/usage is very minimal. If this problem really exists then very soon that the trial phase will be over we will have millions of complaints :-( What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm Best regards, George I think we (i.e. Christian) found the problem: We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc, but simply to many connections… cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch [3] http://www.switch.ch http://www.switch.ch/stories On 25.05.2015, at 06:02, Christian Balzer wrote: Hello, lets compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing. Ceph logs will only pipe up when things have been blocked for more than 30 seconds, NFS might take offense to lower values (or the accumulation of several distributed delays). You added 23 OSDs, tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, are these on new storage nodes (so could there be something different with those nodes?), how busy is your network, CPU. Running something like collectd to gather all ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify if something is going wrong and what it is in particular. Otherwise run atop on your storage nodes to identify if CPU, network, specific HDDs/OSDs are bottlenecks. Deep scrubbing can be _very_ taxing, do your problems persist if inject into your running cluster an "osd_scrub_sleep" value of "0.5" (lower that until it hurts again) or if you turn of deep scrubs altogether for the moment? Christian On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote: We see something very similar on our Ceph cluster, starting as of today. We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc) On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting 6 2TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server with a 100TB volume, but that has become unwieldy). All of the servers (hypervisors, ceph storage nodes and VMs) are using Ubuntu 14.04 Yesterday evening we added 23 ODSs to the cluster bringing it up to 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark). 
The rebalancing process ended this morning (after around 12 hours) The cluster has been clean since then: cluster b1f3f4c8-x health HEALTH_OK monmap e2: 3 mons at {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319 active+clean 17 active+clean+scrubbing+deep client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s At midnight, we run a script that creates an RBD snapshot of all RBD volumes that are attached to the NFS servers (for backup purposes). Looking at our monitoring, around that time, one of the NFS servers became unresponsive and took down the complete ownCloud installation (load on the web server was > 200 and they had lost some of the NFS mounts) Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long after having run between 10 to 90 minutes. We initially suspected a corrupt rbd volume (as it seemed that we could trigger the kernel crash by just “ls -l” one of the volumes, but subsequent “xfs_repair -n” checks on those RBD volumes showed no problems. We migrated the NFS server off of its hypervisor, suspecting a problem with RBD kernel modules, rebooted the hypervisor but the problem persisted (both on the new hypervisor, and on the old one when we migrated it back) We changed the /etc/default/nfs-kernel-server to start up 256 servers (even though the defaults had been working fine for over a year) O
Re: [ceph-users] NFS interaction with RBD
I think we (i.e. Christian) found the problem: We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc, but simply to many connections… cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories On 25.05.2015, at 06:02, Christian Balzer wrote: > > Hello, > > lets compare your case with John-Paul's. > > Different OS and Ceph versions (thus we can assume different NFS versions > as well). > The only common thing is that both of you added OSDs and are likely > suffering from delays stemming from Ceph re-balancing or deep-scrubbing. > > Ceph logs will only pipe up when things have been blocked for more than 30 > seconds, NFS might take offense to lower values (or the accumulation of > several distributed delays). > > You added 23 OSDs, tell us more about your cluster, HW, network. > Were these added to the existing 16 nodes, are these on new storage nodes > (so could there be something different with those nodes?), how busy is your > network, CPU. > Running something like collectd to gather all ceph perf data and other > data from the storage nodes and then feeding it to graphite (or similar) > can be VERY helpful to identify if something is going wrong and what it is > in particular. > Otherwise run atop on your storage nodes to identify if CPU, network, > specific HDDs/OSDs are bottlenecks. > > Deep scrubbing can be _very_ taxing, do your problems persist if inject > into your running cluster an "osd_scrub_sleep" value of "0.5" (lower that > until it hurts again) or if you turn of deep scrubs altogether for the > moment? > > Christian > > On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote: > >> We see something very similar on our Ceph cluster, starting as of today. >> >> We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse >> OpenStack cluster (we applied the RBD patches for live migration etc) >> >> On this cluster we have a big ownCloud installation (Sync & Share) that >> stores its files on three NFS servers, each mounting 6 2TB RBD volumes >> and exposing them to around 10 web server VMs (we originally started >> with one NFS server with a 100TB volume, but that has become unwieldy). >> All of the servers (hypervisors, ceph storage nodes and VMs) are using >> Ubuntu 14.04 >> >> Yesterday evening we added 23 ODSs to the cluster bringing it up to 125 >> OSDs (because we had 4 OSDs that were nearing the 90% full mark). 
The >> rebalancing process ended this morning (after around 12 hours) The >> cluster has been clean since then: >> >>cluster b1f3f4c8-x >> health HEALTH_OK >> monmap e2: 3 mons at >> {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0}, >> election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap >> e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools, >> 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319 >> active+clean 17 active+clean+scrubbing+deep >> client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s >> >> At midnight, we run a script that creates an RBD snapshot of all RBD >> volumes that are attached to the NFS servers (for backup purposes). >> Looking at our monitoring, around that time, one of the NFS servers >> became unresponsive and took down the complete ownCloud installation >> (load on the web server was > 200 and they had lost some of the NFS >> mounts) >> >> Rebooting the NFS server solved that problem, but the NFS kernel server >> kept crashing all day long after having run between 10 to 90 minutes. >> >> We initially suspected a corrupt rbd volume (as it seemed that we could >> trigger the kernel crash by just “ls -l” one of the volumes, but >> subsequent “xfs_repair -n” checks on those RBD volumes showed no >> problems. >> >> We migrated the NFS server off of its hypervisor, suspecting a problem >> with RBD kernel modules, rebooted the hypervisor but the problem >> persisted (both on the new hypervisor, and on the old one when we >> migrated it back) >> >> We changed the /etc/default/nfs-kernel-server to start up 256 servers >> (even though the defaults had been working fine for over a year) >> >> Only one of our 3 NFS servers crashes (see below for syslog information) >> - the other 2 have been fine >> >> May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: >> Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory May >> 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second >> grace period (net 81cdab00) May 23 21:44:23 dri
Re: [ceph-users] NFS interaction with RBD
Hello, lets compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing. Ceph logs will only pipe up when things have been blocked for more than 30 seconds, NFS might take offense to lower values (or the accumulation of several distributed delays). You added 23 OSDs, tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, are these on new storage nodes (so could there be something different with those nodes?), how busy is your network, CPU. Running something like collectd to gather all ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify if something is going wrong and what it is in particular. Otherwise run atop on your storage nodes to identify if CPU, network, specific HDDs/OSDs are bottlenecks. Deep scrubbing can be _very_ taxing, do your problems persist if inject into your running cluster an "osd_scrub_sleep" value of "0.5" (lower that until it hurts again) or if you turn of deep scrubs altogether for the moment? Christian On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote: > We see something very similar on our Ceph cluster, starting as of today. > > We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse > OpenStack cluster (we applied the RBD patches for live migration etc) > > On this cluster we have a big ownCloud installation (Sync & Share) that > stores its files on three NFS servers, each mounting 6 2TB RBD volumes > and exposing them to around 10 web server VMs (we originally started > with one NFS server with a 100TB volume, but that has become unwieldy). > All of the servers (hypervisors, ceph storage nodes and VMs) are using > Ubuntu 14.04 > > Yesterday evening we added 23 ODSs to the cluster bringing it up to 125 > OSDs (because we had 4 OSDs that were nearing the 90% full mark). The > rebalancing process ended this morning (after around 12 hours) The > cluster has been clean since then: > > cluster b1f3f4c8-x > health HEALTH_OK > monmap e2: 3 mons at > {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0}, > election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap > e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools, > 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319 > active+clean 17 active+clean+scrubbing+deep > client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s > > At midnight, we run a script that creates an RBD snapshot of all RBD > volumes that are attached to the NFS servers (for backup purposes). > Looking at our monitoring, around that time, one of the NFS servers > became unresponsive and took down the complete ownCloud installation > (load on the web server was > 200 and they had lost some of the NFS > mounts) > > Rebooting the NFS server solved that problem, but the NFS kernel server > kept crashing all day long after having run between 10 to 90 minutes. > > We initially suspected a corrupt rbd volume (as it seemed that we could > trigger the kernel crash by just “ls -l” one of the volumes, but > subsequent “xfs_repair -n” checks on those RBD volumes showed no > problems. 
> > We migrated the NFS server off of its hypervisor, suspecting a problem > with RBD kernel modules, rebooted the hypervisor but the problem > persisted (both on the new hypervisor, and on the old one when we > migrated it back) > > We changed the /etc/default/nfs-kernel-server to start up 256 servers > (even though the defaults had been working fine for over a year) > > Only one of our 3 NFS servers crashes (see below for syslog information) > - the other 2 have been fine > > May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: > Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory May > 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second > grace period (net 81cdab00) May 23 21:44:23 drive-nfs1 > rpc.mountd[1724]: Version 1.2.8 starting May 23 21:44:28 drive-nfs1 > kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team May > 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 > (16384 buckets, 65536 max) May 23 21:44:28 drive-nfs1 kernel: > [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team May 23 > 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > > /dev/null && debian-sa1 1 1) May 23 21:45:17 drive-nfs1 > > collectd[1872]: python: Plugin loaded but not configured. May 23 > > 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering > > read-loop. May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: > > plymouth-upstart-bridge main process ended, respawning May 23 21:51:26 > > drive-nfs1 kernel: [ 600.776177] INFO:
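Christian's suggestion above about throttling or pausing scrubs can be tried at runtime without restarting anything; a sketch (the noscrub/nodeep-scrub flags are available on reasonably recent releases):

    # throttle scrubbing cluster-wide (seconds of sleep between scrub chunks)
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'

    # or temporarily stop (deep) scrubbing altogether while debugging
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # re-enable when done
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub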
Re: [ceph-users] NFS interaction with RBD
We see something very similar on our Ceph cluster, starting as of today. We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc) On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting 6 2TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server with a 100TB volume, but that has become unwieldy). All of the servers (hypervisors, ceph storage nodes and VMs) are using Ubuntu 14.04 Yesterday evening we added 23 ODSs to the cluster bringing it up to 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The rebalancing process ended this morning (after around 12 hours) The cluster has been clean since then: cluster b1f3f4c8-x health HEALTH_OK monmap e2: 3 mons at {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319 active+clean 17 active+clean+scrubbing+deep client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s At midnight, we run a script that creates an RBD snapshot of all RBD volumes that are attached to the NFS servers (for backup purposes). Looking at our monitoring, around that time, one of the NFS servers became unresponsive and took down the complete ownCloud installation (load on the web server was > 200 and they had lost some of the NFS mounts) Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long after having run between 10 to 90 minutes. We initially suspected a corrupt rbd volume (as it seemed that we could trigger the kernel crash by just “ls -l” one of the volumes, but subsequent “xfs_repair -n” checks on those RBD volumes showed no problems. We migrated the NFS server off of its hypervisor, suspecting a problem with RBD kernel modules, rebooted the hypervisor but the problem persisted (both on the new hypervisor, and on the old one when we migrated it back) We changed the /etc/default/nfs-kernel-server to start up 256 servers (even though the defaults had been working fine for over a year) Only one of our 3 NFS servers crashes (see below for syslog information) - the other 2 have been fine May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net 81cdab00) May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max) May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured. May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop. May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds. 
May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsdD 88013fd93180 0 1696 2 0x May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] 8800b2391c50 0046 8800b22f9800 8800b2391fd8 May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 00013180 00013180 8800b22f9800 880035f48240 May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] 880035f48244 8800b22f9800 880035f48248 May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace: May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [] schedule_preempt_disabled+0x29/0x70 May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [] __mutex_lock_slowpath+0x135/0x1b0 May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [] mutex_lock+0x1f/0x2f May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [] nfsd_lookup_dentry+0xa1/0x490 [nfsd] May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [] ? fh_verify+0x14b/0x5e0 [nfsd] May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [] nfsd_lookup+0x69/0x130 [nfsd] May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [] nfsd
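The 120 second figure in these traces is the kernel's hung-task timeout; when the NFS gateway locks up like this, it can help to capture the full set of blocked tasks at that moment. A sketch, run as root inside the affected machine (the sysrq dump requires kernel.sysrq to permit it):

    # the hung-task timeout currently in effect (120s by default)
    cat /proc/sys/kernel/hung_task_timeout_secs

    # dump stacks of all blocked (D-state) tasks into the kernel log, then read it
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 100

    # list processes currently stuck in uninterruptible sleep and what they wait on
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'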