In the end this came down to one slow OSD.  There were no hardware
issues, so I have to assume something got gummed up during the
rebalancing and peering.

I set the cluster to noout and restarted the OSD process.  After the
OSD came back up, the rebalance completed and the cluster returned to
HEALTH_OK.
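
For anyone following along, the recovery amounted to roughly the
following (the exact restart command depends on your init system, so
treat this as a sketch rather than a recipe):

    # keep CRUSH from marking the OSD out while it is bounced
    sudo ceph --id nova osd set noout

    # on the host holding the slow OSD (osd.53 in our case), restart the daemon
    # (upstart on Ubuntu; sysvinit clusters would use "service ceph restart osd.53")
    sudo stop ceph-osd id=53
    sudo start ceph-osd id=53

    # once the cluster is back to HEALTH_OK
    sudo ceph --id nova osd unset noout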

As soon as the OSD restarted, all previously hanging operations
returned to normal.

I'm surprised that a single slow OSD can impact access to the entire
cluster.  I understand now that reads are served only by the primary
OSD, and writes must go to the primary and then to the replicas, but I
would have expected the impact to be more contained.
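
(For anyone curious how to check this, you can ask the cluster which
OSD is primary for a given RBD object; the object name below is just
one lifted from the slow-request log quoted further down:)

    # map an RBD data object to its PG and OSDs; the first OSD listed in
    # the acting set is the primary that serves reads and coordinates writes
    sudo ceph --id nova osd map rbd rb.0.176f3.238e1f29.000000002cd8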

We currently build XFS file systems directly on RBD images.  I'm
wondering whether there would be any value in adding an LVM layer on
top to spread access across more OSDs for read and failure scenarios.
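
Roughly what I have in mind is something like the sketch below (image,
VG, and LV names are made up, and I'm not claiming it actually helps;
that is the question):

    # map several smaller RBD images into the gateway (names hypothetical)
    sudo rbd map rbd/store-a    # e.g. /dev/rbd0
    sudo rbd map rbd/store-b    # e.g. /dev/rbd1
    sudo rbd map rbd/store-c    # e.g. /dev/rbd2

    # stripe a logical volume across the images (stripe size left at the
    # default here; it could be tuned toward the 4 MB RBD object size)
    sudo pvcreate /dev/rbd0 /dev/rbd1 /dev/rbd2
    sudo vgcreate rbdvg /dev/rbd0 /dev/rbd1 /dev/rbd2
    sudo lvcreate --stripes 3 --extents 100%FREE --name export rbdvg

    # XFS on top, as we do today
    sudo mkfs.xfs /dev/rbdvg/export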

Any thoughts on the above appreciated.

~jpr


On 05/28/2015 03:18 PM, John-Paul Robinson wrote:
> To follow up on the original post,
>
> Further digging indicates this is a problem with RBD image access and
> is not related to NFS-RBD interaction as initially suspected.  The
> nfsd is simply hanging as a result of a hung request to the XFS file
> system mounted on our RBD-NFS gateway.    This hung XFS call is caused
> by a problem with the RBD module interacting with our Ceph pool.
>
> I've found a reliable way to trigger a hang directly on an rbd image
> mapped into our RBD-NFS gateway box.  The image contains an XFS file
> system.  When I try to list the contents of a particular directory,
> the request hangs indefinitely.
>
> Two weeks ago our ceph status was:
>
>     jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
>        health HEALTH_WARN 1 near full osd(s)
>        monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
>        osdmap e5978: 66 osds: 66 up, 66 in
>         pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
>        mdsmap e1: 0/0/1 up
>
>
> The near full osd was number 53 and we updated our crush map to
> reweight the osd.  All of the OSDs had a weight of 1 based on the
> assumption that all OSDs were 2.0TB.  Apparently one of our servers had
> its OSDs sized at 2.8TB and this caused the OSD imbalance even though
> we are only at 50% utilization.  We reweighted the near full osd to 0.8
> and that initiated a rebalance that has since relieved the 95% full
> condition on that OSD.
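>
> (For reference, that reweight can be done straight from the CLI; the
> exact command we used may have differed slightly, but it was along
> these lines:)
>
>     # lower the CRUSH weight of the near-full OSD so data moves off of it
>     sudo ceph --id nova osd crush reweight osd.53 0.8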
>
> However, since that time the re-peering has not completed and we
> suspect this is causing problems with our access to RBD images.  Our
> current ceph status is:
>
>     jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
>        health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%)
>        monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
>        osdmap e6036: 66 osds: 66 up, 66 in
>         pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 active+clean+scrubbing, 1 remapped+peering, 3 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
>        mdsmap e1: 0/0/1 up
>
>
> Here are further details on our stuck pgs:
>
>     jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck inactive
>     ok
>     pg_stat objects mip     degr    unf     bytes   log     disklog state   state_stamp     v       reported        up      acting  last_scrub      scrub_stamp     last_deep_scrub deep_scrub_stamp
>     3.3af   11600   0       0       0       47941791744     153812  153812  remapped+peering        2015-05-15 12:47:17.223786      5979'293066     6000'1248735    [48,62] [53,48,62]      5979'293056     2015-05-15 07:40:36.275563      5979'293056     2015-05-15 07:40:36.275563
>
>     jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck unclean
>     ok
>     pg_stat objects mip     degr    unf     bytes   log     disklog state   state_stamp     v       reported        up      acting  last_scrub      scrub_stamp     last_deep_scrub deep_scrub_stamp
>     3.106   11870   0       9       0       49010106368     163991  163991  active  2015-05-15 12:47:19.761469      6035'356332     5968'1358516    [62,53] [62,53] 5979'356242     2015-05-14 22:22:12.966150      5979'351351     2015-05-12 18:04:41.838686
>     5.104   0       0       0       0       0       0       0       active  2015-05-15 12:47:19.800676      0'0     5968'1615       [62,53] [62,53] 0'0     2015-05-14 18:43:22.425105      0'0     2015-05-08 10:19:54.938934
>     4.105   0       0       0       0       0       0       0       active  2015-05-15 12:47:19.801028      0'0     5968'1615       [62,53] [62,53] 0'0     2015-05-14 18:43:04.434826      0'0     2015-05-14 18:43:04.434826
>     3.3af   11600   0       0       0       47941791744     153812  153812  remapped+peering        2015-05-15 12:47:17.223786      5979'293066     6000'1248735    [48,62] [53,48,62]      5979'293056     2015-05-15 07:40:36.275563      5979'293056     2015-05-15 07:40:36.275563
>
>
> The servers in the pool are not overloaded.  On the ceph server that
> originally had the nearly full OSD (osd 53), I'm seeing entries like
> this in the osd log:
>
>     2015-05-28 06:25:02.900129 7f2ea8a4f700  0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 1096430.805069 secs
>     2015-05-28 06:25:02.900145 7f2ea8a4f700  0 log [WRN] : slow request 1096430.805069 seconds old, received at 2015-05-15 13:51:12.094956: osd_op(client.153458.0:274955541 rb.0.d204.238e1f29.0000000844d2 [sparse-read 3706880~4096] 3.773a03af) v4 currently reached pg
>     2015-05-28 06:25:02.900156 7f2ea8a4f700  0 log [WRN] : slow request 1095097.235862 seconds old, received at 2015-05-15 14:13:25.664163: osd_op(client.154124.1:68230513 rb.0.176f3.238e1f29.000000002cd8 [write 0~524288] 3.4329faf) v4 currently reached pg
>     2015-05-28 06:25:02.900162 7f2ea8a4f700  0 log [WRN] : slow request 1095097.214366 seconds old, received at 2015-05-15 14:13:25.685659: osd_op(client.154124.1:68230514 rb.0.176f3.238e1f29.000000002cd8 [write 524288~524288] 3.4329faf) v4 currently reached pg
>     2015-05-28 06:25:02.900169 7f2ea8a4f700  0 log [WRN] : slow request 1095097.191725 seconds old, received at 2015-05-15 14:13:25.708300: osd_op(client.154124.1:68230515 rb.0.176f3.238e1f29.000000002cd8 [write 1048576~524288] 3.4329faf) v4 currently reached pg
>     2015-05-28 06:25:02.900175 7f2ea8a4f700  0 log [WRN] : slow request 1095097.169399 seconds old, received at 2015-05-15 14:13:25.730626: osd_op(client.154124.1:68230516
>     ...
>
>
> These entries appear once per second, and the only thing that changes
> across the whole batch of messages is that the "seconds old" value
> increases.
>
> According to dmesg, syslog, and the BMC there are no current hardware
> malfunctions on this machine.
>
> I'm assuming that whatever is causing the slow requests in the osd
> logs might be gumming up the works.
>
> What I don't get is how to get my pg out of the stuck inactive state.
> The troubleshooting hints in the docs don't seem to apply in this case.
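>
> (In case it helps anyone spot what I'm missing, these are the sorts of
> things I've been poking at; the pg id and osd number come from the
> output above, and the admin socket path may differ on other setups:)
>
>     # ask the stuck PG what it thinks is blocking peering
>     sudo ceph --id nova pg 3.3af query
>
>     # on the OSD host, dump the ops that osd.53 is sitting on
>     sudo ceph --admin-daemon /var/run/ceph/ceph-osd.53.asok dump_ops_in_flight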
>
> Finally, here is the health detail; it mostly repeats the above:
>
>     jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova osd lspools
>     3 rbd,4 data,5 metadata,
>     jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova health detail
>     HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%)
>     pg 3.3af is stuck inactive for 1132739.583540, current state remapped+peering, last acting [53,48,62]
>     pg 3.106 is stuck unclean for 1131876.351202, current state active, last acting [62,53]
>     pg 5.104 is stuck unclean for 1131876.350931, current state active, last acting [62,53]
>     pg 4.105 is stuck unclean for 1131876.351729, current state active, last acting [62,53]
>     pg 3.3af is stuck unclean for 1132739.583697, current state remapped+peering, last acting [53,48,62]
>     pg 3.3af is remapped+peering, acting [53,48,62]
>     recovery 9/23842120 degraded (0.000%)
>
> I'd like to get back to HEALTH_OK to see whether that's contributing
> to the problem.
>
> Thanks for any pointers.
>
> ~jpr
>
> On 05/24/2015 11:02 PM, Christian Balzer wrote:
>> Hello,
>>
>> lets compare your case with John-Paul's.
>>
>> Different OS and Ceph versions (thus we can assume different NFS versions
>> as well).
>> The only common thing is that both of you added OSDs and are likely
>> suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
>>
>> Ceph logs will only pipe up when things have been blocked for more than 30
>> seconds; NFS might take offense at lower values (or at the accumulation of
>> several distributed delays).
>>
>> You added 23 OSDs; tell us more about your cluster, HW, and network.
>> Were these added to the existing 16 nodes, or are they on new storage nodes
>> (i.e. could there be something different about those nodes)? How busy are
>> your network and CPUs?
>> Running something like collectd to gather all Ceph perf data and other
>> data from the storage nodes and then feeding it to graphite (or similar)
>> can be VERY helpful in identifying whether something is going wrong and
>> what it is in particular.
>> Otherwise run atop on your storage nodes to see whether CPU, network, or
>> specific HDDs/OSDs are the bottleneck.
>>
>> Deep scrubbing can be _very_ taxing. Do your problems persist if you inject
>> an "osd_scrub_sleep" value of "0.5" into your running cluster (lower that
>> until it hurts again), or if you turn off deep scrubs altogether for the
>> moment?
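>>
>> (Something along these lines; adjust the sleep value to your cluster:)
>>
>>     # throttle scrubbing on all OSDs without restarting them
>>     ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'
>>
>>     # or simply stop deep scrubs for the time being
>>     ceph osd set nodeep-scrub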
>>
>> Christian
>>
>> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
>>
>>> We see something very similar on our Ceph cluster, starting as of today.
>>>
>>> We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse
>>> OpenStack cluster (we applied the RBD patches for live migration etc)
>>>
>>> On this cluster we have a big ownCloud installation (Sync & Share) that
>>> stores its files on three NFS servers, each mounting six 2TB RBD volumes
>>> and exposing them to around 10 web server VMs (we originally started
>>> with one NFS server with a 100TB volume, but that had become unwieldy).
>>> All of the servers (hypervisors, ceph storage nodes, and VMs) are running
>>> Ubuntu 14.04.
>>>
>>> Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125
>>> OSDs (because we had 4 OSDs that were nearing the 90% full mark). The
>>> rebalancing process ended this morning (after around 12 hours). The
>>> cluster has been clean since then:
>>>
>>>     cluster b1f3f4c8-xxxxx
>>>      health HEALTH_OK
>>>      monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
>>>      osdmap e43476: 125 osds: 125 up, 125 in
>>>      pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
>>>            266 TB used, 187 TB / 454 TB avail
>>>            3319 active+clean
>>>              17 active+clean+scrubbing+deep
>>>   client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
>>>
>>> At midnight, we run a script that creates an RBD snapshot of every RBD
>>> volume attached to the NFS servers (for backup purposes).
>>> Looking at our monitoring, around that time one of the NFS servers
>>> became unresponsive and took down the complete ownCloud installation
>>> (load on the web servers was > 200 and they had lost some of the NFS
>>> mounts).
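>>>
>>> (The script essentially loops over the exported volumes doing something
>>> like the following; pool, image, and snapshot names here are only
>>> illustrative:)
>>>
>>>     # nightly snapshot of one NFS-exported RBD volume
>>>     rbd snap create volumes/nfs-data-01@backup-20150523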
>>>
>>> Rebooting the NFS server solved that problem, but the NFS kernel server
>>> kept crashing all day long after running for between 10 and 90 minutes.
>>>
>>> We initially suspected a corrupt RBD volume (as it seemed that we could
>>> trigger the kernel crash just by running “ls -l” on one of the volumes),
>>> but subsequent “xfs_repair -n” checks on those RBD volumes showed no
>>> problems.
>>>
>>> We migrated the NFS server off of its hypervisor, suspecting a problem
>>> with the RBD kernel modules, and rebooted the hypervisor, but the problem
>>> persisted (both on the new hypervisor, and on the old one when we
>>> migrated it back).
>>>
>>> We changed /etc/default/nfs-kernel-server to start 256 server threads
>>> (even though the defaults had been working fine for over a year).
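>>>
>>> (Concretely, that is just the nfsd thread count in the defaults file:)
>>>
>>>     # /etc/default/nfs-kernel-server
>>>     RPCNFSDCOUNT=256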
>>>
>>> Only one of our 3 NFS servers crashes (see below for syslog information);
>>> the other 2 have been fine:
>>>
>>> May 23 21:44:10 drive-nfs1 kernel: [  165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>>> May 23 21:44:19 drive-nfs1 kernel: [  173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
>>> May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
>>> May 23 21:44:28 drive-nfs1 kernel: [  182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
>>> May 23 21:44:28 drive-nfs1 kernel: [  182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
>>> May 23 21:44:28 drive-nfs1 kernel: [  183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
>>> May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>> May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
>>> May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
>>> May 23 21:47:11 drive-nfs1 kernel: [  346.392283] init: plymouth-upstart-bridge main process ended, respawning
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.778090]       Not tainted 3.13.0-53-generic #89-Ubuntu
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781504] nfsd            D ffff88013fd93180     0  1696      2 0x00000000
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781508]  ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781511]  0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781513]  ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781515] Call Trace:
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781523]  [<ffffffff81727749>] schedule_preempt_disabled+0x29/0x70
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781526]  [<ffffffff817295b5>] __mutex_lock_slowpath+0x135/0x1b0
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781528]  [<ffffffff8172964f>] mutex_lock+0x1f/0x2f
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781557]  [<ffffffffa03b1761>] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781568]  [<ffffffffa03b044b>] ? fh_verify+0x14b/0x5e0 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781591]  [<ffffffffa03b1bb9>] nfsd_lookup+0x69/0x130 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781613]  [<ffffffffa03be90a>] nfsd4_lookup+0x1a/0x20 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781628]  [<ffffffffa03c055a>] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781638]  [<ffffffffa03acd3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781662]  [<ffffffffa028762d>] svc_process_common+0x46d/0x6d0 [sunrpc]
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781678]  [<ffffffffa0287997>] svc_process+0x107/0x170 [sunrpc]
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781687]  [<ffffffffa03ac71f>] nfsd+0xbf/0x130 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781696]  [<ffffffffa03ac660>] ? nfsd_destroy+0x80/0x80 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781702]  [<ffffffff8108b6b2>] kthread+0xd2/0xf0
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781707]  [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781712]  [<ffffffff81733868>] ret_from_fork+0x58/0x90
>>> May 23 21:51:26 drive-nfs1 kernel: [  600.781717]  [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
>>>
>>> Before each crash, we see the disk utilization of one or two randomly
>>> chosen mounted RBD volumes go to 100% - there is no pattern as to which
>>> of the RBD disks starts to act up.
>>>
>>> We have scoured the log files of the Ceph cluster for any signs of
>>> problems but came up empty.
>>>
>>> The NFS server has almost no load (compared to regular usage) as most
>>> sync clients are either turned off (weekend) or have given up connecting
>>> to the server. 
>>>
>>> There haven't been any configuration changes on the NFS servers prior to
>>> the problems. The only change was the addition of 23 OSDs.
>>>
>>> We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>>
>>> Our team is completely out of ideas. We have removed the 100TB volume
>>> from the nfs server (we used the downtime to migrate the last data off
>>> of it to one of the smaller volumes). The NFS server has been running
>>> for 30 minutes now (with close to no load) but we don’t really expect it
>>> to make it until tomorrow.
>>>
>>> send help
>>> Jens-Christian
>>>
>
