To follow up on the original post,

Further digging indicates this is a problem with RBD image access and is
not related to the NFS-RBD interaction as initially suspected.  The nfsd
threads are simply hanging as a result of a hung request to the XFS file
system mounted on our RBD-NFS gateway.  That hung XFS call is in turn
caused by a problem with the RBD kernel module's interaction with our
Ceph pool.

I've found a reliable way to trigger a hang directly on an rbd image
mapped into our RBD-NFS gateway box.  The image contains an XFS file
system.  When I try to list the contents of a particular directory, the
request hangs indefinitely.
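
To illustrate, the reproduction looks roughly like this (the pool, image,
and directory names below are placeholders, not our real ones):

    # map the image on the gateway and mount its XFS file system
    sudo rbd --id nova map rbd/some-image
    sudo mount /dev/rbd/rbd/some-image /mnt/some-image

    # this hangs indefinitely for one particular directory
    ls /mnt/some-image/problem-dir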

Two weeks ago our ceph status was:

    jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
       health HEALTH_WARN 1 near full osd(s)
       monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
       osdmap e5978: 66 osds: 66 up, 66 in
        pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
       mdsmap e1: 0/0/1 up


The near-full osd was number 53, and we updated our crush map to reweight
it.  All of the OSDs had a weight of 1, based on the assumption that all
osds were 2.0TB.  It turns out one of our servers has its OSDs sized at
2.8TB, and this caused the imbalance even though we are only at about 50%
utilization overall.  We reweighted the near-full osd to 0.8, which
initiated a rebalance that has since relieved the 95% full condition on
that OSD.
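
For reference, the reweight itself was done with something along these
lines (we also touched the crush map by hand, so take the exact command
with a grain of salt):

    # lower the crush weight of the near-full osd, then watch the rebalance
    sudo ceph --id nova osd crush reweight osd.53 0.8
    sudo ceph --id nova -w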

However, since that time the re-peering has not completed, and we suspect
this is what is causing the problems with our access to RBD images.  Our
current ceph status is:

    jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
       health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%)
       monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
       osdmap e6036: 66 osds: 66 up, 66 in
        pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 active+clean+scrubbing, 1 remapped+peering, 3 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
       mdsmap e1: 0/0/1 up


Here are further details on our stuck pgs:

    jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck inactive
    ok
    pg_stat  objects  mip  degr  unf  bytes  log  disklog  state  state_stamp  v  reported  up  acting  last_scrub  scrub_stamp  last_deep_scrub  deep_scrub_stamp
    3.3af  11600  0  0  0  47941791744  153812  153812  remapped+peering  2015-05-15 12:47:17.223786  5979'293066  6000'1248735  [48,62]  [53,48,62]  5979'293056  2015-05-15 07:40:36.275563  5979'293056  2015-05-15 07:40:36.275563

    jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck unclean
    ok
    pg_stat  objects  mip  degr  unf  bytes  log  disklog  state  state_stamp  v  reported  up  acting  last_scrub  scrub_stamp  last_deep_scrub  deep_scrub_stamp
    3.106  11870  0  9  0  49010106368  163991  163991  active  2015-05-15 12:47:19.761469  6035'356332  5968'1358516  [62,53]  [62,53]  5979'356242  2015-05-14 22:22:12.966150  5979'351351  2015-05-12 18:04:41.838686
    5.104  0  0  0  0  0  0  0  active  2015-05-15 12:47:19.800676  0'0  5968'1615  [62,53]  [62,53]  0'0  2015-05-14 18:43:22.425105  0'0  2015-05-08 10:19:54.938934
    4.105  0  0  0  0  0  0  0  active  2015-05-15 12:47:19.801028  0'0  5968'1615  [62,53]  [62,53]  0'0  2015-05-14 18:43:04.434826  0'0  2015-05-14 18:43:04.434826
    3.3af  11600  0  0  0  47941791744  153812  153812  remapped+peering  2015-05-15 12:47:17.223786  5979'293066  6000'1248735  [48,62]  [53,48,62]  5979'293056  2015-05-15 07:40:36.275563  5979'293056  2015-05-15 07:40:36.275563


The servers in the pool are not overloaded.  On the ceph server that
originally had the nearly full osd (osd 53), I'm seeing entries like
this in the osd log:

    2015-05-28 06:25:02.900129 7f2ea8a4f700  0 log [WRN] : 6 slow
    requests, 6 included below; oldest blocked for > 1096430.805069 secs
    2015-05-28 06:25:02.900145 7f2ea8a4f700  0 log [WRN] : slow request
    1096430.805069 seconds old, received at 2015-05-15 13:51:12.094956:
    osd_op
    (client.153458.0:274955541 rb.0.d204.238e1f29.0000000844d2
    [sparse-read 3706880~4096] 3.773a03af) v4 currently reached pg
    2015-05-28 06:25:02.900156 7f2ea8a4f700  0 log [WRN] : slow request
    1095097.235862 seconds old, received at 2015-05-15 14:13:25.664163:
    osd_op(client.154124.1:68230513 rb.0.176f3.238e1f29.000000002cd8
    [write 0~524288] 3.4329faf) v4 currently reached pg
    2015-05-28 06:25:02.900162 7f2ea8a4f700  0 log [WRN] : slow request
    1095097.214366 seconds old, received at 2015-05-15 14:13:25.685659:
    osd_op(client.154124.1:68230514 rb.0.176f3.238e1f29.000000002cd8
    [write 524288~524288] 3.4329faf) v4 currently reached pg
    2015-05-28 06:25:02.900169 7f2ea8a4f700  0 log [WRN] : slow request
    1095097.191725 seconds old, received at 2015-05-15 14:13:25.708300:
    osd_op(client.154124.1:68230515 rb.0.176f3.238e1f29.000000002cd8
    [write 1048576~524288] 3.4329faf) v4 currently reached pg
    2015-05-28 06:25:02.900175 7f2ea8a4f700  0 log [WRN] : slow request
    1095097.169399 seconds old, received at 2015-05-15 14:13:25.730626:
    osd_op(client.154124.1:68230516
    ...


These messages repeat once per second, and the only thing that changes
across the whole batch is that the "seconds old" value keeps increasing.

According to dmesg, syslog, and the BMC, there are no current hardware
malfunctions on this machine.

I'm assuming that whatever is causing the slow requests in the osd logs
might be gumming up the works.
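
If it would help, I can also dump the in-flight ops from the osd's admin
socket on that host; something like the following (run on the server
carrying osd 53, assuming the default cluster name) should show what
those requests are actually blocked on:

    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.53.asok dump_ops_in_flight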

What I don't understand is how to get the pg out of the stuck inactive
state.  The hints in the troubleshooting documentation don't seem to
apply in this case.
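
Unless someone has a better idea, my next step is roughly to query the
stuck pg and, if that hangs or points back at osd 53, to restart that osd
daemon and let it try to peer again:

    # query the stuck pg (this may itself hang if the pg is wedged)
    sudo ceph --id nova pg 3.3af query

    # restart the suspect osd on its host to kick peering off again
    # (Upstart-style; 'sudo service ceph restart osd.53' on sysvinit setups)
    sudo restart ceph-osd id=53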

Finally, here is the health detail; it largely repeats the above:

    jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova osd lspools
    3 rbd,4 data,5 metadata,
    jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova health detail
    HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%)
    pg 3.3af is stuck inactive for 1132739.583540, current state remapped+peering, last acting [53,48,62]
    pg 3.106 is stuck unclean for 1131876.351202, current state active, last acting [62,53]
    pg 5.104 is stuck unclean for 1131876.350931, current state active, last acting [62,53]
    pg 4.105 is stuck unclean for 1131876.351729, current state active, last acting [62,53]
    pg 3.3af is stuck unclean for 1132739.583697, current state remapped+peering, last acting [53,48,62]
    pg 3.3af is remapped+peering, acting [53,48,62]
    recovery 9/23842120 degraded (0.000%)

I'd like to get back to ceph health OK to see whether that is what's
contributing to the problem.

Thanks for any pointers.

~jpr

On 05/24/2015 11:02 PM, Christian Balzer wrote:
> Hello,
>
> lets compare your case with John-Paul's.
>
> Different OS and Ceph versions (thus we can assume different NFS versions
> as well).
> The only common thing is that both of you added OSDs and are likely
> suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
>
> Ceph logs will only pipe up when things have been blocked for more than 30
> seconds; NFS might take offense to lower values (or the accumulation of
> several distributed delays).
>
> You added 23 OSDs, tell us more about your cluster, HW, network.
> Were these added to the existing 16 nodes, are these on new storage nodes
> (so could there be something different with those nodes?), how busy is your
> network, CPU.
> Running something like collectd to gather all ceph perf data and other
> data from the storage nodes and then feeding it to graphite (or similar)
> can be VERY helpful to identify if something is going wrong and what it is
> in particular.
> Otherwise run atop on your storage nodes to identify if CPU, network,
> specific HDDs/OSDs are bottlenecks. 
>
> Deep scrubbing can be _very_ taxing; do your problems persist if you inject
> into your running cluster an "osd_scrub_sleep" value of "0.5" (lower that
> until it hurts again), or if you turn off deep scrubs altogether for the
> moment?
>
> Christian
>
> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
>
>> We see something very similar on our Ceph cluster, starting as of today.
>>
>> We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse
>> OpenStack cluster (we applied the RBD patches for live migration etc)
>>
>> On this cluster we have a big ownCloud installation (Sync & Share) that
>> stores its files on three NFS servers, each mounting 6 2TB RBD volumes
>> and exposing them to around 10 web server VMs (we originally started
>> with one NFS server with a 100TB volume, but that has become unwieldy).
>> All of the servers (hypervisors, ceph storage nodes and VMs) are using
>> Ubuntu 14.04
>>
>> Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125
>> OSDs (because we had 4 OSDs that were nearing the 90% full mark). The
>> rebalancing process ended this morning (after around 12 hours) The
>> cluster has been clean since then:
>>
>>     cluster b1f3f4c8-xxxxx
>>      health HEALTH_OK
>>      monmap e2: 3 mons at
>> {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0},
>> election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap
>> e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools,
>> 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319
>> active+clean 17 active+clean+scrubbing+deep
>>   client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
>>
>> At midnight, we run a script that creates an RBD snapshot of all RBD
>> volumes that are attached to the NFS servers (for backup purposes).
>> Looking at our monitoring, around that time, one of the NFS servers
>> became unresponsive and took down the complete ownCloud installation
>> (load on the web server was > 200 and they had lost some of the NFS
>> mounts)
>>
>> Rebooting the NFS server solved that problem, but the NFS kernel server
>> kept crashing all day long after having run between 10 to 90 minutes.
>>
>> We initially suspected a corrupt rbd volume (as it seemed that we could
>> trigger the kernel crash by just running “ls -l” on one of the volumes),
>> but subsequent “xfs_repair -n” checks on those RBD volumes showed no
>> problems.
>>
>> We migrated the NFS server off of its hypervisor, suspecting a problem
>> with RBD kernel modules, rebooted the hypervisor but the problem
>> persisted (both on the new hypervisor, and on the old one when we
>> migrated it back)
>>
>> We changed the /etc/default/nfs-kernel-server to start up 256 servers
>> (even though the defaults had been working fine for over a year)
>>
>> Only one of our 3 NFS servers crashes (see below for syslog information)
>> - the other 2 have been fine
>>
>> May 23 21:44:10 drive-nfs1 kernel: [  165.264648] NFSD:
>> Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory May
>> 23 21:44:19 drive-nfs1 kernel: [  173.880092] NFSD: starting 90-second
>> grace period (net ffffffff81cdab00) May 23 21:44:23 drive-nfs1
>> rpc.mountd[1724]: Version 1.2.8 starting May 23 21:44:28 drive-nfs1
>> kernel: [  182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team May
>> 23 21:44:28 drive-nfs1 kernel: [  182.958465] nf_conntrack version 0.5.0
>> (16384 buckets, 65536 max) May 23 21:44:28 drive-nfs1 kernel:
>> [  183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team May 23
>> 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1
>>> /dev/null && debian-sa1 1 1) May 23 21:45:17 drive-nfs1
>>> collectd[1872]: python: Plugin loaded but not configured. May 23
>>> 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering
>>> read-loop. May 23 21:47:11 drive-nfs1 kernel: [  346.392283] init:
>>> plymouth-upstart-bridge main process ended, respawning May 23 21:51:26
>>> drive-nfs1 kernel: [  600.776177] INFO: task nfsd:1696 blocked for
>>> more than 120 seconds.
>> May 23 21:51:26 drive-nfs1 kernel: [  600.778090]       Not tainted
>> 3.13.0-53-generic #89-Ubuntu May 23 21:51:26 drive-nfs1 kernel:
>> [  600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message. May 23 21:51:26 drive-nfs1 kernel: [  600.781504]
>> nfsd            D ffff88013fd93180     0  1696      2 0x00000000 May 23
>> 21:51:26 drive-nfs1 kernel: [  600.781508]  ffff8800b2391c50
>> 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8 May 23 21:51:26
>> drive-nfs1 kernel: [  600.781511]  0000000000013180 0000000000013180
>> ffff8800b22f9800 ffff880035f48240 May 23 21:51:26 drive-nfs1 kernel:
>> [  600.781513]  ffff880035f48244 ffff8800b22f9800 00000000ffffffff
>> ffff880035f48248 May 23 21:51:26 drive-nfs1 kernel: [  600.781515] Call
>> Trace: May 23 21:51:26 drive-nfs1 kernel: [  600.781523]
>> [<ffffffff81727749>] schedule_preempt_disabled+0x29/0x70 May 23 21:51:26
>> drive-nfs1 kernel: [  600.781526]  [<ffffffff817295b5>]
>> __mutex_lock_slowpath+0x135/0x1b0 May 23 21:51:26 drive-nfs1 kernel:
>> [  600.781528]  [<ffffffff8172964f>] mutex_lock+0x1f/0x2f May 23
>> 21:51:26 drive-nfs1 kernel: [  600.781557]  [<ffffffffa03b1761>]
>> nfsd_lookup_dentry+0xa1/0x490 [nfsd] May 23 21:51:26 drive-nfs1 kernel:
>> [  600.781568]  [<ffffffffa03b044b>] ? fh_verify+0x14b/0x5e0 [nfsd] May
>> 23 21:51:26 drive-nfs1 kernel: [  600.781591]  [<ffffffffa03b1bb9>]
>> nfsd_lookup+0x69/0x130 [nfsd] May 23 21:51:26 drive-nfs1 kernel:
>> [  600.781613]  [<ffffffffa03be90a>] nfsd4_lookup+0x1a/0x20 [nfsd] May
>> 23 21:51:26 drive-nfs1 kernel: [  600.781628]  [<ffffffffa03c055a>]
>> nfsd4_proc_compound+0x56a/0x7d0 [nfsd] May 23 21:51:26 drive-nfs1
>> kernel: [  600.781638]  [<ffffffffa03acd3b>] nfsd_dispatch+0xbb/0x200
>> [nfsd] May 23 21:51:26 drive-nfs1 kernel: [  600.781662]
>> [<ffffffffa028762d>] svc_process_common+0x46d/0x6d0 [sunrpc] May 23
>> 21:51:26 drive-nfs1 kernel: [  600.781678]  [<ffffffffa0287997>]
>> svc_process+0x107/0x170 [sunrpc] May 23 21:51:26 drive-nfs1 kernel:
>> [  600.781687]  [<ffffffffa03ac71f>] nfsd+0xbf/0x130 [nfsd] May 23
>> 21:51:26 drive-nfs1 kernel: [  600.781696]  [<ffffffffa03ac660>] ?
>> nfsd_destroy+0x80/0x80 [nfsd] May 23 21:51:26 drive-nfs1 kernel:
>> [  600.781702]  [<ffffffff8108b6b2>] kthread+0xd2/0xf0 May 23 21:51:26
>> drive-nfs1 kernel: [  600.781707]  [<ffffffff8108b5e0>] ?
>> kthread_create_on_node+0x1c0/0x1c0 May 23 21:51:26 drive-nfs1 kernel:
>> [  600.781712]  [<ffffffff81733868>] ret_from_fork+0x58/0x90 May 23
>> 21:51:26 drive-nfs1 kernel: [  600.781717]  [<ffffffff8108b5e0>] ?
>> kthread_create_on_node+0x1c0/0x1c0
>>
>> Before each crash, we see the disk utilization of one or two random
>> mounted RBD volumes to go to 100% - there is no pattern on which of the
>> RBD disks start to act up.
>>
>> We have scoured the log files of the Ceph cluster for any signs of
>> problems but came up empty.
>>
>> The NFS server has almost no load (compared to regular usage) as most
>> sync clients are either turned off (weekend) or have given up connecting
>> to the server. 
>>
>> There haven't been any configuration change on the NFS servers prior to
>> the problems. The only change was the adding of 23 OSDs.
>>
>> We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>
>> Our team is completely out of ideas. We have removed the 100TB volume
>> from the nfs server (we used the downtime to migrate the last data off
>> of it to one of the smaller volumes). The NFS server has been running
>> for 30 minutes now (with close to no load) but we don’t really expect it
>> to make it until tomorrow.
>>
>> send help
>> Jens-Christian
>>
>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
