We haven't submitted a ticket because we've simply avoided using the kernel
client.  We've periodically tried with various kernels and various versions
of Ceph over the last two years, but have given up each time and reverted
to rbd-fuse, which, although not super stable, at least doesn't hang the
client box.  We now find ourselves in a position where, for additional
functionality, we *need* an actual block device, so we have to find a
kernel client that works.  I will certainly keep you posted and can
produce the output you've requested.
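
Concretely, once a client locks up I plan to capture the output you've
asked for (quoted below) in a loop, something like the following; the
sample count, interval, and log path are just my guesses:

    #!/bin/sh
    # Grab the osdmap/osdc debug state and the cluster status several
    # times in a row, so we can see whether the client's osdmap epoch
    # falls behind the cluster's.  Assumes debugfs is mounted at
    # /sys/kernel/debug.
    for i in 1 2 3 4 5; do
        echo "=== sample $i: $(date) ==="
        cat /sys/kernel/debug/ceph/*/osdmap
        cat /sys/kernel/debug/ceph/*/osdc
        ceph status
        sleep 10
    done > /tmp/rbd-hang-debug.log 2>&1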

I'd also be willing to run an early 4.5 version in our test environment.
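
On the D state hangs: for what it's worth, this is roughly how we confirm
processes are stuck before we hard reboot a client (just a sketch; the
column widths are arbitrary):

    # List processes in uninterruptible sleep (D state) along with the
    # kernel function they're blocked in; anything touching the
    # RBD-backed filesystem shows up here.
    ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'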

On Tue, Dec 8, 2015 at 3:35 AM, Ilya Dryomov <idryo...@gmail.com> wrote:

> On Tue, Dec 8, 2015 at 10:57 AM, Tom Christensen <pav...@gmail.com> wrote:
> > We aren't running NFS, but we regularly use the kernel driver to map
> > RBDs and mount filesystems on them.  We see very similar behavior
> > across nearly all kernel versions we've tried.  In my experience,
> > very few versions of the kernel driver survive any sort of CRUSH map
> > change/update while something is mapped.  In fact, in the last two
> > years I think I've only seen this work on one kernel version, and
> > unfortunately it's badly out of date and we can't run it in our
> > environment anymore; I think it was a 3.0 kernel running on Ubuntu
> > 12.04.  We have just recently started trying to find a kernel that
> > will survive OSD outages or changes to the cluster.  We're on Ubuntu
> > 14.04, and have tried 3.16, 3.19.0-25, 4.3, and 4.2 without success
> > in the last week.  We only map 1-3 RBDs per client machine at a time,
> > but we regularly get processes stuck in D state while accessing the
> > filesystem inside the RBD, and we have to hard reboot the RBD client
> > machine.  This is always associated with a cluster change of some
> > kind: reweighting OSDs, rebooting an OSD host, restarting an
> > individual OSD, adding OSDs, and removing OSDs all cause the kernel
> > client to hang.  If no change is made to the cluster, the kernel
> > client will be happy for weeks.
>
> There are a couple of known bugs in the remap/resubmit area, but those
> are supposedly corner cases (like *all* the OSDs going down and then
> back up, etc).  I had no idea it was this severe or went back this
> far.  Apparently triggering it requires a heavier load, as we've never
> seen anything like that in our tests.
>
> For unrelated reasons, the remap/resubmit code is getting entirely
> rewritten for kernel 4.5, so, if you've been dealing with this issue
> for the last two years (I don't remember seeing any tickets listing
> that many kernel versions and not mentioning NFS), I'm afraid the best
> course of action for you would be to wait for 4.5 to come out and try
> it.  If you'd be willing to test out an early version on one or more
> of your client boxes, I can ping you when it's ready.
>
> I'll take a look at 3.0 vs 3.16 with an eye on the remap code.  Did you
> happen to try 3.10?
>
> It sounds like you can reproduce this pretty easily.  Can you get it to
> lock up and do:
>
> # cat /sys/kernel/debug/ceph/*/osdmap
> # cat /sys/kernel/debug/ceph/*/osdc
> $ ceph status
>
> a bunch of times?  I have a hunch that the kernel client simply fails
> to request enough new osdmaps after the cluster topology changes under
> load.
>
> Thanks,
>
>                 Ilya
>