Hi Gregory,

On Thu, Dec 8, 2016 at 12:10 AM, Gregory Farnum <gfar...@redhat.com> wrote:
> In slightly more detail: you are clearly seeing a problem with the
> messenger, as indicated by the sock_recvmsg at the top of the CPU
> usage list. We've seen this elsewhere very rarely, which is why
> there's already a backport queued up which we didn't block on.
> The 15-minute period you're seeing is the default timeout we set on
> sockets before we start marking them closed if there's no activity.
>
> We're not quite sure why it's causing trouble now, although we have
> one or two patches we are speculating about and looking into.
>
> This didn't turn up in testing because as best we can tell it's only a
> situation you can expect to encounter when you have idle TCP
> connections between systems (or in fairly artificial failed
> networking).

For the OSDs sitting at 100% CPU, strace does indeed show a lot of EAGAIN
returns on some of the sockets.
I'll try to get some packet captures if I can.
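
For anyone following along: the strace output looks like what you would
get from a receive call being retried on an idle non-blocking socket
without waiting for readiness in between. Below is a minimal Python sketch
of that general pattern. It is purely illustrative and is not the Ceph
messenger code; the helper name count_eagain and the use of a local
socketpair are my own assumptions for the demo.

```python
import socket


def count_eagain(sock, attempts):
    """Call recv() `attempts` times on a non-blocking socket and count
    how often the kernel reports that no data is available (EAGAIN)."""
    eagain = 0
    for _ in range(attempts):
        try:
            sock.recv(4096)
        except BlockingIOError:
            # Python raises BlockingIOError when recv() fails with
            # EAGAIN/EWOULDBLOCK, i.e. the socket has nothing to read.
            eagain += 1
    return eagain


# A connected but idle pair: nothing is ever written on `a`,
# so every recv() on `b` returns immediately with EAGAIN.
a, b = socket.socketpair()
b.setblocking(False)
idle_hits = count_eagain(b, 1000)
a.close()
b.close()

print(idle_hits, "of 1000 recv() calls hit EAGAIN")
```

Each of those calls returns straight away instead of sleeping, so a loop
like this with no poll/epoll wait in between would keep a core busy for as
long as the connection stays idle.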

Kind regards,

Ruben