In the master branch, qb_rb_chunk_alloc() may fail because _rb_chunk_reclaim()
returns -EINVAL while the condition
(qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) still holds. Here is the
backtrace I got:

451                     while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
(gdb) p qb_rb_space_free(rb)
$5 = 408
(gdb) p len
$6 = 561

QB_RB_CHUNK_MARGIN apparently equals 12:
#define QB_RB_CHUNK_MARGIN (sizeof(uint32_t) * (QB_RB_CHUNK_HEADER_WORDS +\
                                                QB_RB_WORD_ALIGN +\
                                                QB_CACHE_LINE_WORDS))
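
To make the numbers concrete, here is a small standalone sketch of that check
using the values from the gdb session; the individual *_WORDS constants below
are only my guess, chosen so the margin comes out at the observed 12 bytes:

/* Sketch of the arithmetic from the gdb session above; the *_WORDS
 * values are assumptions picked only so the margin equals 12 bytes. */
#include <stdint.h>
#include <stdio.h>

#define QB_RB_CHUNK_HEADER_WORDS 2   /* assumed */
#define QB_RB_WORD_ALIGN         1   /* assumed */
#define QB_CACHE_LINE_WORDS      0   /* assumed */
#define QB_RB_CHUNK_MARGIN (sizeof(uint32_t) * (QB_RB_CHUNK_HEADER_WORDS +\
                                                QB_RB_WORD_ALIGN +\
                                                QB_CACHE_LINE_WORDS))

int main(void)
{
        size_t space_free = 408;   /* gdb: qb_rb_space_free(rb) */
        size_t len = 561;          /* gdb: len */

        printf("margin = %zu\n", QB_RB_CHUNK_MARGIN);        /* 12  */
        printf("needed = %zu\n", len + QB_RB_CHUNK_MARGIN);  /* 573 */
        /* 408 < 573, so the overwrite branch keeps calling
         * _rb_chunk_reclaim() until enough space is freed; on the
         * master branch a failed reclaim then makes
         * qb_rb_chunk_alloc() return NULL. */
        printf("must reclaim: %s\n",
               space_free < (len + QB_RB_CHUNK_MARGIN) ? "yes" : "no");
        return 0;
}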



qb_rb_chunk_alloc(struct qb_ringbuffer_s * rb, size_t len)
{
       .....
        /*
         * Reclaim data if we are over writing and we need space
         */
        if (rb->flags & QB_RB_FLAG_OVERWRITE) {
                while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
                        int rc = _rb_chunk_reclaim(rb);
                        if (rc != 0) {
                                errno = rc;
                                return NULL;
                        }
                }

So how should we control the value of 'len', and where does it come from, so
that the call to qb_rb_chunk_alloc() does not fail? I can reproduce this
problem by setting the NIC of one corosync node to an MTU of 5000 while the
others stay at 1500 everywhere. Is it related to the big packets being thrown
at the ringbuffer?
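
To illustrate what I mean about where 'len' comes from, here is a minimal
writer sketch (my own illustration, not corosync code; the ring name, size
and helper function are made up). The 'len' passed to qb_rb_chunk_alloc() is
simply the size of the message being written, so a jumbo frame arriving from
the 5000-MTU node produces a much larger chunk than the 1500-byte frames from
the other nodes:

/* Minimal writer sketch, illustration only.  Build with: gcc writer.c -lqb */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <qb/qbrb.h>

static int write_msg(qb_ringbuffer_t *rb, const void *msg, size_t msg_len)
{
        void *chunk = qb_rb_chunk_alloc(rb, msg_len);  /* len == message size */

        if (chunk == NULL) {
                /* On the master branch this is where a failed
                 * _rb_chunk_reclaim() surfaces (errno is set and NULL is
                 * returned); on 0.16.x the call could instead spin in the
                 * reclaim loop. */
                fprintf(stderr, "chunk_alloc(%zu) failed: %s\n",
                        msg_len, strerror(errno));
                return -1;
        }
        memcpy(chunk, msg, msg_len);
        return qb_rb_chunk_commit(rb, msg_len);
}

int main(void)
{
        char jumbo[5000] = { 0 };   /* stand-in for a 5000-byte packet */
        qb_ringbuffer_t *rb = qb_rb_open("mtu-test", 1 << 20,
                                         QB_RB_FLAG_OVERWRITE, 0);

        if (rb == NULL) {
                return 1;
        }
        if (write_msg(rb, jumbo, sizeof(jumbo)) < 0) {
                qb_rb_close(rb);
                return 1;
        }
        qb_rb_close(rb);
        return 0;
}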


Thanks Christine, I really appreciate your reply :)


On Tue, Apr 21, 2015 at 8:33 PM, Christine Caulfield <[email protected]>
wrote:

> On 21/04/15 12:37, Hui Xiang wrote:
> > Thanks Christine.
> >
> > One more question, in the broken environment, we found part of the
> > source code in libqb as below:
> > 1)
> > void *
> > qb_rb_chunk_alloc(struct qb_ringbuffer_s * rb, size_t len)
> > {
> >         uint32_t write_pt;
> >
> >         if (rb == NULL) {
> >                 errno = EINVAL;
> >                 return NULL;
> >         }
> >         /*
> >          * Reclaim data if we are over writing and we need space
> >          */
> >         if (rb->flags & QB_RB_FLAG_OVERWRITE) {
> >                 while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
> >                         _rb_chunk_reclaim(rb);
> >                 }
> >         } else {
> >                 if (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
> >                         errno = EAGAIN;
> >                         return NULL;
> >                 }
> >         }
> >
> > but in the master branch:
> > 2)
> >                 while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
> >                         int rc = _rb_chunk_reclaim(rb);
> >                         if (rc != 0) {
> >                                 errno = rc;
> >                                 return NULL;
> >                         }
> >                 }
> >
> >
> > is it possible that with code 1) we got stuck in the infinite loop
> > while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {...},
> > because when 'chunk_magic != QB_RB_CHUNK_MAGIC' the function
> > _rb_chunk_reclaim() just returns:
> > static void
> > _rb_chunk_reclaim(struct qb_ringbuffer_s * rb)
> > {
> >         uint32_t old_read_pt;
> >         uint32_t new_read_pt;
> >         uint32_t old_chunk_size;
> >         uint32_t chunk_magic;
> >
> >         old_read_pt = rb->shared_hdr->read_pt;
> >         chunk_magic = QB_RB_CHUNK_MAGIC_GET(rb, old_read_pt);
> >         if (chunk_magic != QB_RB_CHUNK_MAGIC) {
> >                 return;
> >         }
> >
> > and there is a commit that seems to fix it [1]. Do you know the
> > background of this commit? Does it look like it fixes this issue?
> >
> > Thanks again :)
>
>
> I don't know enough about the background to that fix. What you're saying
> sounds plausible but I can't be sure. There are quite a few stability
> fixes in libqb 0.17 so it could be that or one of the others!
>
> Chrissie
>
>
> > [1]
> >
> https://github.com/ClusterLabs/libqb/commit/a8852fc481e3aa3fce53bb9e3db79d3e7cbed0c1
> >
> >
> >
> > On Tue, Apr 21, 2015 at 5:55 PM, Christine Caulfield
> > <[email protected] <mailto:[email protected]>> wrote:
> >
> >     Hiya,
> >
> >     It's hard to be sure without more information, sadly - if the
> backtrace
> >     looks similar to the one you mention then upgrading libqb to 0.17
> should
> >     help.
> >
> >     Chrissie
> >
> >     On 21/04/15 07:12, Hui Xiang wrote:
> >     > Thanks Christine, sorry for responding late.
> >     >
> >     > I got this problem again, and corosync-blackbox just hangs there
> >     > with no output. Here is some other debug information for you guys.
> >     >
> >     > The backtrace and perf.data are very similar to those in link [1],
> >     > but we don't know the root cause. Restarting corosync is one
> >     > solution, but after a while it breaks again, so we'd like to find
> >     > out what's really going on there.
> >     >
> >     > Thanks for your efforts, very appreciated : )
> >     >
> >     > [1] http://www.spinics.net/lists/corosync/msg03445.html
> >     >
> >     >
> >     > On Mon, Feb 9, 2015 at 4:38 PM, Christine Caulfield
> >     > <[email protected]> wrote:
> >     >
> >     >     On 09/02/15 01:59, Hui Xiang wrote:
> >     >     > Hi guys,
> >     >     >
> >     >     >   I am having an issue with corosync where it consumes 100%
> >     >     > cpu and hangs on the command corosync-quorumtool -l; Recv-Q is
> >     >     > very high in the meantime inside the lxc container.
> >     >     >  corosync version : 2.3.3
> >     >     >
> >     >     >  transport : unicast
> >     >     >
> >     >     >  After setting up 3 keystone nodes with corosync/pacemaker,
> >     >     > split brain happened; on one of the keystone nodes we found the
> >     >     > cpu is 100% used by corosync.
> >     >     >
> >     >
> >     >
> >     >     It looks like it might be a problem I saw while doing some
> >     development
> >     >     on corosync, if it gets a SEGV, there's a signal handler that
> >     catches it
> >     >     and relays it back to libqb via a pipe, causing another SEGV
> and
> >     >     corosync is then just spinning on the pipe for ever. The cause
> >     I saw is
> >     >     not likely to be the same as yours (it was my coding at the
> >     time ;-) but
> >     >     it does sound like a similar effect. The only way round it is
> >     to kill
> >     >     corosync and restart it. There might be something in the
> >     >     corosync-blackbox to indicate what went wrong if that has been
> >     saved. If
> >     >     you have that then please post it here so we can have a look.
> >     >
> >     >     man corosync-blackbox
> >     >
> >     >     Chrissie
> >     >
> >     >     >
> >     >     >
> >     >     > Tasks: 42 total, 2 running, 40 sleeping, 0 stopped, 0 zombie
> >     >     > %Cpu(s):100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi,
> >     0.0 si,
> >     >     0.0 st
> >     >     > KiB Mem: 1017896 total, 932296 used, 85600 free, 19148
> buffers
> >     >     > KiB Swap: 1770492 total, 5572 used, 1764920 free. 409312
> >     cached Mem
> >     >     >
> >     >     >   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >     >     > 18637 root 20 0 704252 199272 34016 R 99.9 19.6 44:40.43
> >     corosync
> >     >     >
> >     >     > From netstat output, one interesting finding is the Recv-Q
> size
> >     >     has a value
> >     >     > 320256, which is higher than normal.
> >     >     > And after simply doing pkill -9 corosync and restarting
> >     >     > corosync/pacemaker, the whole cluster is back to normal.
> >     >     >
> >     >     > Active Internet connections (only servers)
> >     >     > Proto Recv-Q Send-Q Local Address Foreign Address State
> >     >     PID/Program name
> >     >     > udp  320256  0  192.168.100.67:5434  0.0.0.0:*  18637/corosync
> >     >     >
> >     >     > Udp:
> >     >     >     539832 packets received
> >     >     >     619 packets to unknown port received.
> >     >     >     407249 packet receive errors
> >     >     >     1007262 packets sent
> >     >     >     RcvbufErrors: 69940
> >     >     >
> >     >     >
> >     >     >
> >     >     >   So I am asking if there is any bug/issue related to corosync
> >     >     > that may cause it to receive packets slowly from the socket and
> >     >     > hang up for some reason?
> >     >     >
> >     >     >   Thanks a lot, looking forward to your response.
> >     >     >
> >     >     >
> >     >     > Best Regards.
> >     >     >
> >     >     > Hui.
> >     >     >
> >     >     >
> >     >     >
> >     >
> >     >
> >     >
> >
> >
>
>
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss
