How long does the CPU usage last when it does happen? Or does it stay at
100% until restart?
On 7 Nov 2014 10:21, "Emmanuel TAUREL" <tau...@esrf.fr> wrote:

> Hello all,
>
> We are using ZMQ (still release 3.2.4) mainly on Linux boxes. We are
> using the PUB/SUB model.
> Our system runs 24/7. From time to time, we have some of our PUB
> processes eating 100 % of one core of our CPU's.
> We don't know yet what exactly triggers this phenomenon, so we are not
> able to reproduce it. It does not happen often (once every 3 to 6
> months!).
> Nevertheless, we did some analysis the last time it happened.
>
> Here is the result of running "strace" on the PUB process:
>
> 2889  10:53:18.021013 epoll_wait(19, {{EPOLLERR|EPOLLHUP,
> {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
> 2889  10:53:18.021041 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0,
> {u32=335547808, u64=140097873776032}}) = 0
> 2889  10:53:18.021068 epoll_wait(19, {{EPOLLERR|EPOLLHUP,
> {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
> 2889  10:53:18.021096 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0,
> {u32=335547808, u64=140097873776032}}) = 0
> 2889  10:53:18.021123 epoll_wait(19, {{EPOLLERR|EPOLLHUP,
> {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
> 2889  10:53:18.021151 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0,
> {u32=335547808, u64=140097873776032}}) = 0
> 2889  10:53:18.021178 epoll_wait(19, {{EPOLLERR|EPOLLHUP,
> {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
> 2889  10:53:18.021206 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0,
> {u32=335547808, u64=140097873776032}}) = 0
> 2889  10:53:18.021233 epoll_wait(19, {{EPOLLERR|EPOLLHUP,
> {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
> 2889  10:53:18.021260 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0,
> {u32=335547808, u64=140097873776032}}) = 0
> 2889  10:53:18.021288 epoll_wait(19, {{EPOLLERR|EPOLLHUP,
> {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
>
>  From the number of epoll_wait()/epoll_ctl() pairs and their rate (two
> pairs per 100 us), it is clear that this thread is the one eating the
> CPU.
> From the flags returned by epoll_wait() (EPOLLERR|EPOLLHUP), it seems
> that something has gone wrong on one of the file descriptors (number
> 49, judging by the epoll_ctl() argument). This is confirmed by the
> result of "lsof" on the same PUB process:
>
> Starter 2863 dserver   49u  sock                0,6      0t0 7902 can't
> identify protocol
>
> If I attach to the PUB process with gdb and request this thread's
> stack trace, I get:
>
> #0  0x00007fb65d3205ca in epoll_ctl () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007fb65e23c298 in zmq::epoll_t::reset_pollin (this=<optimized out>,
>      handle_=<optimized out>) at epoll.cpp:101
> #2  0x00007fb65e253da1 in zmq::stream_engine_t::in_event
> (this=0x7fb6509d8c10)
>      at stream_engine.cpp:216
> #3  0x00007fb65e23c46b in zmq::epoll_t::loop (this=0x7fb6611c5b70)
>      at epoll.cpp:154
> #4  0x00007fb65e257de6 in thread_routine (arg_=0x7fb6611c5be0) at
> thread.cpp:83
> #5  0x00007fb65de0d0a4 in start_thread ()
>     from /lib/x86_64-linux-gnu/libpthread.so.0
> #6  0x00007fb65d32004d in clone () from /lib/x86_64-linux-gnu/libc.so.6
>
> Even if something wrong has happened on the socket associated with fd
> 49, I think ZMQ should not enter such a "crazy" busy loop.
> Is this a known issue?
> Is there something we can do to prevent it from happening again?
>
> Thanks in advance for your help,
>
> Emmanuel
>
>
>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev@lists.zeromq.org
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>