How long does the CPU usage last when it does happen? Or does it stay at 100% until restart?

On 7 Nov 2014 10:21, "Emmanuel TAUREL" <tau...@esrf.fr> wrote:
> Hello all,
>
> We are using ZMQ (still release 3.2.4), mainly on Linux boxes, with the
> PUB/SUB model. Our system runs 24/7. From time to time, some of our PUB
> processes start eating 100 % of one CPU core.
> We don't know yet what exactly triggers this phenomenon, so we are not
> able to reproduce it. It does not happen often (once every 3 to 6
> months!). Nevertheless, we did some analysis the last time it happened.
>
> Here is the result of "strace" on the PUB process:
>
> 2889 10:53:18.021013 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
> 2889 10:53:18.021041 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
> 2889 10:53:18.021068 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
> 2889 10:53:18.021096 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
> 2889 10:53:18.021123 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
> 2889 10:53:18.021151 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
> 2889 10:53:18.021178 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
> 2889 10:53:18.021206 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
> 2889 10:53:18.021233 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
> 2889 10:53:18.021260 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
> 2889 10:53:18.021288 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
>
> From the number of epoll_wait()/epoll_ctl() pairs and their period
> (roughly two pairs per 100 µs), it is clear that this is the thread
> eating the CPU.
> From the flags returned by epoll_wait() (EPOLLERR|EPOLLHUP), it seems
> that something wrong has happened on one of the file descriptors
> (number 49, judging by the epoll_ctl() argument). This is confirmed by
> the result of "lsof" on the same PUB process:
>
> Starter 2863 dserver 49u sock 0,6 0t0 7902 can't identify protocol
>
> If I attach gdb to the PUB process and request this thread's stack
> trace, I get:
>
> #0  0x00007fb65d3205ca in epoll_ctl () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007fb65e23c298 in zmq::epoll_t::reset_pollin (this=<optimized out>, handle_=<optimized out>) at epoll.cpp:101
> #2  0x00007fb65e253da1 in zmq::stream_engine_t::in_event (this=0x7fb6509d8c10) at stream_engine.cpp:216
> #3  0x00007fb65e23c46b in zmq::epoll_t::loop (this=0x7fb6611c5b70) at epoll.cpp:154
> #4  0x00007fb65e257de6 in thread_routine (arg_=0x7fb6611c5be0) at thread.cpp:83
> #5  0x00007fb65de0d0a4 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> #6  0x00007fb65d32004d in clone () from /lib/x86_64-linux-gnu/libc.so.6
>
> Even if something wrong has happened on the socket associated with fd
> 49, I think ZMQ should not enter a "crazy" loop.
> Is this a known issue?
> Is there something we could do to prevent it from happening again?
>
> Thanks in advance for your help,
>
> Emmanuel
>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev@lists.zeromq.org
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev