Jie Yu created MESOS-9307:
-----------------------------
Summary: Libprocess should have a way to detect stuck actor.
Key: MESOS-9307
URL: https://issues.apache.org/jira/browse/MESOS-9307
Project: Mesos
Issue Type: Improvement
Components: libprocess
Reporter: Jie Yu
We spent two days on a bug that turned out to be an infinite loop in an actor,
which blocked all other events from being processed by that actor.
Currently, the only way to find a stuck actor is to use gdb. We should
think about a way to print error logs when an actor has been stuck for longer
than a threshold.
For instance, the Linux kernel prints a warning to the kernel log if a task is
stuck for more than 120 seconds. Something like this would be extremely helpful.
Another option is to expose some metrics around this.
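To make the proposal concrete, here is a minimal sketch of a threshold-based
check. It is not libprocess code: ActorState, StuckActorDetector, begin(), end()
and scan() are hypothetical names invented for illustration, and the sketch
assumes the event loop can record a timestamp before and after each event it
dispatches.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Sentinel meaning "this actor is not processing an event right now".
constexpr std::int64_t kIdle = std::numeric_limits<std::int64_t>::min();

// Hypothetical per-actor bookkeeping; not a libprocess type.
struct ActorState {
  // Start time of the event in progress (ns since the clock epoch), or kIdle.
  std::atomic<std::int64_t> eventStartNs{kIdle};
};

// Hypothetical detector: tracks when each actor started its current event.
class StuckActorDetector {
public:
  explicit StuckActorDetector(std::chrono::nanoseconds threshold)
    : threshold_(threshold) {
    assert(threshold.count() > 0);
  }

  // Registers an actor. The returned pointer stays valid for the lifetime of
  // the detector because unordered_map never relocates its elements.
  ActorState* add(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    return &actors_[name];
  }

  // Assumed hook: called by the event loop right before an event is handled.
  static void begin(ActorState* actor, Clock::time_point now) {
    actor->eventStartNs.store(toNs(now));
  }

  // Assumed hook: called by the event loop right after the handler returns.
  static void end(ActorState* actor) {
    actor->eventStartNs.store(kIdle);
  }

  // Returns one warning line for every actor whose current event has been
  // running for longer than the threshold as of `now`.
  std::vector<std::string> scan(Clock::time_point now) {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<std::string> warnings;
    for (const auto& entry : actors_) {
      const std::int64_t start = entry.second.eventStartNs.load();
      if (start == kIdle) {
        continue;  // Idle actors are never reported.
      }
      const std::chrono::nanoseconds elapsed(toNs(now) - start);
      if (elapsed > threshold_) {
        const auto secs =
          std::chrono::duration_cast<std::chrono::seconds>(elapsed).count();
        warnings.push_back(
            "actor '" + entry.first + "' has been processing one event for " +
            std::to_string(secs) + "s");
      }
    }
    return warnings;
  }

private:
  static std::int64_t toNs(Clock::time_point t) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        t.time_since_epoch()).count();
  }

  const std::chrono::nanoseconds threshold_;
  std::mutex mutex_;
  std::unordered_map<std::string, ActorState> actors_;
};
```

A watchdog thread could call scan(Clock::now()) every few seconds and write the
returned lines to the error log. The number of lines returned could also back a
gauge, which would cover the metrics idea. Passing the time in explicitly keeps
the check easy to test without sleeping.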