Jie Yu created MESOS-9307:
-----------------------------

             Summary: Libprocess should have a way to detect stuck actor.
                 Key: MESOS-9307
                 URL: https://issues.apache.org/jira/browse/MESOS-9307
             Project: Mesos
          Issue Type: Improvement
          Components: libprocess
            Reporter: Jie Yu


We spent two days on a bug that turned out to be an infinite loop in an actor, 
which blocked all other events from being processed by that actor.

Currently, the only way to find a stuck actor is to attach gdb. We should 
think about a way to print error logs when an actor has been stuck for longer 
than a threshold.

For instance, the Linux kernel prints a warning to the kernel log if a task is 
stuck for more than 120 seconds. Something like this would be extremely helpful.
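
To make the idea concrete, here is a rough sketch of the bookkeeping such a 
check might need. It is only an illustration: StuckActorDetector and its 
methods are hypothetical names, not existing libprocess APIs, and it assumes 
the event loop can call a hook before and after an actor handles each event.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical helper, NOT part of libprocess. It records when each actor
// started handling its current event and reports actors that have exceeded
// a threshold.
class StuckActorDetector {
public:
  using Clock = std::chrono::steady_clock;

  explicit StuckActorDetector(Clock::duration threshold)
    : threshold_(threshold) {
    assert(threshold > Clock::duration::zero());
  }

  // Assumed hook: called just before an actor starts handling an event.
  void eventStarted(const std::string& actor, Clock::time_point now) {
    std::lock_guard<std::mutex> lock(mutex_);
    running_[actor] = now;
  }

  // Assumed hook: called right after the actor finishes handling the event.
  void eventFinished(const std::string& actor) {
    std::lock_guard<std::mutex> lock(mutex_);
    running_.erase(actor);
  }

  // Returns the actors whose current event has been running for longer
  // than the threshold as of `now`.
  std::vector<std::string> stuck(Clock::time_point now) const {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<std::string> result;
    for (const auto& entry : running_) {
      if (now - entry.second > threshold_) {
        result.push_back(entry.first);
      }
    }
    return result;
  }

private:
  const Clock::duration threshold_;
  mutable std::mutex mutex_;
  std::map<std::string, Clock::time_point> running_;
};
```

A separate watchdog thread could call stuck(Clock::now()) periodically and 
log an error for every actor it returns. The watchdog has to run outside the 
worker threads, since a worker spinning in an infinite loop cannot report on 
itself.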

Another way is to expose some metrics around this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
