Here are some relevant logs. The Aurora scheduler logs show the task going
from:
INIT
-> PENDING
-> ASSIGNED
-> STARTING
-> RUNNING (for a long time)
-> FAILED due to a health check error: OSError: Resource temporarily
unavailable (I think this refers to running out of PID space; see the
thermos logs below)
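For reference, "Resource temporarily unavailable" is the Linux error string for EAGAIN, which is what fork() (and hence process/thread creation) returns when the PID space or the per-user process limit is exhausted. A quick Python check (just a sketch to confirm the errno mapping, not taken from the logs):

```python
import errno
import os

# "Resource temporarily unavailable" is the strerror text for EAGAIN on Linux.
# fork() fails with EAGAIN when the system's PID space or the caller's
# RLIMIT_NPROC process limit is exhausted, which matches the OSError above.
print(errno.errorcode[errno.EAGAIN])  # EAGAIN
print(os.strerror(errno.EAGAIN))
```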
Can you share the agent and executor logs of an example orphaned executor?
That would help us diagnose the issue.
On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi wrote:
> Folks,
> Often I see some orphaned executors in my cluster. These are cases where
> the framework was informed of task loss, so has forgotten about them as
> expected, but the container (Docker) is still around. AFAIK, the Mesos
> agent is the only entity that has knowledge of these containers. How do I
> ensure