Hi Eric,

What is your Mesos version?

Did you reboot the agent machine before the task got stuck? If so, this is
probably https://issues.apache.org/jira/browse/MESOS-9501

Did you enable health checks for that task? They can increase the chance of a
potential FD leak: https://issues.apache.org/jira/browse/MESOS-9502

-Gilbert

On Fri, Mar 8, 2019 at 3:03 PM Eric Chung <ech...@uber.com.invalid> wrote:
> Hello devs,
>
> We recently ran into a situation where a task's executor was killed due to
> registration timeout, but neither the executor nor the task was properly
> killed, and the task has been stuck in queued_tasks for days.
>
> The relevant log:
>
> I0305 08:43:59.069857 5215 slave.cpp:6803] Terminating executor
> '<executor_id>' of framework <framework_id> because it did not
> register within 15mins
> I0305 09:16:28.266021 5200 slave.cpp:3644] Asked to kill task
> <task_id> of framework <framework_id>
> W0305 09:16:28.266063 5200 slave.cpp:3816] Ignoring kill task
> <task_id> because the executor '<executor_id>' of framework
> <framework_id> is terminating
>
> where the following just keeps repeating:
>
> I0305 09:16:28.266021 5200 slave.cpp:3644] Asked to kill task
> <task_id> of framework <framework_id>
> W0305 09:16:28.266063 5200 slave.cpp:3816] Ignoring kill task
> <task_id> because the executor '<executor_id>' of framework
> <framework_id> is terminating
>
> The agent state indicates that it doesn't have any active tasks, but quite
> a few queued tasks.
>
> Does anyone have any insight on why this might be happening?
>
> Thanks,
> Eric
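The "no active tasks but queued tasks" symptom Eric describes can be checked programmatically from the agent's `/state` HTTP endpoint (port 5051 by default). Below is a minimal sketch that scans a `/state` response for executors with queued but no active tasks; the embedded `state` dict is an illustrative fragment (field names assumed from the agent state output, with most fields omitted), not real agent output:

```python
# Illustrative fragment of an agent /state response; the real response
# contains many more fields. The <...> placeholders stand in for real IDs.
state = {
    "frameworks": [
        {
            "id": "<framework_id>",
            "executors": [
                {
                    "id": "<executor_id>",
                    "tasks": [],  # no active tasks on this executor
                    "queued_tasks": [
                        {"id": "<task_id>", "state": "TASK_STAGING"},
                    ],
                },
            ],
        },
    ],
}

def stuck_queued_tasks(state):
    """Return (framework_id, executor_id, task_id) for every queued task
    sitting on an executor that currently has no active tasks."""
    stuck = []
    for fw in state.get("frameworks", []):
        for ex in fw.get("executors", []):
            if not ex.get("tasks") and ex.get("queued_tasks"):
                for task in ex["queued_tasks"]:
                    stuck.append((fw["id"], ex["id"], task["id"]))
    return stuck

print(stuck_queued_tasks(state))
# -> [('<framework_id>', '<executor_id>', '<task_id>')]
```

In a live setup one would fetch the JSON from `http://<agent-host>:5051/state` instead of the inline fragment; a non-empty result that persists across checks matches the stuck state shown in the logs above.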