we're at 1.6.0. not sure if the agent machine was rebooted, but the symptoms do look suspiciously similar to MESOS-9501. we're due for an upgrade anyway, so we'll probably go that route. thanks!
On Fri, Mar 8, 2019 at 4:30 PM Gilbert Song <gilb...@apache.org> wrote:

> Hi Eric,
>
> What is your Mesos version?
>
> Did you reboot the agent machine before the task got stuck?
> If yes, it is probably
> https://issues.apache.org/jira/browse/MESOS-9501
>
> Did you enable health checks for that task?
> It may increase the chance of a potential FD leak:
> https://issues.apache.org/jira/browse/MESOS-9502
>
> -Gilbert
>
> On Fri, Mar 8, 2019 at 3:03 PM Eric Chung <ech...@uber.com.invalid> wrote:
>
> > Hello devs,
> >
> > We recently ran into a situation where a task's executor was killed due to
> > registration timeout, but neither the executor nor the task was properly
> > killed, and the task has been stuck in queued_tasks for days.
> >
> > The relevant log:
> >
> > I0305 08:43:59.069857 5215 slave.cpp:6803] Terminating executor '<executor_id>' of framework <framework_id> because it did not register within 15mins
> > I0305 09:16:28.266021 5200 slave.cpp:3644] Asked to kill task <task_id> of framework <framework_id>
> > W0305 09:16:28.266063 5200 slave.cpp:3816] Ignoring kill task <task_id> because the executor '<executor_id>' of framework <framework_id> is terminating
> >
> > where the following just keeps repeating:
> >
> > I0305 09:16:28.266021 5200 slave.cpp:3644] Asked to kill task <task_id> of framework <framework_id>
> > W0305 09:16:28.266063 5200 slave.cpp:3816] Ignoring kill task <task_id> because the executor '<executor_id>' of framework <framework_id> is terminating
> >
> > The agent state indicates that it doesn't have any active tasks but does
> > have quite a few queued tasks.
> >
> > Does anyone have any insight on why this might be happening?
> >
> > Thanks,
> > Eric
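
For anyone trying to confirm the same symptom (no active tasks, but tasks sitting in queued_tasks), here is a rough sketch of how the agent state could be scanned. It assumes the agent's state JSON has been saved locally and has the usual `frameworks` -> `executors` layout with `tasks` and `queued_tasks` lists per executor; the function name `stuck_executors` and the sample IDs are made up for illustration, and this is not taken from the thread.

```python
import json

def stuck_executors(state):
    """Return (framework_id, executor_id, queued_count) for every executor
    that has queued tasks but no launched tasks.

    Assumes `state` is the parsed agent state JSON with a
    frameworks -> executors -> tasks / queued_tasks layout.
    """
    stuck = []
    for framework in state.get("frameworks", []):
        for executor in framework.get("executors", []):
            queued = executor.get("queued_tasks", [])
            launched = executor.get("tasks", [])
            if queued and not launched:
                stuck.append(
                    (framework.get("id"), executor.get("id"), len(queued))
                )
    return stuck

# Hypothetical sample standing in for a saved copy of the agent state.
SAMPLE_STATE = json.loads("""
{
  "frameworks": [
    {
      "id": "fw-1",
      "executors": [
        {"id": "exec-a", "tasks": [], "queued_tasks": [{"id": "t1"}, {"id": "t2"}]},
        {"id": "exec-b", "tasks": [{"id": "t3"}], "queued_tasks": []}
      ]
    }
  ]
}
""")

for framework_id, executor_id, count in stuck_executors(SAMPLE_STATE):
    print(f"{framework_id} / {executor_id}: {count} queued, 0 launched")
```

To run it against a real agent, replace `SAMPLE_STATE` with `json.load(...)` on a saved copy of the agent's state output. Executors that show up here and also appear in the "Ignoring kill task ... is terminating" log lines would be the ones to look at.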