[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025625#comment-16025625 ] Benjamin Mahler commented on MESOS-5332:

In order to let users who hit this situation upgrade safely (without destroying every executor whose connection has been idle for more than 5 days), we will introduce an optional retry of the reconnect message via MESOS-7569: https://reviews.apache.org/r/59584/

This allows executors to be preserved when upgrading an agent that does not yet have the relink fix (MESOS-7057). Longer term, TCP keepalives or heartbeating will be put in place so that the connections no longer time out in conntrack (a small sketch of the keepalive option follows below, after the quoted issue description).

> TASK_LOST on slave restart potentially due to executor race condition
> ----------------------------------------------------------------------
>
>                 Key: MESOS-5332
>                 URL: https://issues.apache.org/jira/browse/MESOS-5332
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, libprocess
>    Affects Versions: 0.26.0
>        Environment: Mesos 0.26, Aurora 0.13
>           Reporter: Stephan Erb
>           Assignee: Anand Mazumdar
>        Attachments: executor-logs.tar.gz, executor-stderr.log, executor-stderrV2.log, mesos-slave.log
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-28854-0-6a88d62e-6564-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-23839-0-1d2cd0e6-6994-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent during startup. When stopping the agent using supervisorctl stop, the executors keep running (verified via {{ps aux}}). They are only killed once the restarted agent asks them to re-register.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for the first restart.
> * It is much more likely to occur if the agent binary has been running for a longer period of time (> 7 days).
> * It tends to be more likely if the host has many cores (30-40) and thus many libprocess workers.
> Mesos correctly sticks to the 2 second wait time before killing un-reregistered executors. The failed executors receive the reregistration request, but it seems like they fail to send a reply.
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677 2389 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374 2408 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
> I0505 08:42:05.481374 2395 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
> {code}
> All tasks ending up in LOST have an output similar to the one posted above, i.e. log messages appear in the wrong order.
> Anyone an idea what might be going on here?
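Regarding the keepalive plan mentioned in the comment above: a minimal standalone sketch (not libprocess code) of what enabling TCP keepalives on a Linux socket could look like. The timing values are illustrative assumptions only; the point is that probes would flow long before conntrack's multi-day idle timeout drops the connection state.

{code}
#include <cerrno>
#include <cstring>
#include <iostream>

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Enable TCP keepalives on an already created socket.
// The idle/interval/count values below are illustrative, not a recommendation.
int enableKeepalive(int fd)
{
  int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
    return -1;
  }

  int idle = 600;     // Seconds of idleness before the first probe.
  int interval = 60;  // Seconds between unacknowledged probes.
  int count = 5;      // Failed probes before the connection is declared dead.

  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) != 0) {
    return -1;
  }

  return 0;
}

int main()
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);

  if (fd < 0 || enableKeepalive(fd) != 0) {
    std::cerr << "Failed to enable keepalive: " << strerror(errno) << std::endl;
    return 1;
  }

  std::cout << "Keepalive enabled" << std::endl;
  close(fd);
  return 0;
}
{code}

With probes on the order of minutes, either endpoint notices a silently dropped connection well before a 5-day conntrack timeout, and the probe traffic itself keeps the conntrack entry fresh.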
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280897#comment-15280897 ] Anand Mazumdar commented on MESOS-5332:

[~StephanErb] That took some catching! Since we have identified the root cause and filed corresponding tickets for the action items on our end (MESOS-5361/MESOS-5364, also linked to this JIRA), I am resolving this issue. Feel free to re-open if you have any further queries/concerns.
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279791#comment-15279791 ] Stephan Erb commented on MESOS-5332:

I was able to assemble a reproducing example (using Aurora master and Mesos 0.27.2): https://gist.github.com/StephanErb/5798b0d87c11473fb0ec147272ea0288

Summary of events:
* An iptables firewall terminates idle TCP connections after the connection-tracking default of 5 days (reduced to 60 seconds in the example above).
* Mesos does not detect the broken, half-open TCP connections that result when connections are terminated by iptables (see the sketch below).
* Mesos tries to use the old, broken TCP connection when answering the agent's reconnect request. The message therefore never makes it to the agent.
* The agent ends up killing the executor because it does not receive a reply to its reconnect request.

I'd conclude that there are several areas that need improvement:
* *Firewalling*: We have to fix our in-house iptables firewall scripts so that they do not apply connection tracking to local connections.
* *KeepAlive*: Mesos has to enable TCP keepalives in libprocess. As detailed here (http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/) this has two advantages:
** Detection of dead peers after a hard crash, an unplugged network cable, or a firewall silently dropping a connection.
** Prevention of disconnection due to network inactivity.
* *Unlink*: By adding an {{unlink}} function, libprocess could handle exit events more gracefully and make sure it always creates a new connection when talking to a new process.
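To illustrate why the half-open case is easy to miss, here is a small standalone POSIX sketch (not Mesos code; {{looksConnected}} is a made-up helper) of a typical "is this cached connection still usable?" probe. It catches a peer that closed cleanly, but a connection whose conntrack state was silently discarded by a firewall still looks healthy locally, because nothing ever arrives to say otherwise, until the next real write fails.

{code}
#include <cerrno>
#include <iostream>

#include <sys/socket.h>
#include <unistd.h>

// Non-blocking peek at the socket: detects an orderly shutdown or a hard
// error, but a half-open connection (peer/firewall state silently gone)
// still reports "looks fine".
bool looksConnected(int fd)
{
  char byte;
  ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);

  if (n == 0) {
    return false;  // Peer closed the connection cleanly (FIN received).
  }
  if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
    return false;  // Hard error, e.g. ECONNRESET.
  }
  return true;     // Data pending, or simply no news.
}

int main()
{
  int fds[2];
  socketpair(AF_UNIX, SOCK_STREAM, 0, fds);

  std::cout << std::boolalpha
            << "before close: " << looksConnected(fds[0]) << std::endl;  // true

  close(fds[1]);

  std::cout << "after close:  " << looksConnected(fds[0]) << std::endl;  // false

  close(fds[0]);
  return 0;
}
{code}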
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279009#comment-15279009 ] Anand Mazumdar commented on MESOS-5332:

Noting things down here based on our offline discussion on IRC, for posterity.

>>> Due to the nature of TCP, the faulty executors are only learning on this write attempt that their persistent connections have become stale, i.e. they generate the ExitedEvent only now.

This is not entirely true. Looking at the Linux kernel source, the kernel sends a RST or a FIN upon process exit, depending on whether the {{recv}} queue is empty or not: http://lxr.free-electrons.com/source/net/ipv4/tcp.c#L2019 . Hence, I don't think explicit TCP keepalives would help us here. They can help us, though, for master <-> scheduler and master <-> agent communication.

It is still not clear to me why the FIN/RST from the kernel did not reach the executor. IIRC, you said that your agent had iptables rules set up. Could they have contributed to this?
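For reference, a standalone POSIX sketch (not Mesos code) of how the reader side observes the two cases: an orderly close surfaces as {{recv()}} returning 0 (FIN), while an abortive close surfaces as {{ECONNRESET}} (RST). Here the RST is forced with {{SO_LINGER}}; error handling is trimmed for brevity.

{code}
#include <cerrno>
#include <cstring>
#include <iostream>

#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// What a blocking recv() tells us about how the peer went away.
const char* classifyPeerClose(int fd)
{
  char byte;
  ssize_t n = recv(fd, &byte, 1, 0);

  if (n > 0)  return "data available";
  if (n == 0) return "orderly shutdown (peer sent FIN)";
  if (errno == ECONNRESET) return "connection reset (peer sent RST)";
  return strerror(errno);
}

int main()
{
  // Loopback listener on an ephemeral port.
  int lfd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;
  bind(lfd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(lfd, 1);

  socklen_t len = sizeof(addr);
  getsockname(lfd, reinterpret_cast<sockaddr*>(&addr), &len);

  int cfd = socket(AF_INET, SOCK_STREAM, 0);
  connect(cfd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  int sfd = accept(lfd, nullptr, nullptr);

  // SO_LINGER with a zero timeout makes close() abort the connection,
  // i.e. send a RST instead of a FIN.
  linger lng = {1, 0};
  setsockopt(sfd, SOL_SOCKET, SO_LINGER, &lng, sizeof(lng));
  close(sfd);

  std::cout << classifyPeerClose(cfd) << std::endl;  // "connection reset (peer sent RST)"

  close(cfd);
  close(lfd);
  return 0;
}
{code}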
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275412#comment-15275412 ] Anand Mazumdar commented on MESOS-5332:

It doesn't. {{send}} provides at-most-once semantics: messages can be dropped at any point in time. The responsibility to retry is left to the business logic of the client/library invoking the {{send}} operation. We don't explicitly retry here in the {{ExecutorDriver}} implementation since the communication happens on the same host.
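As an aside, here is a hypothetical sketch (not the libprocess or {{ExecutorDriver}} API; the names and backoff policy are made up for illustration) of the kind of retry logic a caller has to layer on top of an at-most-once send:

{code}
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Wrap a fire-and-forget send in a bounded retry loop. 'trySend' is assumed
// to return true once the message was acknowledged by the receiver.
bool sendWithRetry(
    const std::function<bool()>& trySend,
    int maxAttempts,
    std::chrono::milliseconds backoff)
{
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    if (trySend()) {
      return true;  // Acknowledged.
    }
    std::cerr << "Attempt " << attempt << " not acknowledged, retrying" << std::endl;
    std::this_thread::sleep_for(backoff);
    backoff *= 2;  // Exponential backoff between attempts.
  }
  return false;  // Give up; the caller decides what to do next.
}

int main()
{
  int calls = 0;

  // Simulated transport that drops the first two messages.
  auto flaky = [&calls]() { return ++calls >= 3; };

  bool delivered = sendWithRetry(flaky, 5, std::chrono::milliseconds(100));
  std::cout << (delivered ? "delivered" : "gave up") << std::endl;
  return 0;
}
{code}

The optional retry of the reconnect message mentioned at the top of this thread (MESOS-7569) follows the same general idea, driven from the agent side.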
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275326#comment-15275326 ] Stephan Erb commented on MESOS-5332:

My analysis of the events is slightly different and leads to a different conclusion:
1. Executor and agent are running for several days. No events are sent from executor to agent during that time. This leads to some persistent connections from executors to the agent silently becoming stale.
2. The agent is restarted. Executors with working TCP connections detect this; those with faulty ones, however, don't.
3. The agent process sends a {{ReconnectExecutorMessage}}.
4. The executors try to send an {{ExecutorReregisterMessage}} as a reply. Due to the nature of TCP, the faulty executors are only learning on this write attempt that their persistent connections have become stale, i.e. they generate the {{ExitedEvent}} only now.

Proposed fix: We have to discover that the persistent connections have become stale before we actually need them. We can achieve this by periodically sending data over the socket; TCP will then discover that the connection has gone bad. There seem to be two approaches here:
a) We enable layer-4 connection health checking by using TCP keepalives (via {{SO_KEEPALIVE}}) on all sockets.
b) We make use of layer-7 connection health checking by periodically sending a no-op / whitespace over the persistent HTTP connection (see the sketch below).

FWIW, this recent SIGPIPE commit might have already improved closed-socket detection: https://github.com/apache/mesos/commit/b0dcf4183a9125d0346688fa4f413658508aae68
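A minimal sketch of option b), assuming a transport that exposes some write primitive for a no-op message; the function names here are hypothetical, not libprocess APIs:

{code}
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Periodically write a no-op over the persistent connection. A failed write
// is taken as evidence that the connection has gone stale, at which point
// the caller can drop the cached connection and reconnect.
void heartbeatLoop(
    const std::function<bool()>& writeNoop,
    const std::function<void()>& onStale,
    int beats,
    std::chrono::milliseconds interval)
{
  for (int i = 0; i < beats; ++i) {
    if (!writeNoop()) {
      onStale();
      return;
    }
    std::this_thread::sleep_for(interval);
  }
}

int main()
{
  int writes = 0;

  // Simulated connection that goes bad after three successful heartbeats.
  auto writeNoop = [&writes]() { return ++writes <= 3; };
  auto onStale = []() { std::cout << "Connection stale, reconnecting" << std::endl; };

  heartbeatLoop(writeNoop, onStale, 10, std::chrono::milliseconds(10));
  return 0;
}
{code}

Either variant generates periodic traffic on the otherwise idle connection, which is also what keeps a conntrack entry from expiring in the first place.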
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275314#comment-15275314 ] Stephan Erb commented on MESOS-5332:

The observation that it takes 5 seconds for a faulty executor to learn about the agent being down tends to be true for all failed executors:
{code}
$ ./analyse-executor-logs.sh
SURVIVING EXECUTOR: I0505 08:41:59.581074 21098 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.580765 10388 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.581061 8492 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.580983 17068 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.581048 27932 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.581203 3738 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.580838 32272 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:05.510376 3335 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:05.509397 2985 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:04.784864 8158 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:05.481689 2536 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:05.481374 2395 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:05.509380 2941 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:04.783973 17405 exec.cpp:456] Slave exited,
{code}

Question regarding step 4: Shouldn't libprocess try to retry the send operation?
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274881#comment-15274881 ] Anand Mazumdar commented on MESOS-5332:

[~bmahler] and I investigated this again today. We think we have the probable root cause: it seems that the faulty executors did not notice the disconnection from the agent for a long time (~5 seconds longer than a good executor, based on the logs).

Timeline:
1. The agent process is restarted.
2. The agent process sends a {{ReconnectExecutorMessage}} to the faulty executor.
3. The faulty executor tries to {{link}} with the agent UPID. Please note that the {{PID}} of the agent does not change upon an agent restart, so we did not try to establish a connection with the agent again since a link already existed: https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1453
4. The executor tried to send an {{ExecutorReregisterMessage}} on the faulty old link. The message would obviously not make it to the agent.
5. The executor then detected that the previous link had broken and issued an {{ExitedEvent}}.

The interesting bit is that the faulty executors learned about the {{ExitedEvent}} a good 5 seconds after the restart compared to the other, good executors (08:42:05 vs 08:41:59).

In order to fix this, we would need to add an {{unlink}} function to libprocess, i.e. the ability to explicitly state that we are no longer interested in the {{ExitedEvent}} for a remote process, along with the ability to have unique agent PIDs. Currently, the agent keeps the same PID, i.e. {{slave(1)@IP:PORT}}, across restarts. (A rough sketch of the link-reuse problem follows below.)

[~StephanErb] Would it be possible to attach the logs from a few other good executors so that we can validate that all of them got the {{ExitedEvent}} around {{08:41:59}}, to confirm our hypothesis?
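To make the link-reuse problem concrete, here is a hypothetical sketch (this is not the libprocess implementation; the PID string and IP are made up, following the {{slave(1)@IP:PORT}} format above) of a per-PID connection cache: because the agent keeps the same PID across restarts, {{link}} hands back the cached, possibly half-open connection, whereas an explicit relink/unlink would force a fresh socket.

{code}
#include <iostream>
#include <map>
#include <memory>
#include <string>

// Stand-in for an underlying socket; the generation number makes it visible
// whether a connection was reused or newly created.
struct Connection
{
  explicit Connection(int generation) : generation(generation) {}
  int generation;
};

class LinkTable
{
public:
  // link(): reuse an existing connection for this PID if one is cached.
  // A stale, half-open socket is reused just as happily as a healthy one.
  std::shared_ptr<Connection> link(const std::string& pid)
  {
    auto it = links.find(pid);
    if (it != links.end()) {
      return it->second;
    }
    return links[pid] = std::make_shared<Connection>(++generation);
  }

  // relink(): drop whatever is cached and establish a new connection,
  // which is what the executor would want after a reconnect request.
  std::shared_ptr<Connection> relink(const std::string& pid)
  {
    links.erase(pid);
    return link(pid);
  }

private:
  std::map<std::string, std::shared_ptr<Connection>> links;
  int generation = 0;
};

int main()
{
  LinkTable table;

  auto first = table.link("slave(1)@10.0.0.1:5051");
  auto reused = table.link("slave(1)@10.0.0.1:5051");   // Same, possibly stale, connection.
  auto fresh = table.relink("slave(1)@10.0.0.1:5051");  // Forces a new connection.

  std::cout << first->generation << " "
            << reused->generation << " "
            << fresh->generation << std::endl;  // Prints: 1 1 2
  return 0;
}
{code}

MESOS-7057, referenced at the top of this thread, is the actual relink fix that later went into libprocess.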
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273776#comment-15273776 ] Stephan Erb commented on MESOS-5332:

All 7 killed executors have the same offending log messages. I have attached the V1 log of another executor that failed during the restart. It seems like the linking was successful. Thanks for looking into this.
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273336#comment-15273336 ] Anand Mazumdar commented on MESOS-5332:

[~bmahler] and I went through the agent logs in more detail. The agent issued 15 reconnect requests; 8 of the executors re-registered, while the agent killed the other 7. So it does not seem to be a SYN flood issue as I had alluded to earlier. However, the curious thing is why the {{link}} on the executor failed so fast. To investigate further, can you try out one of the following two options:

- Replace the following line: https://github.com/apache/mesos/blob/0.26.0/3rdparty/libprocess/src/process.cpp#L1281 with:
{code}
if (future.isDiscarded() || future.isFailed()) {
  if (future.isFailed()) {
-    VLOG(1) << "Failed to link, connect: " << future.failure();
+    LOG(ERROR) << "Failed to link, connect: " << future.failure()
+               << " error code: " << os::strerror(errno);
  }
{code}
or
- Start your executors with verbose logging enabled, i.e. set the {{GLOG_v=1}} environment variable.

I am also trying to reproduce this issue on our end. Thanks again for your patience.
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273276#comment-15273276 ] Anand Mazumdar commented on MESOS-5332:

Correction: Some of the ~9400 might have already been terminated. So not all of them would be active upon agent restart.
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273268#comment-15273268 ] Anand Mazumdar commented on MESOS-5332:

[~StephanErb] Thanks for uploading the logs. Can you confirm that you see this line in almost all of the failed executors?
{code}
E0505 08:42:05.481374 2408 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
{code}
From the logs, it seems that the executor wasn't able to send the {{ExecutorReregisterMessage}} to the agent to re-register itself, owing to not being able to establish a connection with the agent. This might be due to the fact that we had close to ~9400 executors trying to establish a TCP connection with the agent simultaneously (a SYN flood).
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273171#comment-15273171 ] Stephan Erb commented on MESOS-5332:

{code}
$ ls /var/lib/mesos/meta/slaves/latest/frameworks/20151001-085346-58917130-5050-37976-/executors/* | wc -l
9439
{code}
That value is quite large this time. However, I have also seen the problem once on an agent with only ~900 terminated executors.
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272952#comment-15272952 ] Anand Mazumdar commented on MESOS-5332:
[~StephanErb] Thanks for reporting this issue after the initial discussion with Vinod on IRC. Before it's too late, would it be possible to upload the full executor logs (for the failed scenario)? I want to rule out the possibility that we somehow received duplicate {{ExitedEvent}}s for the agent process restart. It would also be good to have the full agent logs from when you saw the above occurrence for the failed executor. Finally, on average, how large was your {{/meta/slaves/latest/frameworks//executors}} directory when you were seeing this behavior?
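A rough way to check the duplicate {{ExitedEvent}} theory (a sketch only, matching on the {{exec.cpp}} log line quoted in the description; substitute the actual executor stderr file name):
{code}
# Count how many times the executor observed the agent exiting. A count
# larger than the number of agent restarts would hint at duplicate
# ExitedEvents being delivered. 'executor.stderr' is a placeholder name.
grep -c 'Slave exited, but framework has checkpointing enabled' executor.stderr
{code}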
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272937#comment-15272937 ] Stephan Erb commented on MESOS-5332:
[~vinodkone] we have talked about this issue before (http://wilderness.apache.org/channels/?f=mesos/2016-04-20#1461173018). However, I was not able to link this problem to a large number of terminated executors:
* Since then, I have also seen it on an agent with only a couple of hundred terminated executors.
* I was not able to reproduce it in a small virtual machine with ~10k terminated executors (see the sketch below).
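For the VM reproduction attempt, a hedged sketch of how terminated executors can be accumulated quickly with the stock {{mesos-execute}} client (flag names per its {{--help}}; the master address and the 10k count are placeholders):
{code}
# Launch many short-lived tasks so the agent accumulates terminated
# executor metadata under its meta directory. Each invocation registers
# a throwaway framework; master address and count are placeholders.
for i in $(seq 1 10000); do
  mesos-execute --master=127.0.0.1:5050 --name="noop-$i" --command="true"
done
{code}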