[jira] [Commented] (MESOS-9177) Mesos master segfaults when responding to /state requests.
[ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589560#comment-16589560 ] Benno Evers commented on MESOS-9177: I think I identified the issue: the recently committed patch 2f4d9ae0 ("Batch '/state' requests on master") introduced a new code path that can lead to multiple threads iterating over the same `completedTasks` circular_buffer in parallel. In theory this is fine, since iteration is read-only and the documentation for boost::circular_buffer explicitly states that parallel reads are thread-safe as long as no data is modified. However, the boost version used on the cluster where this segfault was observed is quite old (1.53), and in that version boost defaults to using checked debug iterators for iteration. These have a *mutable* pointer member m_next forming a mutable chain of iterators that is updated without synchronization whenever a new iterator is created or deleted, making it unsafe to iterate in parallel even over const versions of the same circular buffer. This was fixed in boost 2.5 years ago by the following commit:
{code}
commit ea60799f315aa2e861d0e14ca9012950021c2fc6
Author: Andrey Semashev
Date: Fri Apr 29 00:56:06 2016 +0300

    Disable debug implementation by default

    The debug implementation is not thread-safe, even if different threads are
    using separate iterators for reading elements of the container.

    BOOST_CB_DISABLE_DEBUG macro is no longer used, BOOST_CB_ENABLE_DEBUG=1
    should be defined instead to enable debug support.

    Fixes https://svn.boost.org/trac/boost/ticket/6277.
{code}
This also explains why we could not see the issue locally, since the Boost version bundled with Mesos already contains the fix. The issue should disappear either by upgrading the boost version, or by adding the `BOOST_CB_DISABLE_DEBUG=1` macro to the build process. > Mesos master segfaults when responding to /state requests. 
> -- > > Key: MESOS-9177 > URL: https://issues.apache.org/jira/browse/MESOS-9177 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.7.0 >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Blocker > Labels: mesosphere > > {noformat} > *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; > stack trace: *** > @ 0x7f367e7226d0 (unknown) > @ 0x7f3681266913 > _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_ > @ 0x7f3681266af0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f36812882d0 > mesos::internal::master::FullFrameworkWriter::operator()() > @ 0x7f36812889d0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f368121aef0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f3681241be3 > _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_ > @ 0x7f3681242760 > 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv > @ 0x7f368215f60e process::http::OK::OK() > @ 0x7f3681219061 > _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_ > @ 0x7f36812212c0 >
[jira] [Commented] (MESOS-9178) Add a metric for master failover time.
[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589463#comment-16589463 ] Yan Xu commented on MESOS-9178: --- +1. Yup that's the approach we talked about. Sorry the JIRA didn't mention it. > Add a metric for master failover time. > -- > > Key: MESOS-9178 > URL: https://issues.apache.org/jira/browse/MESOS-9178 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Xudong Ni >Assignee: Xudong Ni >Priority: Minor > > Quote from Yan Xu: Previous the argument against it is that you don't know if > all agents are going to come back after a master failover so there's not a > certain point that marks the end of "full reregistration of all agents". > However empirically the number of agents usually don't change during the > failover and there's an upper bound of such wait (after a 10min timeout the > agents that haven't reregistered are going to be marked unreachable so we can > just use that to stop the timer. > So we can define failover time as "the time it takes for all agents recovered > from the registry to be accounted for" i.e., either reregistered or marked as > unreachable. > This is of course looking at failover from an agent reregistration > perspective. > Later after we add framework info persistence, we can similarly define the > framework perspective using reregistration time or reconciliation time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9174) Unexpected containers transition from RUNNING to DESTROYING during recovery
[ https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589384#comment-16589384 ] Jie Yu commented on MESOS-9174: --- Yeah, I think that can explain why the container gets killed when the agent is restarted, because all container processes are now part of the agent's cgroup (under the systemd named hierarchy). > Unexpected containers transition from RUNNING to DESTROYING during recovery > --- > > Key: MESOS-9174 > URL: https://issues.apache.org/jira/browse/MESOS-9174 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.5.0, 1.6.1 >Reporter: Stephan Erb >Priority: Major > Attachments: mesos-agent.log, mesos-executor-stderr.log > > > I am trying to hunt down a weird issue where sometimes restarting a Mesos > agent takes down all Mesos containers. The containers die without an apparent > cause: > {code} > I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container > 02da7be0-271e-449f-9554-dc776adb29a9 > I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container > 02da7be0-271e-449f-9554-dc776adb29a9 > I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container > 02da7be0-271e-449f-9554-dc776adb29a9 has exited > I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container > 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state > I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state > of container 02da7be0-271e-449f-9554-dc776adb29a9 from RUNNING to DESTROYING > {code} > From the perspective of the executor, there is nothing relevant in the logs. > Everything just stops directly as if the container gets terminated externally > without notifying the executor first. For further details, please see the > attached agent log and one (example) executor log file. > I am aware that this is a long shot, but does anyone have an idea what I should be > looking at to narrow down the issue? 
[jira] [Commented] (MESOS-9178) Add a metric for master failover time.
[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589375#comment-16589375 ] Benjamin Mahler commented on MESOS-9178: Such a metric would be rather brittle: you only need 1 agent to not be able to re-register after a master failover for it to be useless. I would love to see some alternatives explored here, e.g. we could have some progress-oriented metrics: * Time taken for (25%, 50%, 75%, 90%, 99%, 100%) of agents to re-register with the failed-over master. The metric described in this ticket would be the 100% case, but most users will probably monitor a lower percentage. > Add a metric for master failover time. > -- > > Key: MESOS-9178 > URL: https://issues.apache.org/jira/browse/MESOS-9178 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Xudong Ni >Assignee: Xudong Ni >Priority: Minor > > Quote from Yan Xu: Previous the argument against it is that you don't know if > all agents are going to come back after a master failover so there's not a > certain point that marks the end of "full reregistration of all agents". > However empirically the number of agents usually don't change during the > failover and there's an upper bound of such wait (after a 10min timeout the > agents that haven't reregistered are going to be marked unreachable so we can > just use that to stop the timer. > So we can define failover time as "the time it takes for all agents recovered > from the registry to be accounted for" i.e., either reregistered or marked as > unreachable. > This is of course looking at failover from an agent reregistration > perspective. > Later after we add framework info persistence, we can similarly define the > framework perspective using reregistration time or reconciliation time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
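The percentile idea above can be sketched as a small helper (hypothetical — this is not an existing Mesos metric or API): given the durations after failover at which each agent recovered from the registry reregistered, report the time by which a fraction `p` of them were back. The `p = 1.0` case is the metric proposed in this ticket; lower percentiles stay meaningful even when a few agents never return.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: `durations` holds, per recovered agent, the seconds
// elapsed between master failover and that agent's reregistration.
// Returns the time by which a fraction `p` (0 < p <= 1) had reregistered.
double timeToFraction(std::vector<double> durations, double p) {
  assert(!durations.empty() && p > 0.0 && p <= 1.0);
  std::sort(durations.begin(), durations.end());
  // Index of the last agent inside the requested fraction; the small
  // epsilon guards against floating-point round-up (e.g. 0.9 * 10).
  size_t k = static_cast<size_t>(std::ceil(p * durations.size() - 1e-9));
  return durations[k - 1];
}
```

A dashboard could then expose, say, the 50%/90%/100% values side by side, so one straggling agent only degrades the tail metric instead of all of them.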
[jira] [Commented] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks
[ https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589312#comment-16589312 ] Andrei Budnik commented on MESOS-9131: -- [Test draft implementation|https://github.com/abudnik/mesos/commit/cf6e8cbc9aff4cdd350c1f13a2a37a3b5bce656e] [Fix draft implementation|https://github.com/abudnik/mesos/commit/a7b6a7d23e4a190e2d3215c02094c03a7cf72d3a] > Health checks launching nested containers while a container is being > destroyed lead to unkillable tasks > --- > > Key: MESOS-9131 > URL: https://issues.apache.org/jira/browse/MESOS-9131 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Affects Versions: 1.5.1 >Reporter: Jan Schlicht >Assignee: Qian Zhang >Priority: Blocker > Labels: container-stuck > > A container might get stuck in {{DESTROYING}} state if there's a command > health check that starts new nested containers while its parent container is > getting destroyed. > Here are some logs with unrelated lines removed. The > `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` calls keep looping > afterwards. 
> {noformat} > 2018-04-16 12:37:54: I0416 12:37:54.235877 3863 containerizer.cpp:2807] > Container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has > exited > 2018-04-16 12:37:54: I0416 12:37:54.235914 3863 containerizer.cpp:2354] > Destroying container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in > RUNNING state > 2018-04-16 12:37:54: I0416 12:37:54.235932 3863 containerizer.cpp:2968] > Transitioning the state of container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 > from RUNNING to DESTROYING > 2018-04-16 12:37:54: I0416 12:37:54.236100 3852 linux_launcher.cpp:514] > Asked to destroy container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 12:37:54.237671 3852 linux_launcher.cpp:560] > Using freezer to destroy cgroup > mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 12:37:54.240327 3852 cgroups.cpp:3060] Freezing > cgroup > /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 12:37:54.244179 3852 cgroups.cpp:1415] > Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > after 3.814144ms > 2018-04-16 12:37:54: I0416 12:37:54.250550 3853 cgroups.cpp:3078] Thawing > cgroup > /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 12:37:54.256599 3853 cgroups.cpp:1444] > Successfully thawed cgroup > 
/sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > after 5.977856ms > ... > 2018-04-16 12:37:54: I0416 12:37:54.371117 3837 http.cpp:3502] Processing > LAUNCH_NESTED_CONTAINER_SESSION call for container > 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd' > 2018-04-16 12:37:54: W0416 12:37:54.371692 3842 http.cpp:2758] Failed to > launch container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd: > Parent container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is > in 'DESTROYING' state > 2018-04-16 12:37:54: W0416 12:37:54.371826 3840 containerizer.cpp:2337] > Attempted to destroy unknown container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd > ... > 2018-04-16 12:37:55: I0416 12:37:55.504456 3856 http.cpp:3078] Processing > REMOVE_NESTED_CONTAINER call for container > 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6' > ... > 2018-04-16 12:37:55: I0416 12:37:55.556367 3857 http.cpp:3502] Processing > LAUNCH_NESTED_CONTAINER_SESSION call for container > 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211' > ... > 2018-04-16 12:37:55: W0416 12:37:55.582137 3850 http.cpp:2758] Failed to > launch container >
[jira] [Comment Edited] (MESOS-9177) Mesos master segfaults when responding to /state requests.
[ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589281#comment-16589281 ] Benno Evers edited comment on MESOS-9177 at 8/22/18 7:29 PM: - As a preliminary update, I managed to narrow down the location of the segfault to this lambda inside the FullFrameworkWriter:
{code}
foreach (const Owned<Task>& task, framework_->completedTasks) {
  // Skip unauthorized tasks.
  if (!approvers_->approved(*task, framework_->info)) {
    continue;
  }

  writer->element(*task);
}
{code}
or more precisely
{code}
# _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_ + 0x203
1d0b913: 48 8b 51 08  mov 0x8(%rcx),%rdx
{code}
Since the Mesos cluster where this segfault was observed runs with a non-standard (and quite low) value of --max_completed_tasks_per_framework=20, I tried reproducing the crash by starting a mesos-master built from the same commit locally, using the `no-executor-framework` to run many tasks, and repeatedly hitting the state endpoint on this master. While I was able to overload the JSON renderer of my web browser, I didn't manage to reproduce the crash. Next, I turned to reverse engineering the exact location of the crash, which seems to be happening while trying to increment a `boost::circular_buffer::iterator` (i.e., an iterator over the `Master::Framework::completedTasks` container). This indicates that we're probably pushing values into this container while simultaneously iterating over it in another thread. However, I still haven't figured out a theory for how this could happen, or how to induce the crash locally, since all mutations seem to be happening on the Master actor and thus should not be happening in parallel. was (Author: bennoe): As a preliminary update, I managed to narrow down the location of the segfault to this lambda inside the FullFrameworkWriter:
{code}
foreach (const Owned<Task>& task, framework_->completedTasks) {
  // Skip unauthorized tasks.
  if (!approvers_->approved(*task, framework_->info)) {
    continue;
  }

  writer->element(*task);
}
{code}
Since the Mesos cluster where this segfault was observed runs with a non-standard (and quite low) value of --max_completed_tasks_per_framework=20, I tried reproducing the crash by starting a mesos-master built from the same commit locally, using the `no-executor-framework` to run many tasks, and repeatedly hitting the state endpoint on this master. While I was able to overload the JSON renderer of my web browser, I didn't manage to reproduce the crash. Next, I turned to reverse engineering the exact location of the crash, which seems to be happening while trying to increment a `boost::circular_buffer::iterator` (i.e., an iterator over the `Master::Framework::completedTasks` container). This indicates that we're probably pushing values into this container while simultaneously iterating over it in another thread. However, I still haven't figured out a theory for how this could happen, or how to induce the crash locally, since all mutations seem to be happening on the Master actor and thus should not be happening in parallel. > Mesos master segfaults when responding to /state requests. 
> -- > > Key: MESOS-9177 > URL: https://issues.apache.org/jira/browse/MESOS-9177 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.7.0 >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Blocker > Labels: mesosphere > > {noformat} > *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; > stack trace: *** > @ 0x7f367e7226d0 (unknown) > @ 0x7f3681266913 > _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_ > @ 0x7f3681266af0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f36812882d0 > mesos::internal::master::FullFrameworkWriter::operator()() > @ 0x7f36812889d0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f368121aef0 >
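To illustrate the race suspected in the comment above: a circular buffer at capacity overwrites its oldest element on push_back and advances its logical beginning, so an iterator held across a concurrent push can suddenly reference overwritten or logically-removed data. A toy single-threaded ring buffer (hypothetical, for illustration only — not boost's actual implementation) makes the begin movement visible:

```cpp
#include <cstddef>
#include <vector>

// Toy fixed-capacity ring buffer with boost::circular_buffer-like overwrite
// semantics. When full, push_back() overwrites the oldest element and moves
// the logical beginning forward -- which is why an iterator held while
// another thread pushes can end up naming a different element entirely.
struct Ring {
  std::vector<int> slots;
  size_t begin = 0;  // index of the oldest element
  size_t count = 0;

  explicit Ring(size_t capacity) : slots(capacity, 0) {}

  void push_back(int v) {
    if (count < slots.size()) {
      slots[(begin + count) % slots.size()] = v;
      ++count;
    } else {
      slots[begin] = v;                    // overwrite the oldest
      begin = (begin + 1) % slots.size();  // logical begin moves!
    }
  }

  // Element at logical position i (0 == oldest).
  int at(size_t i) const { return slots[(begin + i) % slots.size()]; }
};
```

With --max_completed_tasks_per_framework=20 the real buffer is tiny, so any completed task pushed during an in-flight /state render would exercise exactly this overwrite path.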
[jira] [Commented] (MESOS-9177) Mesos master segfaults when responding to /state requests.
[ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589281#comment-16589281 ] Benno Evers commented on MESOS-9177: As a preliminary update, I managed to narrow down the location of the segfault to this lambda inside the FullFrameworkWriter:
{code}
foreach (const Owned<Task>& task, framework_->completedTasks) {
  // Skip unauthorized tasks.
  if (!approvers_->approved(*task, framework_->info)) {
    continue;
  }

  writer->element(*task);
}
{code}
Since the Mesos cluster where this segfault was observed runs with a non-standard (and quite low) value of --max_completed_tasks_per_framework=20, I tried reproducing the crash by starting a mesos-master built from the same commit locally, using the `no-executor-framework` to run many tasks, and repeatedly hitting the state endpoint on this master. While I was able to overload the JSON renderer of my web browser, I didn't manage to reproduce the crash. Next, I turned to reverse engineering the exact location of the crash, which seems to be happening while trying to increment a `boost::circular_buffer::iterator` (i.e., an iterator over the `Master::Framework::completedTasks` container). This indicates that we're probably pushing values into this container while simultaneously iterating over it in another thread. However, I still haven't figured out a theory for how this could happen, or how to induce the crash locally, since all mutations seem to be happening on the Master actor and thus should not be happening in parallel. > Mesos master segfaults when responding to /state requests. 
> -- > > Key: MESOS-9177 > URL: https://issues.apache.org/jira/browse/MESOS-9177 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.7.0 >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Blocker > Labels: mesosphere > > {noformat} > *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; > stack trace: *** > @ 0x7f367e7226d0 (unknown) > @ 0x7f3681266913 > _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_ > @ 0x7f3681266af0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f36812882d0 > mesos::internal::master::FullFrameworkWriter::operator()() > @ 0x7f36812889d0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f368121aef0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f3681241be3 > _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_ > @ 0x7f3681242760 > 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv > @ 0x7f368215f60e process::http::OK::OK() > @ 0x7f3681219061 > _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_ > @ 0x7f36812212c0 > _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_ > @ 0x7f36812215ac >
[jira] [Commented] (MESOS-9178) Add a metric for master failover time.
[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589159#comment-16589159 ] James Peach commented on MESOS-9178: /cc [~bmahler] > Add a metric for master failover time. > -- > > Key: MESOS-9178 > URL: https://issues.apache.org/jira/browse/MESOS-9178 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Xudong Ni >Assignee: Xudong Ni >Priority: Minor > > Quote from Yan Xu: Previous the argument against it is that you don't know if > all agents are going to come back after a master failover so there's not a > certain point that marks the end of "full reregistration of all agents". > However empirically the number of agents usually don't change during the > failover and there's an upper bound of such wait (after a 10min timeout the > agents that haven't reregistered are going to be marked unreachable so we can > just use that to stop the timer. > So we can define failover time as "the time it takes for all agents recovered > from the registry to be accounted for" i.e., either reregistered or marked as > unreachable. > This is of course looking at failover from an agent reregistration > perspective. > Later after we add framework info persistence, we can similarly define the > framework perspective using reregistration time or reconciliation time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8345) Improve master responsiveness while serving state information.
[ https://issues.apache.org/jira/browse/MESOS-8345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-8345: -- Assignee: Alexander Rukletsov > Improve master responsiveness while serving state information. > -- > > Key: MESOS-8345 > URL: https://issues.apache.org/jira/browse/MESOS-8345 > Project: Mesos > Issue Type: Epic > Components: HTTP API, master >Reporter: Benjamin Mahler >Assignee: Alexander Rukletsov >Priority: Major > Labels: mesosphere, performance > > Currently when state is requested from the master, the response is built > using the master actor. This means that when the master is building an > expensive state response, the master is locked and cannot process other > events. This in turn can lead to higher latency on further requests to state. > Previous performance improvements to JSON generation (MESOS-4235) alleviated > this issue, but for large cluster with a lot of clients this can still be a > problem. > It's possible to serve state outside of the master actor by streaming the > state (re-using the existing streaming operator API) into another actor(s) > and serving from there. > NOTE: I believe this approach will incur a small performance cost during > master failover, since the master has to perform an additional copy of state > that it fans out. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9178) mesos metrics: master failover time
Xudong Ni created MESOS-9178: Summary: mesos metrics: master failover time Key: MESOS-9178 URL: https://issues.apache.org/jira/browse/MESOS-9178 Project: Mesos Issue Type: Improvement Components: master Reporter: Xudong Ni Quote from Yan Xu: Previously, the argument against it was that you don't know if all agents are going to come back after a master failover, so there's no certain point that marks the end of "full reregistration of all agents". However, empirically the number of agents usually doesn't change during the failover, and there's an upper bound on such a wait (after a 10min timeout, the agents that haven't reregistered are going to be marked unreachable, so we can just use that to stop the timer). So we can define failover time as "the time it takes for all agents recovered from the registry to be accounted for", i.e., either reregistered or marked as unreachable. This is of course looking at failover from an agent reregistration perspective. Later, after we add framework info persistence, we can similarly define the framework perspective using reregistration time or reconciliation time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9178) mesos metrics: master failover time
[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Ni reassigned MESOS-9178: Assignee: Xudong Ni > mesos metrics: master failover time > --- > > Key: MESOS-9178 > URL: https://issues.apache.org/jira/browse/MESOS-9178 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Xudong Ni >Assignee: Xudong Ni >Priority: Minor > > Quote from Yan Xu: Previous the argument against it is that you don't know if > all agents are going to come back after a master failover so there's not a > certain point that marks the end of "full reregistration of all agents". > However empirically the number of agents usually don't change during the > failover and there's an upper bound of such wait (after a 10min timeout the > agents that haven't reregistered are going to be marked unreachable so we can > just use that to stop the timer. > So we can define failover time as "the time it takes for all agents recovered > from the registry to be accounted for" i.e., either reregistered or marked as > unreachable. > This is of course looking at failover from an agent reregistration > perspective. > Later after we add framework info persistence, we can similarly define the > framework perspective using reregistration time or reconciliation time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9177) Mesos master segfaults when responding to /state requests.
[ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9177: - Shepherd: Alexander Rukletsov Assignee: Benno Evers Sprint: Mesosphere Sprint 2018-27 Story Points: 3 Target Version/s: 1.7.0 > Mesos master segfaults when responding to /state requests. > -- > > Key: MESOS-9177 > URL: https://issues.apache.org/jira/browse/MESOS-9177 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.7.0 >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Blocker > Labels: mesosphere > > {noformat} > *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; > stack trace: *** > @ 0x7f367e7226d0 (unknown) > @ 0x7f3681266913 > _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_ > @ 0x7f3681266af0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f36812882d0 > mesos::internal::master::FullFrameworkWriter::operator()() > @ 0x7f36812889d0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f368121aef0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > 
@ 0x7f3681241be3 > _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_ > @ 0x7f3681242760 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv > @ 0x7f368215f60e process::http::OK::OK() > @ 0x7f3681219061 > _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_ > @ 0x7f36812212c0 > _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_ > @ 0x7f36812215ac > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_ > @ 0x7f36821f3541 process::ProcessBase::consume() > @ 0x7f3682209fbc process::ProcessManager::resume() > @ 0x7f368220fa76 > 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > @ 0x7f367eefc2b0 (unknown) > @ 0x7f367e71ae25 start_thread > @ 0x7f367e444bad __clone > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
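Based on the analysis in the comment above, the crash can be avoided either by upgrading Boost or by disabling {{boost::circular_buffer}}'s checked debug iterators at build time. A sketch of the latter, assuming an autotools-style build (the exact invocation will vary by environment):

{code}
# Define BOOST_CB_DISABLE_DEBUG so that the old Boost (1.53) installed on
# the cluster skips the non-thread-safe debug-iterator bookkeeping when
# iterating over the completedTasks circular_buffer.
./configure CXXFLAGS="-DBOOST_CB_DISABLE_DEBUG=1"
make
{code}

With a Boost release that includes commit ea60799f, the debug implementation is already disabled by default and the flag is unnecessary.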
[jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588976#comment-16588976 ] Qian Zhang edited comment on MESOS-8568 at 8/22/18 3:02 PM: Steps to reproduce: 1. To simulate a failure when launching the nested container used for the health check, change `CgroupsIsolatorProcess::isolate` slightly: {code:java} Future CgroupsIsolatorProcess::isolate( const ContainerID& containerId, pid_t pid) { + if (strings::startsWith(containerId.value(), "check")) { + return Failure("==Fake error=="); + } + {code} 2. Start the Mesos master and agent. {code:java} $ sudo ./bin/mesos-master.sh --work_dir=/home/qzhang/opt/mesos $ sudo ./bin/mesos-slave.sh --master=10.0.49.2:5050 --work_dir=/home/qzhang/opt/mesos --isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem {code} 3. Launch a task group with a command check enabled. {code:java} $ cat task_group_health_check.json { "tasks":[ { "name" : "test", "task_id" : {"value" : "test"}, "agent_id": {"value" : ""}, "resources": [ {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}}, {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}} ], "command": { "value": "touch aaa && sleep 1000" }, "check": { "type": "COMMAND", "command": { "command": { "value": "ls aaa > /dev/null" } }, "delay_seconds": 5, "interval_seconds": 3 } } ] } $ src/mesos-execute --master=10.0.49.2:5050 --task_group=file:///home/qzhang/workspace/config/task_group_health_check.json{code} 4. After a few minutes, many of the check containers' sandbox directories will be left unremoved. {code:java} $ ls -la /home/qzhang/opt/mesos/slaves/c355abce-0088-4196-8376-d54c9963abdd-S0/frameworks/c355abce-0088-4196-8376-d54c9963abdd-/executors/default-executor/runs/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/containers/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/containers | grep check | wc -l 119 {code} And in the default executor's stderr, we see a lot of warning messages. {code:java} ... 
W0822 07:37:45.084581 19377 checker_process.cpp:794] Received '400 Bad Request' (Collect failed: ==Fake error==) while launching COMMAND check for task 'test' I0822 07:37:45.085053 19377 checker_process.cpp:457] COMMAND check for task 'test' is not available W0822 07:37:48.092411 19362 checker_process.cpp:655] Received '500 Internal Server Error' (Nested container has not terminated yet) while removing the nested container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586' used for the COMMAND check for task 'test' I0822 07:37:48.093101 19362 checker_process.cpp:457] COMMAND check for task 'test' is not available W0822 07:37:48.130527 19373 checker_process.cpp:794] Received '400 Bad Request' (Collect failed: ==Fake error==) while launching COMMAND check for task 'test' W0822 07:37:51.099179 19360 checker_process.cpp:655] Received '500 Internal Server Error' (Nested container has not terminated yet) while removing the nested container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-5c7af0fc-ad73-4870-aba8-65a3fb4eae12' used for the COMMAND check for task 'test' I0822 07:37:51.099799 19360 checker_process.cpp:457] COMMAND check for task 'test' is not available ...{code} So every time the default executor called `REMOVE_NESTED_CONTAINER` to remove the previous check container, the call failed with a 500 error. The call failed because the check container had not yet terminated (it was still in the `DESTROYING` state); the agent log below confirms this. 
{code:java} I0822 07:37:45.051453 19063 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586' I0822 07:37:45.058904 19088 switchboard.cpp:316] Container logger module finished preparing container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586; IOSwitchboard server is required I0822 07:37:45.065444 19088 systemd.cpp:98] Assigned child process '19410' to 'mesos_executors.slice' I0822 07:37:45.065724 19088 switchboard.cpp:604] Created I/O switchboard server (pid: 19410) listening on socket file '/tmp/mesos-io-switchboard-048e2be0-4a2b-4c00-a846-0e8137507a85' for container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 I0822 07:37:45.069316 19062 linux_launcher.cpp:492] Launching nested container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 and
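As the issue title suggests, the fix is for the default executor's checker to call `WAIT_NESTED_CONTAINER` first and only issue `REMOVE_NESTED_CONTAINER` once the previous check container has actually terminated. The ordering can be sketched with a self-contained toy model (the types and handlers below are illustrative stand-ins, not the actual Mesos agent code):

{code:java}
#include <cassert>

// Toy model of a nested check container's lifecycle on the agent.
enum class State { RUNNING, DESTROYING, TERMINATED };

struct NestedContainer {
  State state = State::DESTROYING;  // mid-destroy, as in the log above
};

// Stand-in for the agent's REMOVE_NESTED_CONTAINER handler: removal only
// succeeds once the container has fully terminated.
int removeNestedContainer(const NestedContainer& c) {
  return c.state == State::TERMINATED ? 200 : 500;
}

// Stand-in for WAIT_NESTED_CONTAINER: returns only after termination.
// (Here we simply drive the state machine to completion synchronously.)
void waitNestedContainer(NestedContainer& c) {
  c.state = State::TERMINATED;
}

int main() {
  NestedContainer check;
  // Removing while still DESTROYING reproduces the 500
  // "Nested container has not terminated yet" error seen in stderr.
  assert(removeNestedContainer(check) == 500);

  // Waiting first, then removing, succeeds.
  waitNestedContainer(check);
  assert(removeNestedContainer(check) == 200);
  return 0;
}
{code}

The point of the model is only the ordering: remove-before-wait races with the destroy sequence, wait-before-remove cannot.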
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588976#comment-16588976 ] Qian Zhang commented on MESOS-8568: --- Steps to reproduce: 1. To simulate a failure when launching the nested container used for the health check, change `CgroupsIsolatorProcess::isolate` slightly: {code:java} Future CgroupsIsolatorProcess::isolate( const ContainerID& containerId, pid_t pid) { + if (strings::startsWith(containerId.value(), "check")) { + return Failure("==Fake error=="); + } + {code} 2. Start the Mesos master and agent. {code:java} $ sudo ./bin/mesos-master.sh --work_dir=/home/qzhang/opt/mesos $ sudo ./bin/mesos-slave.sh --master=10.0.49.2:5050 --port=36251 --work_dir=/home/qzhang/opt/mesos --isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem {code} 3. Launch a task group with a command check enabled. {code:java} $ cat task_group_health_check.json { "tasks":[ { "name" : "test", "task_id" : {"value" : "test"}, "agent_id": {"value" : ""}, "resources": [ {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}}, {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}} ], "command": { "value": "touch aaa && sleep 1000" }, "check": { "type": "COMMAND", "command": { "command": { "value": "ls aaa > /dev/null" } }, "delay_seconds": 5, "interval_seconds": 3 } } ] } $ src/mesos-execute --master=10.0.49.2:5050 --task_group=file:///home/qzhang/workspace/config/task_group_health_check.json{code} 4. After a few minutes, many of the check containers' sandbox directories will be left unremoved. {code:java} $ ls -la /home/qzhang/opt/mesos/slaves/c355abce-0088-4196-8376-d54c9963abdd-S0/frameworks/c355abce-0088-4196-8376-d54c9963abdd-/executors/default-executor/runs/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/containers/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/containers | grep check | wc -l 119 {code} And in the default executor's stderr, we see a lot of warning messages. {code:java} ... 
W0822 07:37:45.084581 19377 checker_process.cpp:794] Received '400 Bad Request' (Collect failed: ==Fake error==) while launching COMMAND check for task 'test' I0822 07:37:45.085053 19377 checker_process.cpp:457] COMMAND check for task 'test' is not available W0822 07:37:48.092411 19362 checker_process.cpp:655] Received '500 Internal Server Error' (Nested container has not terminated yet) while removing the nested container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586' used for the COMMAND check for task 'test' I0822 07:37:48.093101 19362 checker_process.cpp:457] COMMAND check for task 'test' is not available W0822 07:37:48.130527 19373 checker_process.cpp:794] Received '400 Bad Request' (Collect failed: ==Fake error==) while launching COMMAND check for task 'test' W0822 07:37:51.099179 19360 checker_process.cpp:655] Received '500 Internal Server Error' (Nested container has not terminated yet) while removing the nested container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-5c7af0fc-ad73-4870-aba8-65a3fb4eae12' used for the COMMAND check for task 'test' I0822 07:37:51.099799 19360 checker_process.cpp:457] COMMAND check for task 'test' is not available ...{code} So every time the default executor called `REMOVE_NESTED_CONTAINER` to remove the previous check container, the call failed with a 500 error. The call failed because the check container had not yet terminated (it was still in the `DESTROYING` state); the agent log below confirms this. 
{code:java} I0822 07:37:45.051453 19063 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586' I0822 07:37:45.058904 19088 switchboard.cpp:316] Container logger module finished preparing container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586; IOSwitchboard server is required I0822 07:37:45.065444 19088 systemd.cpp:98] Assigned child process '19410' to 'mesos_executors.slice' I0822 07:37:45.065724 19088 switchboard.cpp:604] Created I/O switchboard server (pid: 19410) listening on socket file '/tmp/mesos-io-switchboard-048e2be0-4a2b-4c00-a846-0e8137507a85' for container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 I0822 07:37:45.069316 19062 linux_launcher.cpp:492] Launching nested container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 and cloning with namespaces
[jira] [Created] (MESOS-9177) Mesos master segfaults when responding to /state requests.
Alexander Rukletsov created MESOS-9177: -- Summary: Mesos master segfaults when responding to /state requests. Key: MESOS-9177 URL: https://issues.apache.org/jira/browse/MESOS-9177 Project: Mesos Issue Type: Bug Components: master Affects Versions: 1.7.0 Reporter: Alexander Rukletsov {noformat} *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; stack trace: *** @ 0x7f367e7226d0 (unknown) @ 0x7f3681266913 _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_ @ 0x7f3681266af0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ @ 0x7f36812882d0 mesos::internal::master::FullFrameworkWriter::operator()() @ 0x7f36812889d0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ @ 0x7f368121aef0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ @ 0x7f3681241be3 _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_ @ 0x7f3681242760 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv @ 0x7f368215f60e process::http::OK::OK() @ 0x7f3681219061 _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_ @ 0x7f36812212c0 _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_ @ 0x7f36812215ac _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_ @ 0x7f36821f3541 process::ProcessBase::consume() @ 0x7f3682209fbc process::ProcessManager::resume() @ 0x7f368220fa76 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv @ 0x7f367eefc2b0 (unknown) @ 0x7f367e71ae25 start_thread @ 0x7f367e444bad __clone {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9174) Unexpected containers transition from RUNNING to DESTROYING during recovery
[ https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1651#comment-1651 ] Stephan Erb commented on MESOS-9174: [~jieyu], we have found something interesting related to [https://reviews.apache.org/r/62800]. I have not checked the entire cluster, but at first sight it seems as if there are problems related to systemd. *Nodes without recovery issues* Those are running Mesos 1.6.1 and systemd 232-25+deb9u4 (Debian Stretch). {code:java} $ systemd-cgls Control group /: -.slice ├─mesos │ ├─9b70ff19-238c-4520-978c-688b83e705ce │ │ ├─ 5129 /usr/lib/x86_64-linux-gnu/mesos/mesos-containerizer launch ... ├─init.scope │ └─1 /sbin/init └─system.slice ├─mesos-agent.service │ └─1472 /usr/sbin/mesos-agent --master=file:///etc/mesos-agent/zk ... {code} *Nodes with recovery issues* Those are running Mesos 1.6.1 and systemd 237-3~bpo9+1 (Debian Stretch backports). {code:java} $ systemd-cgls Control group /: -.slice ├─init.scope │ └─1 /sbin/init └─system.slice ├─mesos-agent.service │ ├─ 19151 haproxy -f haproxy.cfg -p haproxy.pid -sf 149 │ ├─ 39633 /usr/lib/x86_64-linux-gnu/mesos/mesos-containerizer launch │ ├─ 39638 sh -c ${MESOS_SANDBOX=.}/thermos_executor_wrapper.sh │ ├─ 39639 python2.7 ... │ ├─ 39684 /usr/bin/python2.7... │ ├─ 39710 /usr/bin/python2.7 ... │ ├─ 39714 /usr/bin/ruby /usr/bin/synapse -c synapse.conf │ ├─ 39775 haproxy -f haproxy.cfg -p haproxy.pid -sf │ ├─ 39837 /usr/bin/python2.7 ... 
{code} In particular, there is no {{mesos}} group/section, even though one clearly shows up in systemd-cgtop: {code:java} /1700 -45.2G-- /mesos - -38.4G-- /mesos/144dc11d-dbd0-42e8-89e4-a72384e777df - - 1.6G-- /mesos/15a8e488-495c-4db8-a11b-7e8277ec4c93 - - 3.1G-- /mesos/2a2b5913-2445-4111-9d18-71abc9f1f8cd - - 1021.2M-- /mesos/2e1c5c91-6a80-4242-b105-023c1eb2c89d - - 2.6G-- /mesos/356c5c0f-2ae0-4dfc-9415-d1dbeb172542 - - 898.4M-- /mesos/3baf4930-4332-4206-91d5-d39ea6bb3389 - - 3.1G-- /mesos/3d1b9554-911d-44ee-b204-fe622f02ef7a - - 845.0M-- /mesos/431aa2a0-11e4-4bf3-b888-ee10cf689326 - - 1.3G-- /mesos/94f8e3bb-360a-4694-9359-4da10cb4e5df - - 1.2G-- /mesos/9d1b3251-6c61-404e-88d0-03319d1a508c - - 3.2G-- /mesos/b5bb9133-4093-4bc6-90c1-3656b20559bf - - 417.6M-- /mesos/b89095dd-21bc-4255-86c8-14bd7cd0ac2a - - 1.5G-- /system.slice1137 - 8.7G-- {code} The output above is from the same faulty node that I used to pull the mesos-agent.log. I will try to reproduce the issue by upgrading systemd in another test environment and then report back. Newer systemd versions have changed the behaviour of {{Delegate=}}, which could indeed be related to the observed issue. > Unexpected containers transition from RUNNING to DESTROYING during recovery > --- > > Key: MESOS-9174 > URL: https://issues.apache.org/jira/browse/MESOS-9174 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.5.0, 1.6.1 >Reporter: Stephan Erb >Priority: Major > Attachments: mesos-agent.log, mesos-executor-stderr.log > > > I am trying to hunt down a weird issue where sometimes restarting a Mesos > agent takes down all Mesos containers. 
The containers die without an apparent > cause: > {code} > I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container > 02da7be0-271e-449f-9554-dc776adb29a9 > I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container > 02da7be0-271e-449f-9554-dc776adb29a9 > I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container > 02da7be0-271e-449f-9554-dc776adb29a9 has exited > I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container > 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state > I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state > of container
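If the changed {{Delegate=}} semantics in newer systemd turn out to be the cause, the relevant setting lives in the agent's unit file. A hypothetical fragment (the unit name and paths are illustrative, taken from the {{systemd-cgls}} output above):

{code}
# /etc/systemd/system/mesos-agent.service (illustrative fragment)
[Service]
ExecStart=/usr/sbin/mesos-agent --master=file:///etc/mesos-agent/zk
# Delegate=yes asks systemd to leave the cgroup subtree below this service
# alone, so agent-managed container cgroups are not touched by systemd's
# own cgroup bookkeeping. systemd >= 236 changed how delegation behaves,
# which may explain why container cgroups differ between the two nodes.
Delegate=yes
{code}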
[jira] [Created] (MESOS-9176) Mesos does not work properly on modern Ubuntu distributions.
Alexander Rukletsov created MESOS-9176: -- Summary: Mesos does not work properly on modern Ubuntu distributions. Key: MESOS-9176 URL: https://issues.apache.org/jira/browse/MESOS-9176 Project: Mesos Issue Type: Epic Affects Versions: 1.7.0 Environment: Ubuntu 17.10 Ubuntu 18.04 Reporter: Alexander Rukletsov We have observed several issues in various components on modern Ubuntu releases, e.g., 17.10 and 18.04. Needless to say, we need to ensure Mesos compiles and runs fine on those distros. -- This message was sent by Atlassian JIRA (v7.6.3#76005)