[jira] [Commented] (MESOS-9177) Mesos master segfaults when responding to /state requests.

2018-08-22 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589560#comment-16589560
 ] 

Benno Evers commented on MESOS-9177:


I think I identified the issue: the recently committed patch 2f4d9ae0 ("Batch 
'/state' requests on master") introduced a new code path that can lead to 
multiple threads iterating over the same `completedTasks` circular_buffer in 
parallel.

In theory this is fine, since iteration is read-only and the documentation for 
boost::circular_buffer explicitly states that parallel reads are thread-safe as 
long as no data is modified.

However, the Boost version used on the cluster where this segfault was 
observed is quite old (1.53), and in that version Boost defaults to using 
checked debug iterators. These have a *mutable* pointer member `m_next` 
forming a chain of iterators that is updated without synchronization whenever 
an iterator is created or destroyed, making it unsafe to iterate even over 
const versions of the same circular buffer from multiple threads.
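
For illustration, a minimal standalone sketch of this failure mode (not taken 
from the Mesos code base; the buffer size and reader loop are arbitrary). Both 
threads only read, yet with debug iterators enabled every begin()/end() call 
mutates the hidden iterator chain:

{code}
// Assumes Boost.CircularBuffer with debug iterators enabled (e.g. Boost 1.53
// built without NDEBUG/BOOST_CB_DISABLE_DEBUG); with them disabled this code
// is well-defined.
#include <boost/circular_buffer.hpp>
#include <thread>

int main()
{
  boost::circular_buffer<int> buffer(1000);
  for (int i = 0; i < 1000; ++i) {
    buffer.push_back(i);
  }

  const boost::circular_buffer<int>& view = buffer;

  auto reader = [&view]() {
    long sum = 0;
    for (int round = 0; round < 10000; ++round) {
      // Each begin()/end() call registers an iterator in the debug chain;
      // with two threads this is a data race even though `view` is const.
      for (int value : view) {
        sum += value;
      }
    }
    return sum;
  };

  std::thread t1(reader);
  std::thread t2(reader);
  t1.join();
  t2.join();

  return 0;
}
{code}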

This was fixed in Boost about 2.5 years ago by the following commit:
{code}
commit ea60799f315aa2e861d0e14ca9012950021c2fc6
Author: Andrey Semashev 
Date:   Fri Apr 29 00:56:06 2016 +0300

Disable debug implementation by default

The debug implementation is not thread-safe, even if different threads are 
using separate iterators for reading elements of the container. 
BOOST_CB_DISABLE_DEBUG macro is no longer used, BOOST_CB_ENABLE_DEBUG=1 should 
be defined instead to enable debug support.

Fixes https://svn.boost.org/trac/boost/ticket/6277.
{code}


This also explains why we could not reproduce the issue locally: the Boost 
version bundled with Mesos already contains the fix.

The issue should disappear after either upgrading the Boost version or adding 
the `BOOST_CB_DISABLE_DEBUG=1` define to the build process.
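
For reference, the macro has to be visible before the Boost header is pulled 
in; a minimal sketch of the second option (illustrative only, normally it 
would be passed on the compiler command line rather than hard-coded):

{code}
// Illustrative only: define the macro before including the header (or pass
// -DBOOST_CB_DISABLE_DEBUG to the compiler for every translation unit).
#define BOOST_CB_DISABLE_DEBUG
#include <boost/circular_buffer.hpp>
{code}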

> Mesos master segfaults when responding to /state requests.
> --
>
> Key: MESOS-9177
> URL: https://issues.apache.org/jira/browse/MESOS-9177
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
> stack trace: ***
>  @ 0x7f367e7226d0 (unknown)
>  @ 0x7f3681266913 
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @ 0x7f3681266af0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36812882d0 
> mesos::internal::master::FullFrameworkWriter::operator()()
>  @ 0x7f36812889d0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f368121aef0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f3681241be3 
> _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
>  @ 0x7f3681242760 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
>  @ 0x7f368215f60e process::http::OK::OK()
>  @ 0x7f3681219061 
> _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
>  @ 0x7f36812212c0 
> 

[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-08-22 Thread Yan Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589463#comment-16589463
 ] 

Yan Xu commented on MESOS-9178:
---

+1. Yup that's the approach we talked about. Sorry the JIRA didn't mention it.

> Add a metric for master failover time.
> --
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Xudong Ni
>Assignee: Xudong Ni
>Priority: Minor
>
> Quote from Yan Xu: Previously the argument against it was that you don't know 
> if all agents are going to come back after a master failover, so there's not a 
> certain point that marks the end of "full reregistration of all agents". 
> However, empirically the number of agents usually doesn't change during the 
> failover, and there's an upper bound on such a wait (after a 10min timeout the 
> agents that haven't reregistered are going to be marked unreachable, so we can 
> just use that to stop the timer).
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for" i.e., either reregistered or marked as 
> unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9174) Unexpected containers transition from RUNNING to DESTROYING during recovery

2018-08-22 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589384#comment-16589384
 ] 

Jie Yu commented on MESOS-9174:
---

Yeah, I think that can explain why the container gets killed when the agent is 
restarted, because all container processes are now part of the agent's cgroup 
(under the systemd named hierarchy).



> Unexpected containers transition from RUNNING to DESTROYING during recovery
> ---
>
> Key: MESOS-9174
> URL: https://issues.apache.org/jira/browse/MESOS-9174
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0, 1.6.1
>Reporter: Stephan Erb
>Priority: Major
> Attachments: mesos-agent.log, mesos-executor-stderr.log
>
>
> I am trying to hunt down a weird issue where sometimes restarting a Mesos 
> agent takes down all Mesos containers. The containers die without an apparent 
> cause:
> {code}
> I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container 
> 02da7be0-271e-449f-9554-dc776adb29a9 has exited
> I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container 
> 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state
> I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state 
> of container 02da7be0-271e-449f-9554-dc776adb29a9 from RUNNING to DESTROYING
> {code}
> From the perspective of the executor, there is nothing relevant in the logs. 
> Everything just stops directly as if the container gets terminated externally 
> without notifying the executor first. For further details, please see the 
> attached agent log and one (example) executor log file.
> I am aware that this is a long shot, but does anyone have an idea what I 
> should be looking at to narrow down the issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-08-22 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589375#comment-16589375
 ] 

Benjamin Mahler commented on MESOS-9178:


Such a metric would be rather brittle: you only need one agent that cannot 
re-register after a master failover for it to become useless. I would love to 
see some alternatives explored here, e.g.

We could have some progress-oriented metrics (see the sketch below):
* Time taken for the failed-over master to register (25%, 50%, 75%, 90%, 99%, 
100%) of agents. The metric described in this ticket would be the 100% case, 
but most users will probably monitor a lower percentage.
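
To make the idea concrete, a minimal sketch of such a progress metric (not 
based on the actual Mesos metrics code; the class name and the use of 
std::chrono are illustrative assumptions):

{code}
// Records how long after a failover each reregistration percentile is crossed.
#include <chrono>
#include <map>

class ReregistrationProgressMetrics
{
public:
  explicit ReregistrationProgressMetrics(size_t totalAgents)
    : total(totalAgents),
      accounted(0),
      start(std::chrono::steady_clock::now()) {}

  // Called whenever an agent reregisters or is marked unreachable.
  void accountForAgent()
  {
    ++accounted;
    for (double p : {0.25, 0.50, 0.75, 0.90, 0.99, 1.00}) {
      // Record the elapsed time the first time each percentile is crossed.
      if (elapsed.count(p) == 0 && accounted >= p * total) {
        elapsed[p] = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
      }
    }
  }

  // Percentile -> seconds since failover; only crossed percentiles appear.
  const std::map<double, double>& percentiles() const { return elapsed; }

private:
  const size_t total;
  size_t accounted;
  const std::chrono::steady_clock::time_point start;
  std::map<double, double> elapsed;
};
{code}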

> Add a metric for master failover time.
> --
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Xudong Ni
>Assignee: Xudong Ni
>Priority: Minor
>
> Quote from Yan Xu: Previously the argument against it was that you don't know 
> if all agents are going to come back after a master failover, so there's not a 
> certain point that marks the end of "full reregistration of all agents". 
> However, empirically the number of agents usually doesn't change during the 
> failover, and there's an upper bound on such a wait (after a 10min timeout the 
> agents that haven't reregistered are going to be marked unreachable, so we can 
> just use that to stop the timer).
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for" i.e., either reregistered or marked as 
> unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks

2018-08-22 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589312#comment-16589312
 ] 

Andrei Budnik commented on MESOS-9131:
--

[Test draft 
implementation|https://github.com/abudnik/mesos/commit/cf6e8cbc9aff4cdd350c1f13a2a37a3b5bce656e]

[Fix draft 
implementation|https://github.com/abudnik/mesos/commit/a7b6a7d23e4a190e2d3215c02094c03a7cf72d3a]

> Health checks launching nested containers while a container is being 
> destroyed lead to unkillable tasks
> ---
>
> Key: MESOS-9131
> URL: https://issues.apache.org/jira/browse/MESOS-9131
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.5.1
>Reporter: Jan Schlicht
>Assignee: Qian Zhang
>Priority: Blocker
>  Labels: container-stuck
>
> A container might get stuck in {{DESTROYING}} state if there's a command 
> health check that starts new nested containers while its parent container is 
> getting destroyed.
> Here are some logs with unrelated lines removed. The 
> `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` sequence keeps 
> looping afterwards.
> {noformat}
> 2018-04-16 12:37:54: I0416 12:37:54.235877  3863 containerizer.cpp:2807] 
> Container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has 
> exited
> 2018-04-16 12:37:54: I0416 12:37:54.235914  3863 containerizer.cpp:2354] 
> Destroying container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in 
> RUNNING state
> 2018-04-16 12:37:54: I0416 12:37:54.235932  3863 containerizer.cpp:2968] 
> Transitioning the state of container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 
> from RUNNING to DESTROYING
> 2018-04-16 12:37:54: I0416 12:37:54.236100  3852 linux_launcher.cpp:514] 
> Asked to destroy container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.237671  3852 linux_launcher.cpp:560] 
> Using freezer to destroy cgroup 
> mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.240327  3852 cgroups.cpp:3060] Freezing 
> cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.244179  3852 cgroups.cpp:1415] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
>  after 3.814144ms
> 2018-04-16 12:37:54: I0416 12:37:54.250550  3853 cgroups.cpp:3078] Thawing 
> cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.256599  3853 cgroups.cpp:1444] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
>  after 5.977856ms
> ...
> 2018-04-16 12:37:54: I0416 12:37:54.371117  3837 http.cpp:3502] Processing 
> LAUNCH_NESTED_CONTAINER_SESSION call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd'
> 2018-04-16 12:37:54: W0416 12:37:54.371692  3842 http.cpp:2758] Failed to 
> launch container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd:
>  Parent container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is 
> in 'DESTROYING' state
> 2018-04-16 12:37:54: W0416 12:37:54.371826  3840 containerizer.cpp:2337] 
> Attempted to destroy unknown container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.504456  3856 http.cpp:3078] Processing 
> REMOVE_NESTED_CONTAINER call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6'
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.556367  3857 http.cpp:3502] Processing 
> LAUNCH_NESTED_CONTAINER_SESSION call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211'
> ...
> 2018-04-16 12:37:55: W0416 12:37:55.582137  3850 http.cpp:2758] Failed to 
> launch container 
> 

[jira] [Comment Edited] (MESOS-9177) Mesos master segfaults when responding to /state requests.

2018-08-22 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589281#comment-16589281
 ] 

Benno Evers edited comment on MESOS-9177 at 8/22/18 7:29 PM:
-

As a preliminary update, I managed to narrow down the location of the segfault 
to this lambda inside the FullFrameworkWriter:

{code}
  foreach (const Owned& task, framework_->completedTasks) {
// Skip unauthorized tasks.
if (!approvers_->approved(*task, framework_->info)) {
  continue;
}

writer->element(*task);
  }
{code}

or more precisely 

{code}
# 
_ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
 + 0x203
1d0b913:   48 8b 51 08 mov0x8(%rcx),%rdx
{code}

Since the Mesos cluster where this segfault was observed runs with a 
non-standard (and quite low) value of --max_completed_tasks_per_framework=20, I 
tried reproducing the crash by starting a mesos-master built from the same 
commit locally, using the `no-executor-framework` to run many tasks, and 
repeatedly hitting the state endpoint on this master. While I was able to 
overload the JSON renderer of my web browser, I didn't manage to reproduce the 
crash.

Next, I turned to reverse engineering the exact location of the crash, which 
seems to happen while trying to increment a 
`boost::circular_buffer::iterator` (i.e., the container type of 
`Master::Framework::completedTasks`). This indicates that we're probably 
pushing values into this container while simultaneously iterating over it in 
another thread.

However, I still haven't figured out a theory for how this could happen, or how 
to induce the crash locally, since all mutations seem to be happening on the 
Master actor and thus should not be happening in parallel.


was (Author: bennoe):
As a preliminary update, I managed to narrow down the location of the segfault 
to this lambda inside the FullFrameworkWriter:

{code}
  foreach (const Owned& task, framework_->completedTasks) {
// Skip unauthorized tasks.
if (!approvers_->approved(*task, framework_->info)) {
  continue;
}

writer->element(*task);
  }
{code}

Since the Mesos cluster where this segfault was observed runs with a 
non-standard (and quite low) value of --max_completed_tasks_per_framework=20, I 
tried reproducing the crash by starting a mesos-master built from the same 
commit locally, using the `no-executor-framework` to run many tasks, and 
repeatedly hitting the state endpoint on this master. While I was able to 
overload the JSON renderer of my web browser, I didn't manage to reproduce the 
crash.

Next, I turned to reverse engineering the exact location of the crash, which 
seems to happen while trying to increment a 
`boost::circular_buffer::iterator` (i.e., the container type of 
`Master::Framework::completedTasks`). This indicates that we're probably 
pushing values into this container while simultaneously iterating over it in 
another thread.

However, I still haven't figured out a theory for how this could happen, or how 
to induce the crash locally, since all mutations seem to be happening on the 
Master actor and thus should not be happening in parallel.

> Mesos master segfaults when responding to /state requests.
> --
>
> Key: MESOS-9177
> URL: https://issues.apache.org/jira/browse/MESOS-9177
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
> stack trace: ***
>  @ 0x7f367e7226d0 (unknown)
>  @ 0x7f3681266913 
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @ 0x7f3681266af0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36812882d0 
> mesos::internal::master::FullFrameworkWriter::operator()()
>  @ 0x7f36812889d0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f368121aef0 
> 

[jira] [Commented] (MESOS-9177) Mesos master segfaults when responding to /state requests.

2018-08-22 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589281#comment-16589281
 ] 

Benno Evers commented on MESOS-9177:


As a preliminary update, I managed to narrow down the location of the segfault 
to this lambda inside the FullFrameworkWriter:

{code}
  foreach (const Owned& task, framework_->completedTasks) {
// Skip unauthorized tasks.
if (!approvers_->approved(*task, framework_->info)) {
  continue;
}

writer->element(*task);
  }
{code}

Since the Mesos cluster where this segfault was observed runs with a 
non-standard (and quite low) value of --max_completed_tasks_per_framework=20, I 
tried reproducing the crash by starting a mesos-master built from the same 
commit locally, using the `no-executor-framework` to run many tasks, and 
repeatedly hitting the state endpoint on this master. While I was able to 
overload the JSON renderer of my web browser, I didn't manage to reproduce the 
crash.

Next, I turned to reverse engineering the exact location of the crash, which 
seems to happen while trying to increment a 
`boost::circular_buffer::iterator` (i.e., the container type of 
`Master::Framework::completedTasks`). This indicates that we're probably 
pushing values into this container while simultaneously iterating over it in 
another thread.

However, I still haven't figured out a theory for how this could happen, or how 
to induce the crash locally, since all mutations seem to be happening on the 
Master actor and thus should not be happening in parallel.

> Mesos master segfaults when responding to /state requests.
> --
>
> Key: MESOS-9177
> URL: https://issues.apache.org/jira/browse/MESOS-9177
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
> stack trace: ***
>  @ 0x7f367e7226d0 (unknown)
>  @ 0x7f3681266913 
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @ 0x7f3681266af0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36812882d0 
> mesos::internal::master::FullFrameworkWriter::operator()()
>  @ 0x7f36812889d0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f368121aef0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f3681241be3 
> _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
>  @ 0x7f3681242760 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
>  @ 0x7f368215f60e process::http::OK::OK()
>  @ 0x7f3681219061 
> _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
>  @ 0x7f36812212c0 
> _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
>  @ 0x7f36812215ac 
> 

[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-08-22 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589159#comment-16589159
 ] 

James Peach commented on MESOS-9178:


/cc [~bmahler]

> Add a metric for master failover time.
> --
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Xudong Ni
>Assignee: Xudong Ni
>Priority: Minor
>
> Quote from Yan Xu: Previously the argument against it was that you don't know 
> if all agents are going to come back after a master failover, so there's not a 
> certain point that marks the end of "full reregistration of all agents". 
> However, empirically the number of agents usually doesn't change during the 
> failover, and there's an upper bound on such a wait (after a 10min timeout the 
> agents that haven't reregistered are going to be marked unreachable, so we can 
> just use that to stop the timer).
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for" i.e., either reregistered or marked as 
> unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8345) Improve master responsiveness while serving state information.

2018-08-22 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-8345:
--

Assignee: Alexander Rukletsov

> Improve master responsiveness while serving state information.
> --
>
> Key: MESOS-8345
> URL: https://issues.apache.org/jira/browse/MESOS-8345
> Project: Mesos
>  Issue Type: Epic
>  Components: HTTP API, master
>Reporter: Benjamin Mahler
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: mesosphere, performance
>
> Currently when state is requested from the master, the response is built 
> using the master actor. This means that when the master is building an 
> expensive state response, the master is locked and cannot process other 
> events. This in turn can lead to higher latency on further requests to state. 
> Previous performance improvements to JSON generation (MESOS-4235) alleviated 
> this issue, but for large clusters with a lot of clients this can still be a 
> problem.
> It's possible to serve state outside of the master actor by streaming the 
> state (re-using the existing streaming operator API) into one or more other 
> actors and serving from there.
> NOTE: I believe this approach will incur a small performance cost during 
> master failover, since the master has to perform an additional copy of state 
> that it fans out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9178) mesos metrics: master failover time

2018-08-22 Thread Xudong Ni (JIRA)
Xudong Ni created MESOS-9178:


 Summary: mesos metrics: master failover time
 Key: MESOS-9178
 URL: https://issues.apache.org/jira/browse/MESOS-9178
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Xudong Ni


Quote from Yan Xu: Previously the argument against it was that you don't know 
if all agents are going to come back after a master failover, so there's not a 
certain point that marks the end of "full reregistration of all agents". 
However, empirically the number of agents usually doesn't change during the 
failover, and there's an upper bound on such a wait (after a 10min timeout the 
agents that haven't reregistered are going to be marked unreachable, so we can 
just use that to stop the timer).

So we can define failover time as "the time it takes for all agents recovered 
from the registry to be accounted for" i.e., either reregistered or marked as 
unreachable.

This is of course looking at failover from an agent reregistration perspective.

Later after we add framework info persistence, we can similarly define the 
framework perspective using reregistration time or reconciliation time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9178) mesos metrics: master failover time

2018-08-22 Thread Xudong Ni (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xudong Ni reassigned MESOS-9178:


Assignee: Xudong Ni

> mesos metrics: master failover time
> ---
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Xudong Ni
>Assignee: Xudong Ni
>Priority: Minor
>
> Quote from Yan Xu: Previously the argument against it was that you don't know 
> if all agents are going to come back after a master failover, so there's not a 
> certain point that marks the end of "full reregistration of all agents". 
> However, empirically the number of agents usually doesn't change during the 
> failover, and there's an upper bound on such a wait (after a 10min timeout the 
> agents that haven't reregistered are going to be marked unreachable, so we can 
> just use that to stop the timer).
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for" i.e., either reregistered or marked as 
> unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9177) Mesos master segfaults when responding to /state requests.

2018-08-22 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9177:
-

Shepherd: Alexander Rukletsov
Assignee: Benno Evers
  Sprint: Mesosphere Sprint 2018-27
Story Points: 3
Target Version/s: 1.7.0

> Mesos master segfaults when responding to /state requests.
> --
>
> Key: MESOS-9177
> URL: https://issues.apache.org/jira/browse/MESOS-9177
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
> stack trace: ***
>  @ 0x7f367e7226d0 (unknown)
>  @ 0x7f3681266913 
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @ 0x7f3681266af0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36812882d0 
> mesos::internal::master::FullFrameworkWriter::operator()()
>  @ 0x7f36812889d0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f368121aef0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f3681241be3 
> _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
>  @ 0x7f3681242760 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
>  @ 0x7f368215f60e process::http::OK::OK()
>  @ 0x7f3681219061 
> _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
>  @ 0x7f36812212c0 
> _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
>  @ 0x7f36812215ac 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_
>  @ 0x7f36821f3541 process::ProcessBase::consume()
>  @ 0x7f3682209fbc process::ProcessManager::resume()
>  @ 0x7f368220fa76 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>  @ 0x7f367eefc2b0 (unknown)
>  @ 0x7f367e71ae25 start_thread
>  @ 0x7f367e444bad __clone
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-22 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588976#comment-16588976
 ] 

Qian Zhang edited comment on MESOS-8568 at 8/22/18 3:02 PM:


Reproduce steps:

1. To simulate the failure of launching nested container via health check, 
change `CgroupsIsolatorProcess::isolate` a bit:
{code:java}
Future<Nothing> CgroupsIsolatorProcess::isolate(
    const ContainerID& containerId,
    pid_t pid)
{
+  if (strings::startsWith(containerId.value(), "check")) {
+    return Failure("==Fake error==");
+  }
+
{code}
2. Start Mesos master and agent.
{code:java}
$ sudo ./bin/mesos-master.sh --work_dir=/home/qzhang/opt/mesos

$ sudo ./bin/mesos-slave.sh --master=10.0.49.2:5050 
--work_dir=/home/qzhang/opt/mesos 
--isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem
{code}
3. Launch a nested container with check enabled.
{code:java}
$ cat task_group_health_check.json
{
  "tasks":[
{
  "name" : "test",
  "task_id" : {"value" : "test"},
  "agent_id": {"value" : ""},
  "resources": [
{"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
{"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
  ],
  "command": {
"value": "touch aaa && sleep 1000"
  },
  "check": {
"type": "COMMAND",
"command": {
  "command": {
   "value": "ls aaa  > /dev/null"
  }
},
"delay_seconds": 5,
"interval_seconds": 3
  }
}
  ]
}

$ src/mesos-execute --master=10.0.49.2:5050 
--task_group=file:///home/qzhang/workspace/config/task_group_health_check.json{code}
5. After a few minutes, there will be a lot of check containers' sandbox 
directories left unremoved.
{code:java}
$ ls -la 
/home/qzhang/opt/mesos/slaves/c355abce-0088-4196-8376-d54c9963abdd-S0/frameworks/c355abce-0088-4196-8376-d54c9963abdd-/executors/default-executor/runs/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/containers/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/containers
 | grep check | wc -l
119
{code}
And in the default executor's stderr, we see a lot of warning messages.
{code:java}
...
W0822 07:37:45.084581 19377 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==Fake error==) while launching COMMAND check 
for task 'test'
I0822 07:37:45.085053 19377 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.092411 19362 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
 used for the COMMAND check for task 'test'
I0822 07:37:48.093101 19362 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.130527 19373 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==Fake error==) while launching COMMAND check 
for task 'test'
W0822 07:37:51.099179 19360 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-5c7af0fc-ad73-4870-aba8-65a3fb4eae12'
 used for the COMMAND check for task 'test'
I0822 07:37:51.099799 19360 checker_process.cpp:457] COMMAND check for task 
'test' is not available
...{code}
So every time the default executor called `REMOVE_NESTED_CONTAINER` to 
remove the previous check container, the call failed with a 500 error. The 
reason this call failed is that the check container had not terminated yet 
(it was still in the `DESTROYING` state); the agent log below also shows this.
{code:java}
I0822 07:37:45.051453 19063 http.cpp:3366] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
I0822 07:37:45.058904 19088 switchboard.cpp:316] Container logger module 
finished preparing container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586;
 IOSwitchboard server is required
I0822 07:37:45.065444 19088 systemd.cpp:98] Assigned child process '19410' to 
'mesos_executors.slice'
I0822 07:37:45.065724 19088 switchboard.cpp:604] Created I/O switchboard server 
(pid: 19410) listening on socket file 
'/tmp/mesos-io-switchboard-048e2be0-4a2b-4c00-a846-0e8137507a85' for container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
I0822 07:37:45.069316 19062 linux_launcher.cpp:492] Launching nested container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
 and 

[jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-22 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588976#comment-16588976
 ] 

Qian Zhang edited comment on MESOS-8568 at 8/22/18 3:01 PM:


Reproduce steps:

1. To simulate the failure of launching nested container via health check, 
change `CgroupsIsolatorProcess::isolate` a bit:
{code:java}
Future<Nothing> CgroupsIsolatorProcess::isolate(
    const ContainerID& containerId,
    pid_t pid)
{
+  if (strings::startsWith(containerId.value(), "check")) {
+    return Failure("==Fake error==");
+  }
+
{code}
2. Start Mesos master and agent.
{code:java}
$ sudo ./bin/mesos-master.sh --work_dir=/home/qzhang/opt/mesos

$ sudo ./bin/mesos-slave.sh --master=10.0.49.2:5050 
--work_dir=/home/qzhang/opt/mesos 
--isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem
{code}
3. Launch a nested container with check enabled.
{code:java}
$ cat task_group_health_check.json
{
  "tasks":[
{
  "name" : "test",
  "task_id" : {"value" : "test"},
  "agent_id": {"value" : ""},
  "resources": [
{"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
{"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
  ],
  "command": {
"value": "touch aaa && sleep 1000"
  },
  "check": {
"type": "COMMAND",
"command": {
  "command": {
   "value": "ls aaa  > /dev/null"
  }
},
"delay_seconds": 5,
"interval_seconds": 3
  }
}
  ]
}

$ src/mesos-execute --master=10.0.49.2:5050 
--task_group=file:///home/qzhang/workspace/config/task_group_health_check.json{code}
5. After a few minutes, there will be a lot of check containers' sandbox 
directories left unremoved.
{code:java}
$ ls -la 
/home/qzhang/opt/mesos/slaves/c355abce-0088-4196-8376-d54c9963abdd-S0/frameworks/c355abce-0088-4196-8376-d54c9963abdd-/executors/default-executor/runs/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/containers/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/containers
 | grep check | wc -l
119
{code}
And in the default executor's stderr, we see a lot of warning messages.
{code:java}
...
W0822 07:37:45.084581 19377 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==Fake error==) while launching COMMAND check 
for task 'test'
I0822 07:37:45.085053 19377 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.092411 19362 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
 used for the COMMAND check for task 'test'
I0822 07:37:48.093101 19362 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.130527 19373 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==Fake error==) while launching COMMAND check 
for task 'test'
W0822 07:37:51.099179 19360 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-5c7af0fc-ad73-4870-aba8-65a3fb4eae12'
 used for the COMMAND check for task 'test'
I0822 07:37:51.099799 19360 checker_process.cpp:457] COMMAND check for task 
'test' is not available
...{code}
So every time the default executor called `REMOVE_NESTED_CONTAINER` to 
remove the previous check container, the call failed with a 500 error. The 
reason this call failed is that the check container had not terminated yet 
(it was still in the `DESTROYING` state); the agent log below also shows this.
{code:java}
I0822 07:37:45.051453 19063 http.cpp:3366] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
I0822 07:37:45.058904 19088 switchboard.cpp:316] Container logger module 
finished preparing container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586;
 IOSwitchboard server is required
I0822 07:37:45.065444 19088 systemd.cpp:98] Assigned child process '19410' to 
'mesos_executors.slice'
I0822 07:37:45.065724 19088 switchboard.cpp:604] Created I/O switchboard server 
(pid: 19410) listening on socket file 
'/tmp/mesos-io-switchboard-048e2be0-4a2b-4c00-a846-0e8137507a85' for container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
I0822 07:37:45.069316 19062 linux_launcher.cpp:492] Launching nested container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
 and 

[jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-22 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588976#comment-16588976
 ] 

Qian Zhang edited comment on MESOS-8568 at 8/22/18 2:56 PM:


Reproduce steps:

1. To simulate the failure of launching nested container via health check, 
change `CgroupsIsolatorProcess::isolate` a bit:
{code:java}
Future<Nothing> CgroupsIsolatorProcess::isolate(
    const ContainerID& containerId,
    pid_t pid)
{
+  if (strings::startsWith(containerId.value(), "check")) {
+    return Failure("==Fake error==");
+  }
+
{code}
2. Start Mesos master and agent.
{code:java}
$ sudo ./bin/mesos-master.sh --work_dir=/home/qzhang/opt/mesos

$ sudo ./bin/mesos-slave.sh --master=10.0.49.2:5050 
--work_dir=/home/qzhang/opt/mesos 
--isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem
{code}
3. Launch a nested container with check enabled.
{code:java}
$ cat task_group_health_check.json
{
  "tasks":[
{
  "name" : "test",
  "task_id" : {"value" : "test"},
  "agent_id": {"value" : ""},
  "resources": [
{"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
{"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
  ],
  "command": {
"value": "touch aaa && sleep 1000"
  },
  "check": {
"type": "COMMAND",
"command": {
  "command": {
   "value": "ls aaa  > /dev/null"
  }
},
"delay_seconds": 5,
"interval_seconds": 3
  }
}
  ]
}

$ src/mesos-execute --master=10.0.49.2:5050 
--task_group=file:///home/qzhang/workspace/config/task_group_health_check.json{code}
5. After a few minutes, there will be a lot of check containers' sandbox 
directories left unremoved.
{code:java}
$ ls -la 
/home/qzhang/opt/mesos/slaves/c355abce-0088-4196-8376-d54c9963abdd-S0/frameworks/c355abce-0088-4196-8376-d54c9963abdd-/executors/default-executor/runs/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/containers/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/containers
 | grep check | wc -l
119
{code}
And in the default executor's stderr, we see a lot of warning messages.
{code:java}
...
W0822 07:37:45.084581 19377 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==Fake error==) while launching COMMAND check 
for task 'test'
I0822 07:37:45.085053 19377 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.092411 19362 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
 used for the COMMAND check for task 'test'
I0822 07:37:48.093101 19362 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.130527 19373 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==Fake error==) while launching COMMAND check 
for task 'test'
W0822 07:37:51.099179 19360 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-5c7af0fc-ad73-4870-aba8-65a3fb4eae12'
 used for the COMMAND check for task 'test'
I0822 07:37:51.099799 19360 checker_process.cpp:457] COMMAND check for task 
'test' is not available
...{code}

So every time the default executor calls `REMOVE_NESTED_CONTAINER` to 
remove the previous check container, the call fails with a 500 error. The 
reason this call fails is that the check container has not terminated yet 
(it is still in the `DESTROYING` state); the agent log below also shows this.
{code:java}
I0822 07:37:45.051453 19063 http.cpp:3366] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
I0822 07:37:45.058904 19088 switchboard.cpp:316] Container logger module 
finished preparing container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586;
 IOSwitchboard server is required
I0822 07:37:45.065444 19088 systemd.cpp:98] Assigned child process '19410' to 
'mesos_executors.slice'
I0822 07:37:45.065724 19088 switchboard.cpp:604] Created I/O switchboard server 
(pid: 19410) listening on socket file 
'/tmp/mesos-io-switchboard-048e2be0-4a2b-4c00-a846-0e8137507a85' for container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
I0822 07:37:45.069316 19062 linux_launcher.cpp:492] Launching nested container 

[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-22 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588976#comment-16588976
 ] 

Qian Zhang commented on MESOS-8568:
---

Reproduce steps:

1. To simulate the failure of launching nested container via health check, 
change `CgroupsIsolatorProcess::isolate` a bit:
{code:java}
Future<Nothing> CgroupsIsolatorProcess::isolate(
    const ContainerID& containerId,
    pid_t pid)
{
+  if (strings::startsWith(containerId.value(), "check")) {
+    return Failure("==Fake error==");
+  }
+
{code}
2. Start Mesos master and agent.
{code:java}
$ sudo ./bin/mesos-master.sh --work_dir=/home/qzhang/opt/mesos

$ sudo ./bin/mesos-slave.sh --master=10.0.49.2:5050 --port=36251 
--work_dir=/home/qzhang/opt/mesos 
--isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem
{code}
3. Launch a nested container with check enabled.
{code:java}
$ cat task_group_health_check.json
{
  "tasks":[
{
  "name" : "test",
  "task_id" : {"value" : "test"},
  "agent_id": {"value" : ""},
  "resources": [
{"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
{"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
  ],
  "command": {
"value": "touch aaa && sleep 1000"
  },
  "check": {
"type": "COMMAND",
"command": {
  "command": {
   "value": "ls aaa  > /dev/null"
  }
},
"delay_seconds": 5,
"interval_seconds": 3
  }
}
  ]
}

$ src/mesos-execute --master=10.0.49.2:5050 
--task_group=file:///home/qzhang/workspace/config/task_group_health_check.json{code}
5. After a few minutes, there will be a lot of check containers' sandbox 
directories left unremoved.
{code:java}
$ ls -la 
/home/qzhang/opt/mesos/slaves/c355abce-0088-4196-8376-d54c9963abdd-S0/frameworks/c355abce-0088-4196-8376-d54c9963abdd-/executors/default-executor/runs/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/containers/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/containers
 | grep check | wc -l
119
{code}
And in the default executor's stderr, we see a lot of warning messages.
{code:java}
...
W0822 07:37:45.084581 19377 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==Fake error==) while launching COMMAND check 
for task 'test'
I0822 07:37:45.085053 19377 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.092411 19362 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
 used for the COMMAND check for task 'test'
I0822 07:37:48.093101 19362 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.130527 19373 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==Fake error==) while launching COMMAND check 
for task 'test'
W0822 07:37:51.099179 19360 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-5c7af0fc-ad73-4870-aba8-65a3fb4eae12'
 used for the COMMAND check for task 'test'
I0822 07:37:51.099799 19360 checker_process.cpp:457] COMMAND check for task 
'test' is not available
...{code}

So every time the default executor calls `REMOVE_NESTED_CONTAINER` to 
remove the previous check container, the call fails with a 500 error. The 
reason this call fails is that the check container has not terminated yet 
(it is still in the `DESTROYING` state); the agent log below also shows this.
{code:java}
I0822 07:37:45.051453 19063 http.cpp:3366] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
I0822 07:37:45.058904 19088 switchboard.cpp:316] Container logger module 
finished preparing container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586;
 IOSwitchboard server is required
I0822 07:37:45.065444 19088 systemd.cpp:98] Assigned child process '19410' to 
'mesos_executors.slice'
I0822 07:37:45.065724 19088 switchboard.cpp:604] Created I/O switchboard server 
(pid: 19410) listening on socket file 
'/tmp/mesos-io-switchboard-048e2be0-4a2b-4c00-a846-0e8137507a85' for container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
I0822 07:37:45.069316 19062 linux_launcher.cpp:492] Launching nested container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
 and cloning with namespaces 

[jira] [Created] (MESOS-9177) Mesos master segfaults when responding to /state requests.

2018-08-22 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9177:
--

 Summary: Mesos master segfaults when responding to /state requests.
 Key: MESOS-9177
 URL: https://issues.apache.org/jira/browse/MESOS-9177
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.7.0
Reporter: Alexander Rukletsov


{noformat}
 *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
stack trace: ***
 @ 0x7f367e7226d0 (unknown)
 @ 0x7f3681266913 
_ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
 @ 0x7f3681266af0 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
 @ 0x7f36812882d0 mesos::internal::master::FullFrameworkWriter::operator()()
 @ 0x7f36812889d0 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
 @ 0x7f368121aef0 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
 @ 0x7f3681241be3 
_ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
 @ 0x7f3681242760 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
 @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
 @ 0x7f368215f60e process::http::OK::OK()
 @ 0x7f3681219061 
_ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
 @ 0x7f36812212c0 
_ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
 @ 0x7f36812215ac 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_
 @ 0x7f36821f3541 process::ProcessBase::consume()
 @ 0x7f3682209fbc process::ProcessManager::resume()
 @ 0x7f368220fa76 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
 @ 0x7f367eefc2b0 (unknown)
 @ 0x7f367e71ae25 start_thread
 @ 0x7f367e444bad __clone
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9174) Unexpected containers transition from RUNNING to DESTROYING during recovery

2018-08-22 Thread Stephan Erb (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1651#comment-1651
 ] 

Stephan Erb commented on MESOS-9174:


[~jieyu], we have found something interesting related to 
[https://reviews.apache.org/r/62800]

I have not checked the entire cluster, but at first sight it seems as if 
there are problems related to systemd.

*Nodes without recovery issues*
 Those are running Mesos 1.6.1 and systemd 232-25+deb9u4 (Debian Stretch).
{code:java}
$ systemd-cgls
Control group /:
-.slice
├─mesos
│ ├─9b70ff19-238c-4520-978c-688b83e705ce
│ │ ├─ 5129 /usr/lib/x86_64-linux-gnu/mesos/mesos-containerizer launch
...
├─init.scope
│ └─1 /sbin/init
└─system.slice
  ├─mesos-agent.service
  │ └─1472 /usr/sbin/mesos-agent --master=file:///etc/mesos-agent/zk ...
{code}
*Nodes with recovery issues*
 Those are running Mesos 1.6.1 and systemd 237-3~bpo9+1 (Debian Stretch 
backports).
{code:java}
$ systemd-cgls
Control group /:
-.slice
├─init.scope
│ └─1 /sbin/init
└─system.slice
  ├─mesos-agent.service
  │ ├─ 19151 haproxy -f haproxy.cfg -p haproxy.pid -sf 149
  │ ├─ 39633 /usr/lib/x86_64-linux-gnu/mesos/mesos-containerizer launch
  │ ├─ 39638 sh -c ${MESOS_SANDBOX=.}/thermos_executor_wrapper.sh 
  │ ├─ 39639 python2.7 ...
  │ ├─ 39684 /usr/bin/python2.7...
  │ ├─ 39710 /usr/bin/python2.7 ...
  │ ├─ 39714 /usr/bin/ruby /usr/bin/synapse -c synapse.conf
  │ ├─ 39775 haproxy -f haproxy.cfg -p haproxy.pid -sf
  │ ├─ 39837 /usr/bin/python2.7 ...
{code}
In particular, there is no {{mesos}} group/section even though it shows up 
perfectly fine in systemd-cgtop:
{code:java}
/                                            1700  -  45.2G    -  -
/mesos                                          -  -  38.4G    -  -
/mesos/144dc11d-dbd0-42e8-89e4-a72384e777df     -  -   1.6G    -  -
/mesos/15a8e488-495c-4db8-a11b-7e8277ec4c93     -  -   3.1G    -  -
/mesos/2a2b5913-2445-4111-9d18-71abc9f1f8cd     -  -  1021.2M  -  -
/mesos/2e1c5c91-6a80-4242-b105-023c1eb2c89d     -  -   2.6G    -  -
/mesos/356c5c0f-2ae0-4dfc-9415-d1dbeb172542     -  -  898.4M   -  -
/mesos/3baf4930-4332-4206-91d5-d39ea6bb3389     -  -   3.1G    -  -
/mesos/3d1b9554-911d-44ee-b204-fe622f02ef7a     -  -  845.0M   -  -
/mesos/431aa2a0-11e4-4bf3-b888-ee10cf689326     -  -   1.3G    -  -
/mesos/94f8e3bb-360a-4694-9359-4da10cb4e5df     -  -   1.2G    -  -
/mesos/9d1b3251-6c61-404e-88d0-03319d1a508c     -  -   3.2G    -  -
/mesos/b5bb9133-4093-4bc6-90c1-3656b20559bf     -  -  417.6M   -  -
/mesos/b89095dd-21bc-4255-86c8-14bd7cd0ac2a     -  -   1.5G    -  -
/system.slice                                1137  -   8.7G    -  -
{code}
The output of the faulty node above is from the same node that I have used to 
pull the mesos-agent.log.

I will try to reproduce the issue by upgrading systemd in another test 
environment and then report back. Newer systemd versions have changed the 
behaviour of {{Delegate=}}, which could indeed be related to the observed issue.

> Unexpected containers transition from RUNNING to DESTROYING during recovery
> ---
>
> Key: MESOS-9174
> URL: https://issues.apache.org/jira/browse/MESOS-9174
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0, 1.6.1
>Reporter: Stephan Erb
>Priority: Major
> Attachments: mesos-agent.log, mesos-executor-stderr.log
>
>
> I am trying to hunt down a weird issue where sometimes restarting a Mesos 
> agent takes down all Mesos containers. The containers die without an apparent 
> cause:
> {code}
> I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container 
> 02da7be0-271e-449f-9554-dc776adb29a9 has exited
> I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container 
> 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state
> I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state 
> of container 

[jira] [Created] (MESOS-9176) Mesos does not work properly on modern Ubuntu distributions.

2018-08-22 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9176:
--

 Summary: Mesos does not work properly on modern Ubuntu 
distributions.
 Key: MESOS-9176
 URL: https://issues.apache.org/jira/browse/MESOS-9176
 Project: Mesos
  Issue Type: Epic
Affects Versions: 1.7.0
 Environment: Ubuntu 17.10
Ubuntu 18.04
Reporter: Alexander Rukletsov


We have observed several issues in various components on modern Ubuntu 
releases, e.g., 17.10 and 18.04. Needless to say, we need to ensure Mesos 
compiles and runs fine on those distros.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)