[jira] [Commented] (MESOS-9771) Mask sensitive procfs paths.

2019-05-06 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834402#comment-16834402
 ] 

James Peach commented on MESOS-9771:


Since {{/proc/keys}} gets masked, we should probably mask {{/proc/key-users}} 
too. Weird that I don't see other containerizers doing that.

My main concern with this change is compatibility with containerized services 
like CSI plugins, which may need privileged access to the host. Masking all of 
these paths could break that kind of service.

There are a few possible solutions:
1. Skip the masking based on properties of the launch, e.g. whether the Docker 
{{privileged}} flag is set, or whether the container is joining the host's PID 
namespace.
2. Add a flag that specifies the set of paths to mask, so that operators can 
override it with configuration.
3. Unconditionally do the masking.

If we go down the path of (2), operators who need particular privileged 
containers to see this information will still be stranded, since a single 
agent-wide flag cannot vary per container; my preference would therefore be 
something closer to (1).
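
To make (1) concrete, here is a minimal, self-contained sketch of the kind of 
predicate it implies; the types and field names below are hypothetical 
stand-ins, not the actual Mesos {{ContainerConfig}} API:

{noformat}
// Hypothetical sketch of option (1): decide whether to mask based on
// properties of the launch. These types are illustrative stand-ins,
// not the real Mesos ContainerConfig.
#include <iostream>

struct LaunchProperties {
  bool dockerPrivileged;        // The Docker `privileged` flag was requested.
  bool sharesHostPidNamespace;  // The container joins the host PID namespace.
};

// Mask sensitive procfs paths only for containers that were not
// deliberately granted host-level visibility.
bool shouldMaskProcfs(const LaunchProperties& launch) {
  return !launch.dockerPrivileged && !launch.sharesHostPidNamespace;
}

int main() {
  std::cout << shouldMaskProcfs({true, true}) << std::endl;    // 0: CSI-style service
  std::cout << shouldMaskProcfs({false, false}) << std::endl;  // 1: ordinary task
  return 0;
}
{noformat}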

If we prefer (3), note that we already unconditionally make certain container 
paths read-only, which could be regarded as precedent.

/cc [~jieyu] [~gilbert] [~jasonlai]


> Mask sensitive procfs paths.
> 
>
> Key: MESOS-9771
> URL: https://issues.apache.org/jira/browse/MESOS-9771
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> We already have a set of procfs paths that we mark read-only in the 
> containerizer, but there are additional paths that are considered sensitive 
> by other containerizers and are masked altogether:
> {noformat}
> "/proc/asound"
> "/proc/acpi"
> "/proc/kcore"
> "/proc/keys"
> "/proc/latency_stats"
> "/proc/timer_list"
> "/proc/timer_stats"
> "/proc/sched_debug"
> "/sys/firmware"
> "/proc/scsi"
> {noformat}
> Masking is done by mounting {{/dev/null}} on files, and an empty, read-only 
> {{tmpfs}} on directories.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9771) Mask sensitive procfs paths.

2019-05-06 Thread James Peach (JIRA)
James Peach created MESOS-9771:
--

 Summary: Mask sensitive procfs paths.
 Key: MESOS-9771
 URL: https://issues.apache.org/jira/browse/MESOS-9771
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


We already have a set of procfs paths that we mark read-only in the 
containerizer, but there are additional paths that are considered sensitive by 
other containerizers and are masked altogether:

{noformat}
  "/proc/asound"
   "/proc/acpi"
"/proc/kcore"
"/proc/keys"
"/proc/latency_stats"
"/proc/timer_list"
"/proc/timer_stats"
"/proc/sched_debug"
"/sys/firmware"
"/proc/scsi"
{noformat}

Masking is done by mounting {{/dev/null}} on files, and an empty, read-only 
{{tmpfs}} on directories.
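
For illustration, a minimal sketch (not the actual Mesos containerizer code) 
of those two mount operations, assuming it runs inside the container's mount 
namespace with sufficient privilege:

{noformat}
// Sketch of the masking described above. Error handling kept minimal.
#include <sys/mount.h>
#include <cstdio>

// Mask a file by bind-mounting /dev/null over it.
int maskFile(const char* path) {
  return mount("/dev/null", path, nullptr, MS_BIND, nullptr);
}

// Mask a directory by mounting an empty, read-only tmpfs over it.
int maskDirectory(const char* path) {
  return mount("tmpfs", path, "tmpfs", MS_RDONLY, nullptr);
}

int main() {
  if (maskFile("/proc/kcore") != 0 || maskDirectory("/proc/scsi") != 0) {
    perror("mount");
    return 1;
  }
  return 0;
}
{noformat}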



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9770) Add no-new-privileges isolator

2019-05-06 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834398#comment-16834398
 ] 

James Peach commented on MESOS-9770:


/cc [~jieyu] [~gilbert] [~abudnik]

> Add no-new-privileges isolator
> --
>
> Key: MESOS-9770
> URL: https://issues.apache.org/jira/browse/MESOS-9770
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> To give security-minded operators more defense in depth, add a {{linux/nnp}} 
> isolator that sets the no-new-privileges bit before starting the executor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9770) Add no-new-privileges isolator

2019-05-06 Thread James Peach (JIRA)
James Peach created MESOS-9770:
--

 Summary: Add no-new-privileges isolator
 Key: MESOS-9770
 URL: https://issues.apache.org/jira/browse/MESOS-9770
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


To give security-minded operators more defense in depth, add a {{linux/nnp}} 
isolator that sets the no-new-privileges bit before starting the executor.
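
For reference, the no-new-privileges bit is set with {{prctl(2)}}; a minimal 
sketch (not the proposed isolator code) of what would run in the container 
before exec'ing the executor:

{noformat}
// Minimal sketch: set the no-new-privileges bit, then exec. Once set,
// execve() can no longer grant privileges (e.g. via setuid binaries),
// and the bit is inherited across fork() and execve().
#include <sys/prctl.h>
#include <unistd.h>
#include <cstdio>

int main() {
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
    perror("prctl");
    return 1;
  }
  // Demonstrate the inherited bit: prints "NoNewPrivs: 1".
  execl("/bin/grep", "grep", "NoNewPrivs", "/proc/self/status",
        static_cast<char*>(nullptr));
  perror("execl");
  return 1;
}
{noformat}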



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9769) Add direct containerized support for filesystem operations

2019-05-06 Thread James Peach (JIRA)
James Peach created MESOS-9769:
--

 Summary: Add direct containerized support for filesystem operations
 Key: MESOS-9769
 URL: https://issues.apache.org/jira/browse/MESOS-9769
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


When setting up the container filesystems, we use `pre_exec_commands` to make 
ABI symlinks and other things. The problem with this is that, depending on the 
order of operations, we may not have the full security policy in place yet, and 
since we are running in the context of the container's mount namespaces, the 
programs we execute are under the control of whoever built the container image.

[~jieyu] and I previously discussed adding filesystem operations to the 
`ContainerLaunchInfo`. Just `ln` would be sufficient for the `cgroups` and 
`linux/filesystem` isolators. Secrets and port mapping isolators need more, so 
we should discuss and file new tickets if necessary.
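
As a sketch of the difference: instead of exec'ing `ln` from inside the 
container image via `pre_exec_commands`, the launcher could perform the 
operation itself with symlink(2), so that no container-controlled binary ever 
runs. The paths below are illustrative, not the real isolator targets:

{noformat}
// Illustrative sketch: create an ABI symlink directly from the launcher
// rather than exec'ing a program from the (untrusted) container image.
#include <unistd.h>
#include <cerrno>
#include <cstdio>

int main() {
  // Hypothetical target/link pair; the real one depends on the isolator.
  if (symlink("/sys/fs/cgroup/freezer", "/rootfs/cgroup") != 0 &&
      errno != EEXIST) {
    perror("symlink");
    return 1;
  }
  return 0;
}
{noformat}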



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag

2019-05-06 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834357#comment-16834357
 ] 

James Peach edited comment on MESOS-9768 at 5/7/19 3:56 AM:


/cc [~jieyu] [~gilbert]


was (Author: jamespeach):
/cc [~jieyu] @gilbert

> Allow operators to mount the container rootfs with the `nosuid` flag
> 
>
> Key: MESOS-9768
> URL: https://issues.apache.org/jira/browse/MESOS-9768
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> If cluster users are allowed to launch containers with arbitrary images, 
> those images may contain setuid programs. For security reasons (auditing, 
> privilege escalation), operators may wish to ensure that setuid programs 
> cannot be used within a container.
>  
> We should provide a way for operators to specify that container volumes 
> (including `/`) should be mounted with the `nosuid` flag.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag

2019-05-06 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834357#comment-16834357
 ] 

James Peach commented on MESOS-9768:


/cc [~jieyu] @gilbert

> Allow operators to mount the container rootfs with the `nosuid` flag
> 
>
> Key: MESOS-9768
> URL: https://issues.apache.org/jira/browse/MESOS-9768
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> If cluster users are allowed to launch containers with arbitrary images, 
> those images may contain setuid programs. For security reasons (auditing, 
> privilege escalation), operators may wish to ensure that setuid programs 
> cannot be used within a container.
>  
> We should provide a way for operators to specify that container volumes 
> (including `/`) should be mounted with the `nosuid` flag.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag

2019-05-06 Thread James Peach (JIRA)
James Peach created MESOS-9768:
--

 Summary: Allow operators to mount the container rootfs with the 
`nosuid` flag
 Key: MESOS-9768
 URL: https://issues.apache.org/jira/browse/MESOS-9768
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


If cluster users are allowed to launch containers with arbitrary images, those 
images may contain setuid programs. For security reasons (auditing, privilege 
escalation), operators may wish to ensure that setuid programs cannot be used 
within a container.

We should provide a way for operators to specify that container volumes 
(including `/`) should be mounted with the `nosuid` flag.
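
For illustration, adding `nosuid` to an existing bind mount is a remount 
operation; a minimal sketch (not proposed Mesos code, and the rootfs path is 
an illustrative placeholder):

{noformat}
// Sketch: remount the container rootfs bind mount with MS_NOSUID so that
// setuid/setgid bits are ignored inside the container.
#include <sys/mount.h>
#include <cstdio>

int main() {
  const char* rootfs = "/var/run/mesos/containers/<id>/rootfs";  // illustrative
  if (mount(nullptr, rootfs, nullptr,
            MS_REMOUNT | MS_BIND | MS_NOSUID, nullptr) != 0) {
    perror("mount");
    return 1;
  }
  return 0;
}
{noformat}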



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Gaurav Garg (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834156#comment-16834156
 ] 

Gaurav Garg commented on MESOS-9767:


I only queried the /metrics/snapshot endpoint when the master was stuck; next 
time I will query the /master/health endpoint as well.
Unfortunately, I captured the stack trace only once. My bad. Next time I will 
capture multiple stack traces.

> Add self health monitoring in Mesos master
> --
>
> Key: MESOS-9767
> URL: https://issues.apache.org/jira/browse/MESOS-9767
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Affects Versions: 1.6.0
>Reporter: Gaurav Garg
>Priority: Major
>
> We have seen an issue where the Mesos master got stuck and was not responding 
> to HTTP endpoints like "/metrics/snapshot". This causes calls to the master 
> from frameworks and the metrics collector to hang. Currently we emit a 
> 'master alive' metric via Prometheus. If the master hangs, this metric is not 
> published, and we detect the hang using alerts on top of it. By the time 
> someone has got the alert and restarted the master process, 15-30 minutes 
> have passed, resulting in SLA violations for Mesos cluster users.
> It would be nice to implement self health monitoring to detect whether the 
> Mesos master is hung/stuck. This would let us quickly crash the master 
> process so that another member of the quorum can acquire the ZK leadership 
> lock.
> We can use the "/master/health" endpoint for health checks. Health checks 
> can be initiated in 
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp] 
> just after the child master process is 
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the 
> [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp] 
> for this one. One downside is that HealthChecker currently takes a TaskId as 
> input, which is not valid for a master health check.
> We can add the following flags to control the self health checking:
>  # self_monitoring_enabled: Whether self monitoring is enabled.
>  # self_monitoring_consecutive_failures: After this many consecutive health 
> check failures, the master is crashed.
>  # self_monitoring_interval_secs: Interval at which health checks are 
> performed.
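
A minimal, self-contained sketch of the watchdog loop those flags would 
control (hypothetical; a real implementation would reuse libprocess and an 
actual HTTP client rather than the stubbed check below):

{noformat}
// Hypothetical sketch of the proposed self-monitoring watchdog.
// checkHealth() is a stub standing in for an HTTP GET of /master/health.
#include <chrono>
#include <cstdlib>
#include <thread>

bool checkHealth() {
  return true;  // Stub: report healthy. Real code would hit /master/health.
}

void selfMonitor(int maxConsecutiveFailures, std::chrono::seconds interval) {
  int consecutiveFailures = 0;
  while (true) {
    std::this_thread::sleep_for(interval);
    consecutiveFailures = checkHealth() ? 0 : consecutiveFailures + 1;
    if (consecutiveFailures >= maxConsecutiveFailures) {
      // Crash so that another quorum member can take ZK leadership.
      std::abort();
    }
  }
}

int main() {
  // Corresponds to self_monitoring_consecutive_failures=3 and
  // self_monitoring_interval_secs=10 from the proposal above.
  selfMonitor(3, std::chrono::seconds(10));
  return 0;
}
{noformat}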



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834119#comment-16834119
 ] 

Benjamin Mahler commented on MESOS-9767:


The bizarre thread stack is:

{noformat}
Thread 21 (Thread 0x7fa1e0e4d700 (LWP 85889)):

#0  0x7fa1f05f01c2 in hash_combine_impl (k=52, h=<optimized out>)
at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:264
#1  hash_combine (v=<optimized out>, seed=<optimized out>)
at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337
#2  hash_range<__gnu_cxx::__normal_iterator > > (last=...,
first=52 '4') at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:351
#3  hash_value > (v=...)
at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:410
#4  operator() (this=<optimized out>, v=...)
at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:486
#5  boost::hash_combine (seed=seed@entry=@0x7fa1e0e4c770: 0, v=...)
at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337
#6  0x7fa1f06ad178 in operator() (this=0x7fa1cc02d068, taskId=...)
at /mesos/include/mesos/type_utils.hpp:634
#7  _M_hash_code (this=0x7fa1cc02d068, __k=...) at 
/usr/include/c++/4.9/bits/hashtable_policy.h:1261
#8  std::_Hashtable > > >, 
std::allocator > > > 
>, std::__detail::_Select1st, std::equal_to, 
std::hash, std::__detail::_Mod_range_hashing, 
std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, 
std::__detail::_Hashtable_traits >::count 
(this=this@entry=0x7fa1cc02d068, __k=...)
at /usr/include/c++/4.9/bits/hashtable.h:1336
#9  0x7fa1f0663eb2 in count (__x=..., this=0x7fa1cc02d068)
at /usr/include/c++/4.9/bits/unordered_map.h:592
#10 contains (key=..., this=0x7fa1cc02d068) at 
/mesos/3rdparty/stout/include/stout/hashmap.hpp:88
#11 erase (key=..., this=0x7fa1cc02d050)
at /mesos/3rdparty/stout/include/stout/boundedhashmap.hpp:92
#12 mesos::internal::master::Master::__reregisterSlave(process::UPID const&, 
mesos::internal::ReregisterSlaveMessage&&, process::Future const&) 
(this=0x561dcf047380, pid=...,
reregisterSlaveMessage=<optimized out>, future=...) at /mesos/src/master/master.cpp:7369
#13 0x7fa1f14d54e1 in operator() (args#0=0x561dcf048620, this=<optimized out>)
at /mesos/3rdparty/libprocess/../stout/include/stout/lambda.hpp:443
#14 process::ProcessBase::consume(process::DispatchEvent&&) (this=<optimized out>,
event=<optimized out>) at /mesos/3rdparty/libprocess/src/process.cpp:3577
#15 0x7fa1f14e89b2 in serve (
event=<optimized out>,
this=0x561dcf048620) at 
/mesos/3rdparty/libprocess/include/process/process.hpp:87
#16 process::ProcessManager::resume (this=<optimized out>, 
process=0x561dcf048620)
at /mesos/3rdparty/libprocess/src/process.cpp:3002
#17 0x7fa1f14ee226 in operator() (__closure=0x561dcf119158)
at /mesos/3rdparty/libprocess/src/process.cpp:2511
#18 _M_invoke<> (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1700
#19 operator() (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1688
#20 
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf119140) at /usr/include/c++/4.9/thread:115
#21 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x7fa1ee520064 in start_thread (arg=0x7fa1e0e4d700) at 
pthread_create.c:309
#23 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{noformat}

[~ggarg] is this trace present whenever it's hanging?

> Add self health monitoring in Mesos master
> --
>
> Key: MESOS-9767
> URL: https://issues.apache.org/jira/browse/MESOS-9767
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Affects Versions: 1.6.0
>Reporter: Gaurav Garg
>Priority: Major
> Fix For: 1.7.2
>
>
> We have seen an issue where the Mesos master got stuck and was not responding 
> to HTTP endpoints like "/metrics/snapshot". This causes calls to the master 
> from frameworks and the metrics collector to hang. Currently we emit a 
> 'master alive' metric via Prometheus. If the master hangs, this metric is not 
> published, and we detect the hang using alerts on top of it. By the time 
> someone has got the alert and restarted the master process, 15-30 minutes 
> have passed, resulting in SLA violations for Mesos cluster users.
> It would be nice to implement self health monitoring to detect whether the 
> Mesos master is hung/stuck. This would let us quickly crash the master 
> process so that another member of the quorum can acquire the ZK leadership 
> lock.
> We can use the "/master/health" endpoint for health checks. Health checks 
> can be initiated in 
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp] 
> just after the child master process is 
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the 
> 

[jira] [Comment Edited] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834089#comment-16834089
 ] 

Benjamin Mahler edited comment on MESOS-9767 at 5/6/19 6:54 PM:


Mesos master stopped responding to HTTP requests at around 16:30. At around 
17:00, the master was restarted. Logs are attached after the stack trace.

Logs of Mesos master around the same time:

{noformat}
I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.79 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669457 85889 master.cpp:8397] 

[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834118#comment-16834118
 ] 

Greg Mann commented on MESOS-9767:
--

[~ggarg] thanks for the info! Did the master respond to requests for the 
{{/master/health}} endpoint while it was stuck?

> Add self health monitoring in Mesos master
> --
>
> Key: MESOS-9767
> URL: https://issues.apache.org/jira/browse/MESOS-9767
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Affects Versions: 1.6.0
>Reporter: Gaurav Garg
>Priority: Major
> Fix For: 1.7.2
>
>
> We have seen an issue where the Mesos master got stuck and was not responding 
> to HTTP endpoints like "/metrics/snapshot". This causes calls to the master 
> from frameworks and the metrics collector to hang. Currently we emit a 
> 'master alive' metric via Prometheus. If the master hangs, this metric is not 
> published, and we detect the hang using alerts on top of it. By the time 
> someone has got the alert and restarted the master process, 15-30 minutes 
> have passed, resulting in SLA violations for Mesos cluster users.
> It would be nice to implement self health monitoring to detect whether the 
> Mesos master is hung/stuck. This would let us quickly crash the master 
> process so that another member of the quorum can acquire the ZK leadership 
> lock.
> We can use the "/master/health" endpoint for health checks. Health checks 
> can be initiated in 
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp] 
> just after the child master process is 
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the 
> [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp] 
> for this one. One downside is that HealthChecker currently takes a TaskId as 
> input, which is not valid for a master health check.
> We can add the following flags to control the self health checking:
>  # self_monitoring_enabled: Whether self monitoring is enabled.
>  # self_monitoring_consecutive_failures: After this many consecutive health 
> check failures, the master is crashed.
>  # self_monitoring_interval_secs: Interval at which health checks are 
> performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834088#comment-16834088
 ] 

Benjamin Mahler edited comment on MESOS-9767 at 5/6/19 6:50 PM:


Stack trace of the Mesos master when the hang was detected. Captured using gdb.

 
{noformat}
Thread 35 (Thread 0x7fa1e7e5b700 (LWP 85875)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x7fa1f14d6e82 in wait (this=)
    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4  process::ProcessManager::dequeue (this=0x561dcf063970)
    at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf0ae768)
    at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6  _M_invoke<> (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1700
#7  operator() (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1688
#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf0ae750) at /usr/include/c++/4.9/thread:115
#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e7e5b700) at 
pthread_create.c:309
#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 34 (Thread 0x7fa1e765a700 (LWP 85876)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x7fa1f14d6e82 in wait (this=)
    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4  process::ProcessManager::dequeue (this=0x561dcf063970)
    at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf11ff38)
    at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6  _M_invoke<> (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1700
#7  operator() (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1688
#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf11ff20) at /usr/include/c++/4.9/thread:115
#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e765a700) at 
pthread_create.c:309
#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 33 (Thread 0x7fa1e6e59700 (LWP 85877)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x7fa1f14d6e82 in wait (this=)
    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4  process::ProcessManager::dequeue (this=0x561dcf063970)
    at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf11d988)
    at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6  _M_invoke<> (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1700
#7  operator() (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1688
#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf11d970) at /usr/include/c++/4.9/thread:115
#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6e59700) at 
pthread_create.c:309
#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 32 (Thread 0x7fa1e6658700 (LWP 85878)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x7fa1f14d6e82 in wait (this=)
    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4  process::ProcessManager::dequeue (this=0x561dcf063970)
    at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf128758)
    at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6  _M_invoke<> (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1700
#7  operator() (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1688
#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf128740) at /usr/include/c++/4.9/thread:115
#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6658700) at 
pthread_create.c:309
#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 31 (Thread 0x7fa1e5e57700 (LWP 85879)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x7fa1f14d6e82 in wait (this=)
    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2  wait (this=) at 

[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Gaurav Garg (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834089#comment-16834089
 ] 

Gaurav Garg commented on MESOS-9767:


Mesos master stopped responding to HTTP requests at around 16:30. At around 
17:00, the master was restarted. Logs are attached after the stack trace.

Logs of Mesos master around the same time:

===

I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.79 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.669457 85889 master.cpp:8397] 

[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Gaurav Garg (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834088#comment-16834088
 ] 

Gaurav Garg commented on MESOS-9767:


Stack trace of the Mesos master when the hang was detected. Captured using gdb.

 

Thread 35 (Thread 0x7fa1e7e5b700 (LWP 85875)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

#1  0x7fa1f14d6e82 in wait (this=)

    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115

#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154

#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73

#4  process::ProcessManager::dequeue (this=0x561dcf063970)

    at /mesos/3rdparty/libprocess/src/process.cpp:3305

#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf0ae768)

    at /mesos/3rdparty/libprocess/src/process.cpp:2505

#6  _M_invoke<> (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1700

#7  operator() (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1688

#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf0ae750) at /usr/include/c++/4.9/thread:115

#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e7e5b700) at 
pthread_create.c:309

#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 34 (Thread 0x7fa1e765a700 (LWP 85876)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

#1  0x7fa1f14d6e82 in wait (this=)

    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115

#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154

#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73

#4  process::ProcessManager::dequeue (this=0x561dcf063970)

    at /mesos/3rdparty/libprocess/src/process.cpp:3305

#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf11ff38)

    at /mesos/3rdparty/libprocess/src/process.cpp:2505

#6  _M_invoke<> (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1700

#7  operator() (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1688

#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf11ff20) at /usr/include/c++/4.9/thread:115

#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e765a700) at 
pthread_create.c:309

#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 33 (Thread 0x7fa1e6e59700 (LWP 85877)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

#1  0x7fa1f14d6e82 in wait (this=)

    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115

#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154

#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73

#4  process::ProcessManager::dequeue (this=0x561dcf063970)

    at /mesos/3rdparty/libprocess/src/process.cpp:3305

#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf11d988)

    at /mesos/3rdparty/libprocess/src/process.cpp:2505

#6  _M_invoke<> (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1700

#7  operator() (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1688

#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf11d970) at /usr/include/c++/4.9/thread:115

#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6e59700) at 
pthread_create.c:309

#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 32 (Thread 0x7fa1e6658700 (LWP 85878)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

#1  0x7fa1f14d6e82 in wait (this=)

    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115

#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154

#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73

#4  process::ProcessManager::dequeue (this=0x561dcf063970)

    at /mesos/3rdparty/libprocess/src/process.cpp:3305

#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf128758)

    at /mesos/3rdparty/libprocess/src/process.cpp:2505

#6  _M_invoke<> (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1700

#7  operator() (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1688

#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf128740) at /usr/include/c++/4.9/thread:115

#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6658700) at 
pthread_create.c:309

#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 31 (Thread 0x7fa1e5e57700 (LWP 85879)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

#1  0x7fa1f14d6e82 in wait (this=)

    

[jira] [Commented] (MESOS-9766) /__processes__ endpoint can hang.

2019-05-06 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833733#comment-16833733
 ] 

Alexander Rukletsov commented on MESOS-9766:


Logging processing time: https://reviews.apache.org/r/70589/

> /__processes__ endpoint can hang.
> -
>
> Key: MESOS-9766
> URL: https://issues.apache.org/jira/browse/MESOS-9766
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> A user reported that the {{/\_\_processes\_\_}} endpoint occasionally hangs.
> Stack traces provided by [~alexr] revealed that all the threads appeared to 
> be idle waiting for events. After investigating the code, the issue was found 
> to be possible when a process gets terminated after the 
> {{/\_\_processes\_\_}} route handler dispatches to it, thus dropping the 
> dispatch and abandoning the future.
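
As a self-contained analogy in plain C++ (deliberately not the actual 
libprocess code): the reply depends on a dispatch that is silently dropped 
when the target process terminates, so the waiting side blocks forever.

{noformat}
// Plain-C++ analogy of the hang: a request is queued to a worker that
// terminates before serving it; the response is never produced, so the
// waiting side blocks indefinitely (like the hung HTTP request above).
#include <condition_variable>
#include <mutex>
#include <optional>
#include <string>
#include <thread>

struct Response {
  std::mutex m;
  std::condition_variable cv;
  std::optional<std::string> body;

  std::string wait() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [this] { return body.has_value(); });  // Hangs if dropped.
    return *body;
  }
};

int main() {
  Response response;

  // The "process" terminates without ever consuming the dispatch, so
  // response.body is never set -- the analogue of the dropped dispatch
  // and abandoned future described above.
  std::thread worker([] { /* terminated before serving the request */ });
  worker.join();

  response.wait();  // Blocks forever, like the /__processes__ request.
  return 0;
}
{noformat}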



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)