[jira] [Commented] (MESOS-9771) Mask sensitive procfs paths.
[ https://issues.apache.org/jira/browse/MESOS-9771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834402#comment-16834402 ]

James Peach commented on MESOS-9771:
------------------------------------

Since {{/proc/keys}} gets masked, we should probably mask {{/proc/key-users}} too. Weird that I don't see other containerizers doing that.

My main concern with this change is compatibility with containerized services like CSI that may need privileged access to the host. Masking all these paths for this kind of service could break them. There are a few possible solutions:

1. Skip the masking based on properties of the launch, e.g. whether the Docker {{privileged}} flag is set, or whether the container is joining the host's PID namespace.
2. Add a flag that specifies the set of paths to mask, so that operators can override it with configuration.
3. Unconditionally do the masking.

If we go down the path of (2), then operators who need privileged containers to see this information will be stranded, so my preference would be something closer to (1). If we prefer (3), then we already unconditionally make certain container paths read-only, which could be regarded as precedent.

/cc [~jieyu] [~gilbert] [~jasonlai]

> Mask sensitive procfs paths.
> ----------------------------
>
>                 Key: MESOS-9771
>                 URL: https://issues.apache.org/jira/browse/MESOS-9771
>             Project: Mesos
>          Issue Type: Improvement
>          Components: containerization
>            Reporter: James Peach
>            Priority: Major
>
> We already have a set of procfs paths that we mark read-only in the
> containerizer, but there are additional paths that are considered sensitive
> by other containerizers and are masked altogether:
> {noformat}
> "/proc/asound"
> "/proc/acpi"
> "/proc/kcore"
> "/proc/keys"
> "/proc/latency_stats"
> "/proc/timer_list"
> "/proc/timer_stats"
> "/proc/sched_debug"
> "/sys/firmware"
> "/proc/scsi"
> {noformat}
> Masking is done by mounting {{/dev/null}} on files, and an empty, readonly
> {{tmpfs}} on directories.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9771) Mask sensitive procfs paths.
James Peach created MESOS-9771:
-----------------------------------

             Summary: Mask sensitive procfs paths.
                 Key: MESOS-9771
                 URL: https://issues.apache.org/jira/browse/MESOS-9771
             Project: Mesos
          Issue Type: Improvement
          Components: containerization
            Reporter: James Peach


We already have a set of procfs paths that we mark read-only in the containerizer, but there are additional paths that are considered sensitive by other containerizers and are masked altogether:

{noformat}
"/proc/asound"
"/proc/acpi"
"/proc/kcore"
"/proc/keys"
"/proc/latency_stats"
"/proc/timer_list"
"/proc/timer_stats"
"/proc/sched_debug"
"/sys/firmware"
"/proc/scsi"
{noformat}

Masking is done by mounting {{/dev/null}} on files, and an empty, readonly {{tmpfs}} on directories.
[jira] [Commented] (MESOS-9770) Add no-new-privileges isolator
[ https://issues.apache.org/jira/browse/MESOS-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834398#comment-16834398 ]

James Peach commented on MESOS-9770:
------------------------------------

/cc [~jieyu] [~gilbert] [~abudnik]

> Add no-new-privileges isolator
> ------------------------------
>
>                 Key: MESOS-9770
>                 URL: https://issues.apache.org/jira/browse/MESOS-9770
>             Project: Mesos
>          Issue Type: Improvement
>          Components: containerization
>            Reporter: James Peach
>            Priority: Major
>
> To give security-minded operators more defense in depth, add a {{linux/nnp}}
> isolator that sets the no-new-privileges bit before starting the executor.
[jira] [Created] (MESOS-9770) Add no-new-privileges isolator
James Peach created MESOS-9770:
-----------------------------------

             Summary: Add no-new-privileges isolator
                 Key: MESOS-9770
                 URL: https://issues.apache.org/jira/browse/MESOS-9770
             Project: Mesos
          Issue Type: Improvement
          Components: containerization
            Reporter: James Peach


To give security-minded operators more defense in depth, add a {{linux/nnp}} isolator that sets the no-new-privileges bit before starting the executor.
[jira] [Created] (MESOS-9769) Add direct containerized support for filesystem operations
James Peach created MESOS-9769:
-----------------------------------

             Summary: Add direct containerized support for filesystem operations
                 Key: MESOS-9769
                 URL: https://issues.apache.org/jira/browse/MESOS-9769
             Project: Mesos
          Issue Type: Improvement
          Components: containerization
            Reporter: James Peach


When setting up the container filesystems, we use `pre_exec_commands` to make ABI symlinks and other things. The problem with this is that, depending on the order of operations, we may not have the full security policy in place yet, and since we are running in the context of the container's mount namespaces, the programs we execute are under the control of whoever built the container image.

[~jieyu] and I previously discussed adding filesystem operations to the `ContainerLaunchInfo`. Just `ln` would be sufficient for the `cgroups` and `linux/filesystem` isolators. The secrets and port mapping isolators need more, so we should discuss and file new tickets if necessary.
[jira] [Comment Edited] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag
[ https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834357#comment-16834357 ]

James Peach edited comment on MESOS-9768 at 5/7/19 3:56 AM:
------------------------------------------------------------

/cc [~jieyu] [~gilbert]

was (Author: jamespeach):
/cc [~jieyu] @gilbert

> Allow operators to mount the container rootfs with the `nosuid` flag
> --------------------------------------------------------------------
>
>                 Key: MESOS-9768
>                 URL: https://issues.apache.org/jira/browse/MESOS-9768
>             Project: Mesos
>          Issue Type: Improvement
>          Components: containerization
>            Reporter: James Peach
>            Priority: Major
>
> If cluster users are allowed to launch containers with arbitrary images,
> those images may contain setuid programs. For security reasons (auditing,
> privilege escalation), operators may wish to ensure that setuid programs
> cannot be used within a container.
>
> We should provide a way for operators to specify that container
> volumes (including `/`) should be mounted with the `nosuid` flag.
[jira] [Commented] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag
[ https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834357#comment-16834357 ]

James Peach commented on MESOS-9768:
------------------------------------

/cc [~jieyu] @gilbert

> Allow operators to mount the container rootfs with the `nosuid` flag
> --------------------------------------------------------------------
>
>                 Key: MESOS-9768
>                 URL: https://issues.apache.org/jira/browse/MESOS-9768
>             Project: Mesos
>          Issue Type: Improvement
>          Components: containerization
>            Reporter: James Peach
>            Priority: Major
>
> If cluster users are allowed to launch containers with arbitrary images,
> those images may contain setuid programs. For security reasons (auditing,
> privilege escalation), operators may wish to ensure that setuid programs
> cannot be used within a container.
>
> We should provide a way for operators to specify that container
> volumes (including `/`) should be mounted with the `nosuid` flag.
[jira] [Created] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag
James Peach created MESOS-9768:
-----------------------------------

             Summary: Allow operators to mount the container rootfs with the `nosuid` flag
                 Key: MESOS-9768
                 URL: https://issues.apache.org/jira/browse/MESOS-9768
             Project: Mesos
          Issue Type: Improvement
          Components: containerization
            Reporter: James Peach


If cluster users are allowed to launch containers with arbitrary images, those images may contain setuid programs. For security reasons (auditing, privilege escalation), operators may wish to ensure that setuid programs cannot be used within a container.

We should provide a way for operators to specify that container volumes (including `/`) should be mounted with the `nosuid` flag.
[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834156#comment-16834156 ]

Gaurav Garg commented on MESOS-9767:
------------------------------------

I only queried the /metrics/snapshot endpoint when the master was stuck. Next time I will query the /master/health endpoint as well. Unfortunately, I captured the stack trace only once. My bad. Next time I will capture multiple stack traces.

> Add self health monitoring in Mesos master
> ------------------------------------------
>
>                 Key: MESOS-9767
>                 URL: https://issues.apache.org/jira/browse/MESOS-9767
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>    Affects Versions: 1.6.0
>            Reporter: Gaurav Garg
>            Priority: Major
>
> We have seen issues where the Mesos master got stuck and was not responding to
> HTTP endpoints like "/metrics/snapshot". This causes calls to the master by
> the frameworks and the metrics collector to hang. Currently we emit a
> 'master alive' metric using Prometheus. If the master hangs, this metric is not
> published and we detect the hang using alerts on top of this metric. By the
> time someone has received the alert and restarted the master process,
> 15-30 minutes may have passed. This results in SLA violations for Mesos
> cluster users.
> It would be nice to implement self health monitoring to detect whether the
> Mesos master is hung/stuck. This would let us quickly crash the master
> process so that another member of the quorum can acquire the ZK
> leadership lock.
> We can use the "/master/health" endpoint for health checks.
> Health checks can be initiated in
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
> just after the child master process is
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the
> [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]
> for this. One downside is that HealthChecker currently takes a TaskId as
> input, which is not valid for a master health check.
> We can add the following flags to control the self health checking:
> # self_monitoring_enabled: Whether self monitoring is enabled.
> # self_monitoring_consecutive_failures: After this many consecutive health
> failures, the master is crashed.
> # self_monitoring_interval_secs: Interval at which health checks are
> performed.
[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834119#comment-16834119 ] Benjamin Mahler commented on MESOS-9767: The bizarre thread stack is: {noformat} Thread 21 (Thread 0x7fa1e0e4d700 (LWP 85889)): #0 0x7fa1f05f01c2 in hash_combine_impl (k=52, h=) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:264 #1 hash_combine (v=, seed=) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337 #2 hash_range<__gnu_cxx::__normal_iterator > > (last=..., first=52 '4') at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:351 #3 hash_value > (v=...) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:410 #4 operator() (this=, v=...) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:486 #5 boost::hash_combine (seed=seed@entry=@0x7fa1e0e4c770: 0, v=...) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337 #6 0x7fa1f06ad178 in operator() (this=0x7fa1cc02d068, taskId=...) at /mesos/include/mesos/type_utils.hpp:634 #7 _M_hash_code (this=0x7fa1cc02d068, __k=...) at /usr/include/c++/4.9/bits/hashtable_policy.h:1261 #8 std::_Hashtable > > >, std::allocator > > > >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::count (this=this@entry=0x7fa1cc02d068, __k=...) at /usr/include/c++/4.9/bits/hashtable.h:1336 #9 0x7fa1f0663eb2 in count (__x=..., this=0x7fa1cc02d068) at /usr/include/c++/4.9/bits/unordered_map.h:592 #10 contains (key=..., this=0x7fa1cc02d068) at /mesos/3rdparty/stout/include/stout/hashmap.hpp:88 #11 erase (key=..., this=0x7fa1cc02d050) at /mesos/3rdparty/stout/include/stout/boundedhashmap.hpp:92 #12 mesos::internal::master::Master::__reregisterSlave(process::UPID const&, mesos::internal::ReregisterSlaveMessage&&, process::Future const&) (this=0x561dcf047380, pid=..., reregisterSlaveMessage=, future=...) 
at /mesos/src/master/master.cpp:7369 #13 0x7fa1f14d54e1 in operator() (args#0=0x561dcf048620, this=) at /mesos/3rdparty/libprocess/../stout/include/stout/lambda.hpp:443 #14 process::ProcessBase::consume(process::DispatchEvent&&) (this=, event=) at /mesos/3rdparty/libprocess/src/process.cpp:3577 #15 0x7fa1f14e89b2 in serve ( event=, this=0x561dcf048620) at /mesos/3rdparty/libprocess/include/process/process.hpp:87 #16 process::ProcessManager::resume (this=, process=0x561dcf048620) at /mesos/3rdparty/libprocess/src/process.cpp:3002 #17 0x7fa1f14ee226 in operator() (__closure=0x561dcf119158) at /mesos/3rdparty/libprocess/src/process.cpp:2511 #18 _M_invoke<> (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1700 #19 operator() (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1688 #20 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf119140) at /usr/include/c++/4.9/thread:115 #21 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #22 0x7fa1ee520064 in start_thread (arg=0x7fa1e0e4d700) at pthread_create.c:309 #23 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {noformat} [~ggarg] is this trace present whenever it's hanging? > Add self health monitoring in Mesos master > -- > > Key: MESOS-9767 > URL: https://issues.apache.org/jira/browse/MESOS-9767 > Project: Mesos > Issue Type: Task > Components: master >Affects Versions: 1.6.0 >Reporter: Gaurav Garg >Priority: Major > Fix For: 1.7.2 > > > We have seen issue where Mesos master got stuck and was not responding to > HTTP endpoints like "/metrics/snapshot". This results in calls by the > frameworks and metrics collector to the master to hang. Currently we emit > 'master alive' metric using prometheus. If master hangs, this metrics is not > published and we detect the hangs using alerts on top of this metrics. By the > time someone would have got the alert and restarted the master process, > 15-30mins would have passed by. 
This results in SLA violations for Mesos
> cluster users.
> It would be nice to implement self health monitoring to detect whether the
> Mesos master is hung/stuck. This would let us quickly crash the master
> process so that another member of the quorum can acquire the ZK
> leadership lock.
> We can use the "/master/health" endpoint for health checks.
> Health checks can be initiated in
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
> just after the child master process is
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the >
[jira] [Comment Edited] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834089#comment-16834089 ] Benjamin Mahler edited comment on MESOS-9767 at 5/6/19 6:54 PM: Mesos master stopped responding to HTTP request at around 16:30PM. At around 17:00PM, master was restarted. Logs are attached after the stack trace. Logs of Mesos master around the same time: {noformat} I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update 
TASK_FAILED for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.79 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED for task 
d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669457 85889 master.cpp:8397]
[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834118#comment-16834118 ]

Greg Mann commented on MESOS-9767:
----------------------------------

[~ggarg] thanks for the info! Did the master respond to requests for the {{/master/health}} endpoint while it was stuck?

> Add self health monitoring in Mesos master
> ------------------------------------------
>
>                 Key: MESOS-9767
>                 URL: https://issues.apache.org/jira/browse/MESOS-9767
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>    Affects Versions: 1.6.0
>            Reporter: Gaurav Garg
>            Priority: Major
>             Fix For: 1.7.2
>
> We have seen issues where the Mesos master got stuck and was not responding to
> HTTP endpoints like "/metrics/snapshot". This causes calls to the master by
> the frameworks and the metrics collector to hang. Currently we emit a
> 'master alive' metric using Prometheus. If the master hangs, this metric is not
> published and we detect the hang using alerts on top of this metric. By the
> time someone has received the alert and restarted the master process,
> 15-30 minutes may have passed. This results in SLA violations for Mesos
> cluster users.
> It would be nice to implement self health monitoring to detect whether the
> Mesos master is hung/stuck. This would let us quickly crash the master
> process so that another member of the quorum can acquire the ZK
> leadership lock.
> We can use the "/master/health" endpoint for health checks.
> Health checks can be initiated in
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
> just after the child master process is
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the
> [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]
> for this. One downside is that HealthChecker currently takes a TaskId as
> input, which is not valid for a master health check.
> We can add the following flags to control the self health checking:
> # self_monitoring_enabled: Whether self monitoring is enabled.
> # self_monitoring_consecutive_failures: After this many consecutive health
> failures, the master is crashed.
> # self_monitoring_interval_secs: Interval at which health checks are
> performed.
[jira] [Comment Edited] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834088#comment-16834088 ] Benjamin Mahler edited comment on MESOS-9767 at 5/6/19 6:50 PM: Stack trace of the Mesos master when the hang was detected. Captured using gdb. {noformat} Thread 35 (Thread 0x7fa1e7e5b700 (LWP 85875)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf0ae768) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf0ae750) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e7e5b700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 34 (Thread 0x7fa1e765a700 (LWP 85876)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11ff38) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11ff20) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e765a700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 33 (Thread 0x7fa1e6e59700 (LWP 85877)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11d988) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11d970) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6e59700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 32 (Thread 0x7fa1e6658700 (LWP 85878)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf128758) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf128740) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6658700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 31 (Thread 0x7fa1e5e57700 (LWP 85879)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at
[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834089#comment-16834089 ] Gaurav Garg commented on MESOS-9767: Mesos master stopped responding to HTTP request at around 16:30PM. At around 17:00PM, master was restarted. Logs are attached after the stack trace. Logs of Mesos master around the same time: === I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update TASK_FAILED for task 
a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.79 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED for task d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of 
framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669457 85889 master.cpp:8397]
[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834088#comment-16834088 ] Gaurav Garg commented on MESOS-9767:
Stack trace of the Mesos master when the hang was detected, captured using gdb. (The idle worker threads below are all parked in sem_wait, waiting to dequeue from the libprocess run queue; angle-bracketed gdb values stripped by the mail archive are restored as <optimized out>, and elided template arguments are marked <...>.)
Thread 35 (Thread 0x7fa1e7e5b700 (LWP 85875)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf0ae768) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl<...>::_M_run(void) (this=0x561dcf0ae750) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e7e5b700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Thread 34 (Thread 0x7fa1e765a700 (LWP 85876)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11ff38) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl<...>::_M_run(void) (this=0x561dcf11ff20) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e765a700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Thread 33 (Thread 0x7fa1e6e59700 (LWP 85877)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11d988) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl<...>::_M_run(void) (this=0x561dcf11d970) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6e59700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Thread 32 (Thread 0x7fa1e6658700 (LWP 85878)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=<optimized out>) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf128758) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl<...>::_M_run(void) (this=0x561dcf128740) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6658700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Thread 31 (Thread 0x7fa1e5e57700 (LWP 85879)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=<optimized out>)
[jira] [Commented] (MESOS-9766) /__processes__ endpoint can hang.
[ https://issues.apache.org/jira/browse/MESOS-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833733#comment-16833733 ] Alexander Rukletsov commented on MESOS-9766:
Logging processing time: https://reviews.apache.org/r/70589/
> /__processes__ endpoint can hang.
> -
>
> Key: MESOS-9766
> URL: https://issues.apache.org/jira/browse/MESOS-9766
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Priority: Major
> Labels: foundations
>
> A user reported that the {{/\_\_processes\_\_}} endpoint occasionally hangs.
> Stack traces provided by [~alexr] revealed that all the threads appeared to
> be idle waiting for events. After investigating the code, the issue was found
> to be possible when a process gets terminated after the
> {{/\_\_processes\_\_}} route handler dispatches to it, thus dropping the
> dispatch and abandoning the future.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)