[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834089#comment-16834089 ]
Gaurav Garg commented on MESOS-9767:
------------------------------------

The Mesos master stopped responding to HTTP requests at around 16:30. At around 17:00, the master was restarted. Logs are attached after the stack trace.

Logs of Mesos master around the same time:
===============================================
I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update TASK_FAILED for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.666679 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED for task d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.669457 85889 master.cpp:8397] Sending status update TASK_FAILED for task 93859166-b4fc-4d91-85ef-ade71698c022-2-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.669682 85889 master.cpp:8397] Sending status update TASK_FAILED for task d906b8c9-bc7b-4e74-ab82-d5fc968f1988-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.669889 85889 master.cpp:8397] Sending status update TASK_FAILED for task 96e729cc-2cb6-4ad2-ac4c-c620e85335c7-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.670122 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7302680c-a425-48c5-8242-dc3a7a6cd4d9-1-32345 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.670343 85889 master.cpp:8397] Sending status update TASK_FAILED for task d4b52390-d49d-429c-a022-db21691b9344-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.670552 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.670775 85889 master.cpp:8397] Sending status update TASK_FAILED for task e67aeebc-af12-475f-8d10-72c01038fa6a-3-3 of framework 3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'
I0429 16:26:45.682024 85887 replica.cpp:695] Replica received learned notice for position 514 from log-network(1)@10.184.9.45:5050
I0429 16:27:02.294006 85882 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:28:01.811879 85886 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:29:01.488752 85880 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:30:02.016119 85885 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:31:01.536274 85886 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:32:02.085350 85891 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:33:01.619495 85881 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:34:02.175596 85886 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:35:01.678314 85883 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:36:02.288586 85884 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:37:01.757455 85887 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:38:02.301651 85893 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:39:01.829080 85888 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:40:01.492192 85888 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:41:01.863240 85882 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:42:01.390673 85887 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:43:01.895854 85891 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:44:01.810654 85891 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:45:02.354336 85898 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:46:01.882808 85897 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:47:01.370826 85897 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:47:38.727550 85875 master_wonka_auth.cpp:315] refreshWithInterval is invoked
I0429 16:47:38.727607 85875 master_wonka_auth.cpp:254] Starting another refresh of certificate
I0429 16:47:38.727680 85875 master_wonka_auth.cpp:282] Refreshing certificate with command 'wonkacli --self mesos-master certificate --taskid signature --key-path /var/run/mesos-master/wonka_secret.keyEwCJDx --cert-path /var/run/mesos-master/wonka_certificate.jsonnxJCpi --x509-path /var/run/mesos-master/wonka_secret.pemwf2vb3^@'
I0429 16:47:38.949348 85884 master_wonka_auth.cpp:288] Certificate refresh successfully
I0429 16:47:38.950302 85896 master_wonka_auth.cpp:367] A new set of wonka cert/key is refreshed
I0429 16:47:38.950302 85896 master_wonka_auth.cpp:367] A new set of wonka cert/key is refreshed
I0429 16:48:01.941006 85893 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:49:01.516024 85891 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:50:02.129768 85890 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:51:01.494510 85894 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:52:02.025658 85893 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:53:01.560618 85881 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:54:01.427726 85879 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:55:02.016364 85883 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:56:02.149230 85888 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:57:01.649997 85893 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:58:02.283780 85881 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 16:59:01.839570 85875 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 17:00:01.479961 85898 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 17:01:02.008523 85878 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 17:02:01.553831 85886 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 17:06:15.964160 85884 group.cpp:452] Lost connection to ZooKeeper, attempting to reconnect ...
I0429 17:06:15.964236 85877 group.cpp:452] Lost connection to ZooKeeper, attempting to reconnect ...
I0429 17:06:15.964267 85875 group.cpp:452] Lost connection to ZooKeeper, attempting to reconnect ...
I0429 17:06:15.964455 85882 group.cpp:452] Lost connection to ZooKeeper, attempting to reconnect ...
I0429 17:06:15.965993 85894 process.cpp:2111] Failed to shutdown socket with fd 129, address 10.184.9.45:5050: Transport endpoint is not connected
I0429 17:06:15.966585 85882 replica.cpp:497] Replica received implicit promise request from __req_res__(1)@10.184.36.29:5050 with proposal 7
I0429 17:06:15.971314 85882 replica.cpp:344] Persisted promised to 7
I0429 17:06:15.973600 85890 http.cpp:852] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
I0429 17:06:19.301247 85894 group.cpp:511] ZooKeeper session expired
I0429 17:06:19.301331 85882 contender.cpp:217] Membership cancelled: 16
I0429 17:06:19.301440 85878 group.cpp:511] ZooKeeper session expired
I0429 17:06:19.301568 85883 network.hpp:436] ZooKeeper group memberships changed
I0429 17:06:19.301643 85883 network.hpp:484] ZooKeeper group PIDs: { }
I0429 17:06:19.302465 85888 group.cpp:511] ZooKeeper session expired
I0429 17:06:19.302683 85880 detector.cpp:152] Detected a new leader: None
I0429 17:06:19.302994 85895 group.cpp:511] ZooKeeper session expired
I0429 17:06:19.303202 85897 log.cpp:258] Renewing replica group membership
I0429 17:06:19.305160 85876 group.cpp:341] Group process (zookeeper-group(2)@10.184.9.45:5050) connected to ZooKeeper
I0429 17:06:19.305202 85876 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0429 17:06:19.305217 85876 group.cpp:419] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0429 17:06:19.305312 85894 group.cpp:341] Group process (zookeeper-group(3)@10.184.9.45:5050) connected to ZooKeeper

> Add self health monitoring in Mesos master
> ------------------------------------------
>
> Key: MESOS-9767
> URL: https://issues.apache.org/jira/browse/MESOS-9767
> Project: Mesos
> Issue Type: Task
> Components: master
> Affects Versions: 1.6.0
> Reporter: Gaurav Garg
> Priority: Major
> Fix For: 1.7.2
>
>
> We have seen an issue where the Mesos master got stuck and stopped responding
> to HTTP endpoints like "/metrics/snapshot". This causes calls to the master
> from frameworks and the metrics collector to hang. Currently we emit a
> 'master alive' metric using Prometheus. If the master hangs, this metric is
> not published, and we detect the hang using alerts on top of this metric. By
> the time someone gets the alert and restarts the master process, 15-30
> minutes have passed. This results in SLA violations for Mesos cluster users.
> It would be nice to implement self health check monitoring to detect whether
> the Mesos master is hung or stuck. This would let us quickly crash the master
> process so that one of the other members of the quorum can acquire the ZK
> leadership lock.
> We can use the "/master/health" endpoint for health checks.
> Health checks can be initiated in
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
> just after the child master process is
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the
> [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]
> for this. One downside is that HealthChecker currently takes a TaskId as
> an input, which is not valid for a master health check.
> We can add the following flags to control the self health checking:
> # self_monitoring_enabled: Whether self monitoring is enabled.
> # self_monitoring_consecutive_failures: After this many consecutive health
> check failures, the master is crashed.
> # self_monitoring_interval_secs: Interval at which health checks are
> performed.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
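
The three flags proposed in the issue description could drive a watchdog along the following lines. This is only a minimal, language-neutral sketch written in Python; the master itself is C++, and nothing here is Mesos code. The class names `SelfMonitorFlags` and `SelfMonitor`, the helper `http_probe`, the default values, and the per-request timeout are all assumptions made for illustration. Only the flag names and the "/master/health" endpoint come from the issue.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class SelfMonitorFlags:
    # Field names mirror the flags proposed in the issue.
    # The default values are assumptions, not taken from the issue.
    self_monitoring_enabled: bool = False
    self_monitoring_consecutive_failures: int = 3
    self_monitoring_interval_secs: float = 10.0


def http_probe(url: str, timeout_secs: float) -> Callable[[], bool]:
    """Hypothetical helper: build a probe that GETs `url` and expects HTTP 200.

    The timeout is an assumption; without one, a probe against a hung
    master would itself hang and never count as a failure.
    """
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=timeout_secs) as response:
            return response.status == 200
    return probe


class SelfMonitor:
    """Counts consecutive failed probes and fires a callback at the threshold."""

    def __init__(self, flags: SelfMonitorFlags,
                 probe: Callable[[], bool],
                 on_unhealthy: Callable[[], None]) -> None:
        self.flags = flags
        self.probe = probe
        self.on_unhealthy = on_unhealthy
        self.failures = 0

    def check_once(self) -> bool:
        """Run one probe; return True if on_unhealthy was triggered."""
        if not self.flags.self_monitoring_enabled:
            return False
        try:
            healthy = self.probe()
        except Exception:
            # Timeouts, refused connections and HTTP errors all count as failures.
            healthy = False
        if healthy:
            self.failures = 0  # Any success resets the streak.
            return False
        self.failures += 1
        if self.failures >= self.flags.self_monitoring_consecutive_failures:
            self.on_unhealthy()
            return True
        return False

    def run(self, sleep: Callable[[float], None] = time.sleep) -> None:
        """Probe every interval until the threshold is reached or monitoring is off."""
        while self.flags.self_monitoring_enabled:
            if self.check_once():
                return
            sleep(self.flags.self_monitoring_interval_secs)
```

A caller might pass `http_probe("http://127.0.0.1:5050/master/health", 5.0)` as the probe and `os.abort` as `on_unhealthy`; the loopback address and the 5-second timeout are assumptions, while port 5050 matches the log above. Whatever form the real implementation takes, the probe presumably has to run on a thread that cannot be blocked by the same hang it is trying to detect.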