[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834089#comment-16834089
 ] 

Gaurav Garg commented on MESOS-9767:
------------------------------------

The Mesos master stopped responding to HTTP requests at around 16:30 and was 
restarted at around 17:00. Logs are attached after the stack trace.

Logs of Mesos master around the same time:

===============================================

I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.666679 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.669457 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 93859166-b4fc-4d91-85ef-ade71698c022-2-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.669682 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task d906b8c9-bc7b-4e74-ab82-d5fc968f1988-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.669889 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 96e729cc-2cb6-4ad2-ac4c-c620e85335c7-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.670122 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7302680c-a425-48c5-8242-dc3a7a6cd4d9-1-32345 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.670343 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task d4b52390-d49d-429c-a022-db21691b9344-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.670552 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.670775 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task e67aeebc-af12-475f-8d10-72c01038fa6a-3-3 of framework 
3dcc744f-016c-6579-9b82-6325424502d2-9999 'Unreachable agent re-reregistered'

I0429 16:26:45.682024 85887 replica.cpp:695] Replica received learned notice 
for position 514 from log-network(1)@10.184.9.45:5050

I0429 16:27:02.294006 85882 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:28:01.811879 85886 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:29:01.488752 85880 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:30:02.016119 85885 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:31:01.536274 85886 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:32:02.085350 85891 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:33:01.619495 85881 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:34:02.175596 85886 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:35:01.678314 85883 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:36:02.288586 85884 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:37:01.757455 85887 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:38:02.301651 85893 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:39:01.829080 85888 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:40:01.492192 85888 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:41:01.863240 85882 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:42:01.390673 85887 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:43:01.895854 85891 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:44:01.810654 85891 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:45:02.354336 85898 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:46:01.882808 85897 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:47:01.370826 85897 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:47:38.727550 85875 master_wonka_auth.cpp:315] refreshWithInterval is 
invoked

I0429 16:47:38.727607 85875 master_wonka_auth.cpp:254] Starting another refresh 
of certificate

I0429 16:47:38.727680 85875 master_wonka_auth.cpp:282] Refreshing certificate 
with command 'wonkacli --self mesos-master certificate --taskid signature 
--key-path /var/run/mesos-master/wonka_secret.keyEwCJDx --cert-path 
/var/run/mesos-master/wonka_certificate.jsonnxJCpi --x509-path 
/var/run/mesos-master/wonka_secret.pemwf2vb3^@'

I0429 16:47:38.949348 85884 master_wonka_auth.cpp:288] Certificate refresh 
successfully

I0429 16:47:38.950302 85896 master_wonka_auth.cpp:367] A new set of wonka 
cert/key is refreshed

I0429 16:47:38.950302 85896 master_wonka_auth.cpp:367] A new set of wonka 
cert/key is refreshed

I0429 16:48:01.941006 85893 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:49:01.516024 85891 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:50:02.129768 85890 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:51:01.494510 85894 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:52:02.025658 85893 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:53:01.560618 85881 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:54:01.427726 85879 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:55:02.016364 85883 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:56:02.149230 85888 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:57:01.649997 85893 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:58:02.283780 85881 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 16:59:01.839570 85875 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 17:00:01.479961 85898 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 17:01:02.008523 85878 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 17:02:01.553831 85886 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 17:06:15.964160 85884 group.cpp:452] Lost connection to ZooKeeper, 
attempting to reconnect ...

I0429 17:06:15.964236 85877 group.cpp:452] Lost connection to ZooKeeper, 
attempting to reconnect ...

I0429 17:06:15.964267 85875 group.cpp:452] Lost connection to ZooKeeper, 
attempting to reconnect ...

I0429 17:06:15.964455 85882 group.cpp:452] Lost connection to ZooKeeper, 
attempting to reconnect ...

I0429 17:06:15.965993 85894 process.cpp:2111] Failed to shutdown socket with fd 
129, address 10.184.9.45:5050: Transport endpoint is not connected

I0429 17:06:15.966585 85882 replica.cpp:497] Replica received implicit promise 
request from __req_res__(1)@10.184.36.29:5050 with proposal 7

I0429 17:06:15.971314 85882 replica.cpp:344] Persisted promised to 7

I0429 17:06:15.973600 85890 http.cpp:852] Authorizing principal 'ANY' to GET 
the endpoint '/metrics/snapshot'

I0429 17:06:19.301247 85894 group.cpp:511] ZooKeeper session expired

I0429 17:06:19.301331 85882 contender.cpp:217] Membership cancelled: 16

I0429 17:06:19.301440 85878 group.cpp:511] ZooKeeper session expired

I0429 17:06:19.301568 85883 network.hpp:436] ZooKeeper group memberships changed

I0429 17:06:19.301643 85883 network.hpp:484] ZooKeeper group PIDs: {  }

I0429 17:06:19.302465 85888 group.cpp:511] ZooKeeper session expired

I0429 17:06:19.302683 85880 detector.cpp:152] Detected a new leader: None

I0429 17:06:19.302994 85895 group.cpp:511] ZooKeeper session expired

I0429 17:06:19.303202 85897 log.cpp:258] Renewing replica group membership

I0429 17:06:19.305160 85876 group.cpp:341] Group process 
(zookeeper-group(2)@10.184.9.45:5050) connected to ZooKeeper

I0429 17:06:19.305202 85876 group.cpp:831] Syncing group operations: queue size 
(joins, cancels, datas) = (1, 0, 0)

I0429 17:06:19.305217 85876 group.cpp:419] Trying to create path 
'/mesos/log_replicas' in ZooKeeper

I0429 17:06:19.305312 85894 group.cpp:341] Group process 
(zookeeper-group(3)@10.184.9.45:5050) connected to ZooKeeper

 

> Add self health monitoring in Mesos master
> ------------------------------------------
>
>                 Key: MESOS-9767
>                 URL: https://issues.apache.org/jira/browse/MESOS-9767
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>    Affects Versions: 1.6.0
>            Reporter: Gaurav Garg
>            Priority: Major
>             Fix For: 1.7.2
>
>
> We have seen an issue where the Mesos master got stuck and stopped responding 
> to HTTP endpoints such as "/metrics/snapshot". This causes calls to the 
> master from frameworks and the metrics collector to hang. Currently we emit a 
> 'master alive' metric using Prometheus. If the master hangs, this metric is 
> not published, and we detect the hang through alerts on this metric. By the 
> time someone has received the alert and restarted the master process, 
> 15-30 minutes have passed. This results in SLA violations for Mesos 
> cluster users.
> It would be nice to implement self health monitoring to detect whether the 
> Mesos master is hung or stuck. This would let us quickly crash the master 
> process so that one of the other members of the quorum can acquire the ZK 
> leadership lock.
> We can use the "/master/health" endpoint for health checks. 
> Health checks can be initiated in src/master/main.cpp 
> (https://github.com/apache/mesos/blob/master/src/master/main.cpp) 
> just after the child master process is spawned 
> (https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543). 
> We can leverage the HealthChecker 
> (https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp) 
> for this. One downside is that HealthChecker currently takes a TaskId as 
> input, which is not valid for a master health check. 
> We can add the following flags to control the self health checking:
>  # self_monitoring_enabled: Whether self monitoring is enabled.
>  # self_monitoring_consecutive_failures: After this many consecutive health 
> check failures, the master is crashed.
>  # self_monitoring_interval_secs: Interval at which health checks are 
> performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
