[ 
https://issues.apache.org/jira/browse/MESOS-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195415#comment-17195415
 ] 

Andrei Sekretenko commented on MESOS-10188:
-------------------------------------------

Thank you for the logs! 

The check that made the master crash has indeed been introduced in 1.10.x by 
the fix for https://issues.apache.org/jira/browse/MESOS-10128.

Note that, in addition to agent 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732 being 
re-added, a TEARDOWN call for framework fec2ca2a-a1df-44ea-accc-db73eae96e63 
was issued right before the crash.
From looking at the possible call sites and at the preceding log, my 
impression is that we have a double untracking of resources somewhere in the 
task/framework removal path (or maybe in the case when framework removal 
overlaps with some agent-related changes).
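
To make the suspected failure mode concrete, here is a toy Python model of a 
per-agent tracker whose untrack step checks that the agent is still present. 
This is not Mesos code: ScalarTracker, track, untrack and double_untrack_demo 
are hypothetical names, and which removal paths actually overlap is still an 
assumption.

```python
class ScalarTracker:
    """Toy per-agent tracker; hypothetical names, not the Mesos allocator API."""

    def __init__(self):
        # agent id -> tracked scalar quantities, e.g. {"cpus": 4.0}
        self.scalars = {}

    def track(self, agent_id, quantities):
        totals = self.scalars.setdefault(agent_id, {})
        for name, value in quantities.items():
            totals[name] = totals.get(name, 0.0) + value

    def untrack(self, agent_id, quantities):
        # Analogue of the failed check: the agent must still be tracked.
        if agent_id not in self.scalars:
            raise RuntimeError(
                f"Check failed: scalars does not contain {agent_id}")
        totals = self.scalars[agent_id]
        for name, value in quantities.items():
            totals[name] = totals.get(name, 0.0) - value
            if totals[name] <= 0.0:
                del totals[name]
        if not totals:
            # Nothing left for this agent, so its entry is dropped.
            del self.scalars[agent_id]


def double_untrack_demo():
    """Untrack the same resources twice and return the resulting error text."""
    tracker = ScalarTracker()
    agent = "8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732"
    tracker.track(agent, {"cpus": 2.0})
    tracker.untrack(agent, {"cpus": 2.0})  # first removal path (assumed)
    try:
        tracker.untrack(agent, {"cpus": 2.0})  # second, overlapping path (assumed)
    except RuntimeError as error:
        return str(error)
    return None
```

In this model the first untrack removes the agent's entry entirely, so a second 
untrack for the same agent fails the presence check instead of silently 
corrupting the totals.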

I have yet to figure out where exactly this happens, and whether this is 
causing silent issues in older Mesos versions or not.

[~Jerome Soussens]
Do you have the stack trace of the crash? 
That could greatly simplify things, as there are about five different paths 
through the allocator that might have triggered that check.

Normally, the Mesos master prints the stack trace (via the signal handler in 
the glog library) to its stderr without any logging prefixes. 
Also, depending on your configuration, the crash might have left a core dump; 
in that case, the stack trace should also be extractable from it.
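
If the stderr output is large, a small filter can help locate the unprefixed 
lines. The Python sketch below assumes that regular log lines start with a 
prefix shaped like "F0911 06:43:25.109040 46181 hierarchical.cpp:232]" (as in 
the log lines quoted in this issue); GLOG_PREFIX and unprefixed_lines are 
hypothetical names, not part of Mesos or glog.

```python
import re

# Prefix of a regular log line, modeled on
# "F0911 06:43:25.109040 46181 hierarchical.cpp:232] ...".
GLOG_PREFIX = re.compile(
    r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d{6}\s+\d+ \S+:\d+\] ")


def unprefixed_lines(stderr_text):
    """Return the non-empty lines that do not start with the log prefix."""
    return [
        line
        for line in stderr_text.splitlines()
        if line.strip() and not GLOG_PREFIX.match(line)
    ]
```

The result is only a list of candidates: continuation lines of multi-line log 
messages also lack the prefix, so the output still needs a manual look.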

> Master check failure : scalars does not contain agent
> -----------------------------------------------------
>
>                 Key: MESOS-10188
>                 URL: https://issues.apache.org/jira/browse/MESOS-10188
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Jerome Soussens
>            Priority: Critical
>         Attachments: image-2020-09-14-10-07-42-622.png, 
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.ERROR.20200911-064325.46082-20200912.gz,
>  
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.FATAL.20200911-064325.46082-20200912.gz,
>  
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.INFO.20200910-200737.46082-20200911.gz,
>  
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.WARNING.20200830-004426.46082-20200911.gz
>
>
> Mesos master restarted with the error message :
> {code:java}
> F0911 06:43:25.109040 46181 hierarchical.cpp:232] Check failed: scalars does 
> not contain 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732{code}
> See attached log files.
> FYI, Agent S732 had a network outage between 06:40 and 06:44:
> !image-2020-09-14-10-07-42-622.png|width=1545,height=435!
>  
> At the end of the outage, Mesos master has the following logs:
> {code:java}
> I0911 06:43:20.392347 46184 master.cpp:6513] Received reregister agent 
> message from agent 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732 at 
> slave(1)@172.17.50.35:5051 (dev-eu-w-03-sgma-1)
> W0911 06:43:20.421454 46191 master.cpp:10618] Possibly orphaned completed 
> task b92038e7-b42c-4e23-ae55-9be4325a4d32 of framework 
> d65e2494-c7c5-456b-aad6-fc44cadf2f50 that ran on agent 
> 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732 at slave(1)@172.17.50.35:5051 
> (dev-eu-w-03-sgma-1)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
