[jira] [Commented] (MESOS-10188) Master check failure : scalars does not contain agent

2020-10-28 Thread Andrei Sekretenko (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222460#comment-17222460
 ] 

Andrei Sekretenko commented on MESOS-10188:
---

Note: this *probably* is a duplicate of MESOS-10194.

> Master check failure : scalars does not contain agent
> -
>
> Key: MESOS-10188
> URL: https://issues.apache.org/jira/browse/MESOS-10188
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.10.0
> Reporter: Jerome Soussens
> Priority: Major
> Attachments: image-2020-09-14-10-07-42-622.png,
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.ERROR.20200911-064325.46082-20200912.gz,
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.FATAL.20200911-064325.46082-20200912.gz,
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.INFO.20200910-200737.46082-20200911.gz,
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.WARNING.20200830-004426.46082-20200911.gz
>
>
> Mesos master restarted with the error message :
> {code:java}
> F0911 06:43:25.109040 46181 hierarchical.cpp:232] Check failed: scalars does 
> not contain 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732{code}
> See attached log files.
> FYI, agent S732 had a network outage between 06:40 and 06:44:
> !image-2020-09-14-10-07-42-622.png|width=1545,height=435!
>  
> At the end of the outage, the Mesos master logged the following:
> {code:java}
> I0911 06:43:20.392347 46184 master.cpp:6513] Received reregister agent 
> message from agent 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732 at 
> slave(1)@172.17.50.35:5051 (dev-eu-w-03-sgma-1)
> W0911 06:43:20.421454 46191 master.cpp:10618] Possibly orphaned completed 
> task b92038e7-b42c-4e23-ae55-9be4325a4d32 of framework 
> d65e2494-c7c5-456b-aad6-fc44cadf2f50 that ran on agent 
> 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732 at slave(1)@172.17.50.35:5051 
> (dev-eu-w-03-sgma-1)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10188) Master check failure : scalars does not contain agent

2020-09-29 Thread Andrei Sekretenko (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203898#comment-17203898
 ] 

Andrei Sekretenko commented on MESOS-10188:
---

Given that we don't see this bug triggered in other environments and that I 
wasn't able to reproduce this in simple tests, I've lowered the priority to 
"Major".

[~Jerome Soussens] Can you please update this issue if you find something or 
manage to reproduce this again? A stacktrace of this crash would be *very* 
helpful.



[jira] [Commented] (MESOS-10188) Master check failure : scalars does not contain agent

2020-09-15 Thread Andrei Sekretenko (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196148#comment-17196148
 ] 

Andrei Sekretenko commented on MESOS-10188:
---

[~Jerome Soussens] Are you sure there is no stack trace above that? In my 
experience, crash stacks usually get intermixed with log lines, which are also 
written into stdout.

At this point, this does not look like something we introduced in 1.10 
(although in pre-1.10 versions a crash would have been impossible because the 
check was absent; most likely Mesos would have ended up mis-accounting 
resources somewhere).
My current suspicion is that there is some kind of race between the TEARDOWN 
call and terminal status transitions of tasks/executors in the master...



> Master check failure : scalars does not contain agent
> -
>
> Key: MESOS-10188
> URL: https://issues.apache.org/jira/browse/MESOS-10188
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.10.0
> Reporter: Jerome Soussens
> Priority: Critical
> Attachments: image-2020-09-14-10-07-42-622.png,
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.ERROR.20200911-064325.46082-20200912.gz,
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.FATAL.20200911-064325.46082-20200912.gz,
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.INFO.20200910-200737.46082-20200911.gz,
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.WARNING.20200830-004426.46082-20200911.gz
>
>
> Mesos master restarted with the error message :
> {code:java}
> F0911 06:43:25.109040 46181 hierarchical.cpp:232] Check failed: scalars does 
> not contain 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732{code}
> See attached log files.
> FYI, agent S732 had a network outage between 06:40 and 06:44:
> !image-2020-09-14-10-07-42-622.png|width=1545,height=435!
>  
> At the end of the outage, the Mesos master logged the following:
> {code:java}
> I0911 06:43:20.392347 46184 master.cpp:6513] Received reregister agent 
> message from agent 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732 at 
> slave(1)@172.17.50.35:5051 (dev-eu-w-03-sgma-1)
> W0911 06:43:20.421454 46191 master.cpp:10618] Possibly orphaned completed 
> task b92038e7-b42c-4e23-ae55-9be4325a4d32 of framework 
> d65e2494-c7c5-456b-aad6-fc44cadf2f50 that ran on agent 
> 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732 at slave(1)@172.17.50.35:5051 
> (dev-eu-w-03-sgma-1)
> {code}





[jira] [Commented] (MESOS-10188) Master check failure : scalars does not contain agent

2020-09-14 Thread Jerome Soussens (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195456#comment-17195456
 ] 

Jerome Soussens commented on MESOS-10188:
-

Hi,

There is no stack trace present; here is what I see in the system logs:
{code:java}
Sep 11 06:43:25 dev-eu-w-01-sgmm-0-0 systemd: mesos-master.service: main 
process exited, code=killed, status=6/ABRT
Sep 11 06:43:25 dev-eu-w-01-sgmm-0-0 systemd: Unit mesos-master.service entered 
failed state.
Sep 11 06:43:25 dev-eu-w-01-sgmm-0-0 systemd: mesos-master.service failed.
Sep 11 06:43:45 dev-eu-w-01-sgmm-0-0 systemd: mesos-master.service holdoff time 
over, scheduling restart.
Sep 11 06:43:45 dev-eu-w-01-sgmm-0-0 systemd: Stopped Mesos Master.
{code}
I'm continuing to investigate.



[jira] [Commented] (MESOS-10188) Master check failure : scalars does not contain agent

2020-09-14 Thread Andrei Sekretenko (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195415#comment-17195415
 ] 

Andrei Sekretenko commented on MESOS-10188:
---

Thank you for the logs! 

The check that made the master crash has indeed been introduced in 1.10.x by 
the fix for https://issues.apache.org/jira/browse/MESOS-10128.

Note that in addition to re-adding 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732, a 
TEARDOWN call for framework fec2ca2a-a1df-44ea-accc-db73eae96e63 has been 
issued right before the crash.
From looking at the possible call sites and at the preceding log, my 
impression is that we have a double untracking of resources somewhere in the 
task/framework removal path (or maybe in the case when framework removal 
overlaps with some agent-related changes).

I have yet to figure out where exactly this happens, and whether this is 
causing silent issues in older Mesos versions or not.
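
To make the "double untracking" hypothesis concrete, here is a minimal sketch of 
the failure mode. It is not the Mesos allocator code: the class ScalarTracker, 
its methods and the quantity 4.0 are hypothetical, and it throws an exception 
where the real master aborts on a failed check.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical per-agent bookkeeping of a scalar quantity (e.g. CPUs).
// Illustration only; this is NOT the code in hierarchical.cpp.
class ScalarTracker {
public:
  // Start (or continue) tracking `amount` for the given agent.
  void track(const std::string& agentId, double amount) {
    assert(amount > 0.0);
    scalars_[agentId] += amount;
  }

  // Stop tracking `amount` for the given agent. The entry is erased once
  // nothing is left, so a second untrack of the same resources finds no
  // entry and fails, mirroring "scalars does not contain <agent>".
  void untrack(const std::string& agentId, double amount) {
    auto it = scalars_.find(agentId);
    if (it == scalars_.end()) {
      throw std::logic_error("scalars does not contain " + agentId);
    }
    it->second -= amount;
    if (it->second <= 0.0) {
      scalars_.erase(it);
    }
  }

  bool contains(const std::string& agentId) const {
    return scalars_.count(agentId) > 0;
  }

private:
  std::map<std::string, double> scalars_;
};
```

With this sketch, calling track("S732", 4.0) followed by untrack("S732", 4.0) 
succeeds, while a second untrack("S732", 4.0) throws. Without such a check, the 
second call would silently corrupt the accounting, which is the concern for 
older versions.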

[~Jerome Soussens]
Do you have the stack trace of the crash? 
That could greatly simplify things, as there are something like 5 different 
paths through the allocator that might have triggered that check.

Normally, the stack trace is printed by the Mesos master (via the signal 
handler in the glog library) into the master's stderr without any logging 
prefixes. Also, depending on your configuration, you might have a coredump 
left from the crash; in that case, the stack trace should also be extractable 
from there.
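
As a rough aid for finding such unprefixed lines in captured output, the sketch 
below tests whether a line starts with the usual glog prefix (severity letter, 
month and day, time with microseconds, thread ID, file:line). The function name 
hasGlogPrefix and the regular expression are my own assumptions for 
illustration, not part of glog or Mesos; lines for which it returns false are 
candidates for stack trace output.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if `line` starts with a glog-style prefix such as
// "F0911 06:43:25.109040 46181 hierarchical.cpp:232] ".
// Hypothetical helper; the pattern is an approximation of the format.
bool hasGlogPrefix(const std::string& line) {
  static const std::regex prefix(
      R"(^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d{6}\s+\d+ [^ :]+:\d+\] )");
  return std::regex_search(line, prefix);
}

// Sanity check using the fatal log line quoted in this issue and a
// made-up line that carries no logging prefix.
inline void checkExamples() {
  assert(hasGlogPrefix(
      "F0911 06:43:25.109040 46181 hierarchical.cpp:232] Check failed"));
  assert(!hasGlogPrefix("    some line without a logging prefix"));
}
```

Filtering the master's stderr down to the lines that fail this test, around the 
time of the FATAL message, should narrow down where a stack trace would be if 
one was written.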



[jira] [Commented] (MESOS-10188) Master check failure : scalars does not contain agent

2020-09-14 Thread Jerome Soussens (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195274#comment-17195274
 ] 

Jerome Soussens commented on MESOS-10188:
-

Hi [~asekretenko],

Here is the Jira bug ticket for the issue we discussed on Slack.

It seems to be linked to the network outage we had on the agent.

Thanks in advance for your help,

Jerome
