[ 
https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961259#comment-16961259
 ] 

Asha Rostamianfar commented on MESOS-9609:
------------------------------------------

[~greggomann] We have been seeing this issue frequently in the Toil workflow 
engine since updating Mesos from v1.0.1[*]. I have attached logs and details at 
https://github.com/DataBiosphere/toil/issues/2740, which will hopefully help 
with debugging. Please let me know if you need any additional information or 
help reproducing the issue. Thanks!

[*] The update to v1.8.* or v1.9.0 was the first Mesos update in the Toil 
pipeline in a long time, so unfortunately we cannot narrow down which version 
introduced the problem. However, we are certain that the issue did not occur 
with the old version. Also note that the new setup runs Ubuntu 18.04, while 
the old pipeline used Ubuntu 16.04.

> Master check failure when marking agent unreachable
> ---------------------------------------------------
>
>                 Key: MESOS-9609
>                 URL: https://issues.apache.org/jira/browse/MESOS-9609
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Greg Mann
>            Assignee: Greg Mann
>            Priority: Critical
>              Labels: foundations, mesosphere
>             Fix For: 1.9.0
>
>
> {code}
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815433    13 
> http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815588    13 
> master.cpp:5467] Processing DECLINE call for offers: [ 
> 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 
> 5e57f633-a69c-4009-b7
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815693    13 
> master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820142    10 
> master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820367    10 
> registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the 
> registry
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820572    10 
> registrar.cpp:552] Successfully updated the registry in 175872ns
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820642    11 
> master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957     9 
> hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49
> Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.851961    11 
> master.cpp:10018] Check failed: 'framework' Must be non NULL
> Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044a7d  
> google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6046830  
> google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044663  
> google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6047259  
> google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5258e14  
> google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521dfc8  
> mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521f1a2  
> mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5f98f11  
> process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb2a4a  
> process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb65d6  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c35d4c80  (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2de76ba  start_thread
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2b1d41d  (unknown)
> Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) 
> try "date -d @1520762676" if you are using GNU date ***
> Mar 11 10:04:36 research docker[4503]: PC: @     0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 
> (TID 0x7f96b986d700) from PID 0; stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2df1390 (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c604ce2c 
> google::DumpStackTraceAndExit()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044a7d 
> google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6046830 
> google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044663 
> google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6047259 
> google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5258e14 
> google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521dfc8 
> mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521f1a2 
> mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5f98f11 
> process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb2a4a 
> process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb65d6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c35d4c80 (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2de76ba start_thread
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2b1d41d (unknown)
> Mar 11 10:04:38 research systemd[1]: mesos-master2.service: main process 
> exited, code=exited, status=139/n/a
> Mar 11 10:04:38 research docker[18886]: mesos-master
> Mar 11 10:04:38 research systemd[1]: Unit mesos-master2.service entered 
> failed state.
> {code}



