[
https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961259#comment-16961259
]
Asha Rostamianfar edited comment on MESOS-9609 at 10/28/19 8:10 PM:
--------------------------------------------------------------------
[~greggomann] We have been seeing this issue frequently in the Toil workflow
engine after updating Mesos from v1.0.1[*]. I have attached logs and details on
[https://github.com/DataBiosphere/toil/issues/2740], which can hopefully help
with debugging the issue. Please let me know if you need any additional info
and/or any help with reproducing/debugging the issue! Thanks!
Most recent crash log, from v1.9.0:
[https://github.com/DataBiosphere/toil/files/3779869/mesos-1.9.0-crashlog.txt]
[*] Updating to v1.8.* or v1.9.0 was the first Mesos update on the Toil
pipeline in a long time, so unfortunately we cannot narrow down which version
introduced the problem. However, we are certain that the issue did not occur
with the old version. Note also that the new pipeline runs on Ubuntu 18.04,
while the old pipeline used Ubuntu 16.04.
> Master check failure when marking agent unreachable
> ---------------------------------------------------
>
> Key: MESOS-9609
> URL: https://issues.apache.org/jira/browse/MESOS-9609
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.5.0
> Reporter: Greg Mann
> Assignee: Greg Mann
> Priority: Critical
> Labels: foundations, mesosphere
> Fix For: 1.9.0
>
>
> {code}
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815433 13
> http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815588 13
> master.cpp:5467] Processing DECLINE call for offers: [
> 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework
> 5e57f633-a69c-4009-b7
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815693 13
> master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820142 10
> master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820367 10
> registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the
> registry
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820572 10
> registrar.cpp:552] Successfully updated the registry in 175872ns
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820642 11
> master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9
> hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49
> Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.851961 11
> master.cpp:10018] Check failed: 'framework' Must be non NULL
> Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d
> google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830
> google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663
> google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259
> google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14
> google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8
> mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2
> mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11
> process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a
> process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80 (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba start_thread
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d (unknown)
> Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time)
> try "date -d @1520762676" if you are using GNU date ***
> Mar 11 10:04:36 research docker[4503]: PC: @ 0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1
> (TID 0x7f96b986d700) from PID 0; stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2df1390 (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c604ce2c
> google::DumpStackTraceAndExit()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d
> google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830
> google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663
> google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259
> google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14
> google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8
> mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2
> mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11
> process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a
> process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80 (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba start_thread
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d (unknown)
> Mar 11 10:04:38 research systemd[1]: mesos-master2.service: main process
> exited, code=exited, status=139/n/a
> Mar 11 10:04:38 research docker[18886]: mesos-master
> Mar 11 10:04:38 research systemd[1]: Unit mesos-master2.service entered
> failed state.
> {code}
> Additional case:
> {noformat}
> I0715 02:56:40.071446 13 master.cpp:1295] Agent
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051
> (10.0.138.150) disconnected
> I0715 02:56:40.071503 13 master.cpp:3333] Disconnecting agent
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051
> (10.0.138.150)
> I0715 02:56:40.071527 13 master.cpp:3352] Deactivating agent
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051
> (10.0.138.150)
> I0715 02:56:40.071563 13 master.cpp:1319] Removing framework
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from disconnected agent
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051
> (10.0.138.150) because the framework is not checkpointing
> I0715 02:56:40.071579 13 master.cpp:11006] Removing framework
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from agent
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051
> (10.0.138.150)
> I0715 02:56:40.071583 12 hierarchical.cpp:829] Agent
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 deactivated
> I0715 02:56:40.071619 13 master.cpp:11766] Removing executor 'toil-41'
> with resources {} of framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 on
> agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051
> (10.0.138.150)
> I0715 02:58:08.642220 12 master.cpp:9130] Marking agent
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health
> check timed out
> I0715 02:58:08.642675 11 registrar.cpp:487] Applied 1 operations in
> 305592ns; attempting to update the registry
> I0715 02:58:08.642922 13 registrar.cpp:544] Successfully updated the
> registry in 187904ns
> I0715 02:58:08.643081 17 master.cpp:9173] Marked agent
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health
> check timed out
> F0715 02:58:08.643210 17 master.cpp:11402] Check failed: 'framework' Must
> be non NULL
> *** Check failure stack trace: ***
> I0715 02:58:08.643254 12 hierarchical.cpp:680] Removed agent
> 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0
> @ 0x7ffbcffd090d google::LogMessage::Fail()
> @ 0x7ffbcffd2748 google::LogMessage::SendToLog()
> @ 0x7ffbcffd04f3 google::LogMessage::Flush()
> @ 0x7ffbcffd31d9 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7ffbcec65024 google::CheckNotNull<>()
> @ 0x7ffbcec32658 mesos::internal::master::Master::__removeSlave()
> @ 0x7ffbcec33b13 mesos::internal::master::Master::_markUnreachable()
> @ 0x7ffbcec33e55
> _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEUlbE_JbEEEEclEv
> @ 0x7ffbce93d5d8
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
> @ 0x7ffbcff18371 process::ProcessBase::consume()
> @ 0x7ffbcff3a97a process::ProcessManager::resume()
> @ 0x7ffbcff3e6a6
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7ffbcc2cd9e0 (unknown)
> @ 0x7ffbcbde06db start_thread
> @ 0x7ffbcbb0988f (unknown)
> {noformat}