[ 
https://issues.apache.org/jira/browse/MESOS-9419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701297#comment-16701297
 ] 

Chun-Hung Hsiao commented on MESOS-9419:
----------------------------------------

I can confirm that the unit test will crash the master on 1.2.0:
{noformat}
I1127 17:59:18.429213 35500 executor.cpp:192] Version: 1.2.0
...
W1127 17:59:18.488634 35478 master.hpp:2261] Master attempted to send message to disconnected framework e687230b-c9b1-471d-b8e2-e9b380229227-0000 (default)
F1127 17:59:18.488682 35478 master.hpp:2271] CHECK_SOME(pid): is NONE
*** Check failure stack trace: ***
    @     0x7f3e4023963d  (unknown)
    @     0x7f3e40238a79  (unknown)
    @     0x7f3e4023931c  (unknown)
    @     0x7f3e4023c6a9  (unknown)
    @           0x9830d4  _CheckFatal::~_CheckFatal()
    @     0x7f3e3ead1389  (unknown)
    @     0x7f3e3ea60b78  (unknown)
    @     0x7f3e3eb47f5f  (unknown)
    @     0x7f3e3eb4961e  (unknown)
    @     0x7f3e3eb49356  (unknown)
    @     0x7f3e3eb49167  (unknown)
    @          0x1272d20  std::function<>::operator()()
    @     0x7f3e3eac2ecb  (unknown)
    @     0x7f3e3ea6db29  (unknown)
    @     0x7f3e3ea6d183  (unknown)
    @     0x7f3e3eaf62ae  (unknown)
    @           0x970921  process::ProcessBase::serve()
    @     0x7f3e40167945  (unknown)
    @     0x7f3e40170b96  (unknown)
    @     0x7f3e40170ab5  (unknown)
    @     0x7f3e40170a85  (unknown)
    @     0x7f3e40170969  (unknown)
    @     0x7f3e40398fcf  (unknown)
    @     0x7f3e38c67e25  start_thread
    @     0x7f3e37b5334d  __clone
{noformat}

> Executor to framework message crashes master if framework has not 
> re-registered.
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-9419
>                 URL: https://issues.apache.org/jira/browse/MESOS-9419
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.2.3, 1.3.0, 1.3.1, 1.3.2, 1.4.0, 
> 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.7.0
>            Reporter: Benjamin Mahler
>            Assignee: Chun-Hung Hsiao
>            Priority: Blocker
>
> If the executor sends a framework message after a master failover, and the 
> framework has not yet re-registered with the master, this will crash the 
> master:
> {code}
> W20181105 22:02:48.782819 172709 master.hpp:2304] Master attempted to send message to disconnected framework 03dc2603-acd6-491e-8717-3f03e5ee37f4-0000 (Cook-1.24.0-9299b474217db499c9d28738050b359ac8dd55bb)
> F20181105 22:02:48.782830 172709 master.hpp:2314] CHECK_SOME(pid): is NONE
> *** Check failure stack trace: ***
> *** @ 0x7f09e016b6cd google::LogMessage::Fail()
> *** @ 0x7f09e016d38d google::LogMessage::SendToLog()
> *** @ 0x7f09e016b2b3 google::LogMessage::Flush()
> *** @ 0x7f09e016de09 google::LogMessageFatal::~LogMessageFatal()
> *** @ 0x7f09df086228 _CheckFatal::~_CheckFatal()
> *** @ 0x7f09df3a403d mesos::internal::master::Framework::send<>()
> *** @ 0x7f09df2f4886 mesos::internal::master::Master::executorMessage()
> *** @ 0x7f09df3b06a4 _ZN15ProtobufProcessIN5mesos8internal6master6MasterEE8handlerNINS1_26ExecutorToFrameworkMessageEJRKNS0_7SlaveIDERKNS0_11FrameworkIDERKNS0_10ExecutorIDERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEJS9_SC_SF_SN_EEEvPS3_MS3_FvRKN7process4UPIDEDpT1_ESS_SN_DpMT_KFT0_vE
> *** @ 0x7f09df345b43 std::_Function_handler<>::_M_invoke()
> *** @ 0x7f09df36930f ProtobufProcess<>::consume()
> *** @ 0x7f09df2e0ff5 mesos::internal::master::Master::_consume()
> *** @ 0x7f09df2f5542 mesos::internal::master::Master::consume()
> *** @ 0x7f09e00d9c7a process::ProcessManager::resume()
> *** @ 0x7f09e00dd836 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> *** @ 0x7f09dd467ac8 execute_native_thread_routine
> *** @ 0x7f09dd6f6b50 start_thread
> *** @ 0x7f09dcc7030d (unknown)
> {code}
> This is because Framework::send proceeds even when the framework is
> disconnected. A recovered framework that has not yet re-registered has
> neither a pid nor an HTTP connection:
> https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.hpp#L2590-L2610
> {code}
> // Sends a message to the connected framework.
> template <typename Message>
> void Framework::send(const Message& message)
> {
>   if (!connected()) {
>     LOG(WARNING) << "Master attempted to send message to disconnected"
>                  << " framework " << *this;
>     // XXX proceeds!
>   }
>   metrics.incrementEvent(message);
>   if (http.isSome()) {
>     if (!http->send(message)) {
>       LOG(WARNING) << "Unable to send event to framework " << *this << ":"
>                    << " connection closed";
>     }
>   } else {
>     CHECK_SOME(pid); // XXX Will crash.
>     master->send(pid.get(), message);
>   }
> }
> {code}
> The executor to framework path does not guard against the framework being 
> disconnected, unlike the status update path:
> https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L6472-L6495
> vs.
> https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L8371-L8373
> It was reported that this crash didn't occur for the user on 1.2.0. However,
> the issue appears to be present there as well, so we will try to backport a
> test to check whether it really does not occur in 1.2.0.
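
For illustration only, the guard that the description says is missing can be sketched in a self-contained way. This is not the Mesos code and not necessarily the fix that was committed: FrameworkModel, its fields, and forwardExecutorMessage are hypothetical stand-ins, the HTTP branch of Framework::send is left out, and a plain assert stands in for CHECK_SOME. The sketch drops the message when the framework is disconnected, both in the caller and inside send, so the pid check is never reached for a recovered framework.

```cpp
#include <cassert>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the master's Framework struct. Only the state
// needed to illustrate the guard is modeled; the HTTP branch is omitted.
struct FrameworkModel {
  std::string id;
  bool isConnected = false;             // false for a recovered framework
  std::optional<std::string> pid;       // unset until the scheduler re-registers
  std::vector<std::string> delivered;   // records messages that were "sent"

  bool connected() const { return isConnected; }

  // Returns true if the message was delivered, false if it was dropped.
  bool send(const std::string& message) {
    if (!connected()) {
      std::cerr << "Dropping message to disconnected framework " << id << '\n';
      return false;                     // return early instead of proceeding
    }
    assert(pid.has_value());            // plays the role of CHECK_SOME(pid)
    delivered.push_back(message);
    return true;
  }
};

// Hypothetical caller-side guard, analogous in spirit to the check the
// description attributes to the status update path.
bool forwardExecutorMessage(FrameworkModel* framework, const std::string& data) {
  if (framework == nullptr || !framework->connected()) {
    return false;                       // drop: framework has not re-registered
  }
  return framework->send(data);
}
```

With this model, forwarding to a framework whose isConnected is false returns false and leaves delivered empty. Once isConnected is set and pid is populated, the same call delivers the message. Whether a dropped message should also be counted in a metric is a design choice the sketch leaves open.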



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
