[jira] [Commented] (MESOS-9419) Executor to framework message crashes master if framework has not re-registered.
    [ https://issues.apache.org/jira/browse/MESOS-9419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702527#comment-16702527 ]

Chun-Hung Hsiao commented on MESOS-9419:

Backported to 1.7.x, 1.6.x, 1.5.x, and the unofficially-maintained 1.4.x as well.

1.7.x:
{noformat}
commit bd74257ff8ab8d7bd305aa694c3cd7cbd6840af0
Author: Chun-Hung Hsiao
Date:   Mon Nov 26 20:12:36 2018 -0800

    Fixed master crash when executors send messages to recovered frameworks.

    The `Framework::send` function assumes that either `http` or `pid` is
    set, which is not true for a framework that hasn't yet reregistered yet
    but recovered from a reregistered agent. As a result, the master would
    crash when a recovered executor tries to send a message to such a
    framework (see MESOS-9419). This patch fixes this crash bug.

    Review: https://reviews.apache.org/r/69451
{noformat}

1.6.x:
{noformat}
commit 2d7cb6b60d6cdd3c1dbe1470f0afa044ae78c10c
Author: Chun-Hung Hsiao
Date:   Mon Nov 26 20:12:36 2018 -0800

    Fixed master crash when executors send messages to recovered frameworks.

    The `Framework::send` function assumes that either `http` or `pid` is
    set, which is not true for a framework that hasn't yet reregistered yet
    but recovered from a reregistered agent. As a result, the master would
    crash when a recovered executor tries to send a message to such a
    framework (see MESOS-9419). This patch fixes this crash bug.

    Review: https://reviews.apache.org/r/69451
{noformat}

1.5.x:
{noformat}
commit d27d057b7769eafa3e967763a073a2841520e050
Author: Chun-Hung Hsiao
Date:   Mon Nov 26 20:12:36 2018 -0800

    Fixed master crash when executors send messages to recovered frameworks.

    The `Framework::send` function assumes that either `http` or `pid` is
    set, which is not true for a framework that hasn't yet reregistered yet
    but recovered from a reregistered agent. As a result, the master would
    crash when a recovered executor tries to send a message to such a
    framework (see MESOS-9419). This patch fixes this crash bug.

    Review: https://reviews.apache.org/r/69451
{noformat}

> Executor to framework message crashes master if framework has not
> re-registered.
>
>                 Key: MESOS-9419
>                 URL: https://issues.apache.org/jira/browse/MESOS-9419
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.2.3, 1.3.0, 1.3.1, 1.3.2, 1.4.0,
> 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.7.0
>            Reporter: Benjamin Mahler
>            Assignee: Chun-Hung Hsiao
>            Priority: Blocker
>             Fix For: 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
> If the executor sends a framework message after a master failover, and the
> framework has not yet re-registered with the master, this will crash the
> master:
> {code}
> W20181105 22:02:48.782819 172709 master.hpp:2304] Master attempted to send
> message to disconnected framework 03dc2603-acd6-491e-\ 8717-3f03e5ee37f4-
> (Cook-1.24.0-9299b474217db499c9d28738050b359ac8dd55bb)
> F20181105 22:02:48.782830 172709 master.hpp:2314] CHECK_SOME(pid): is NONE
> *** Check failure stack trace: ***
> *** @ 0x7f09e016b6cd google::LogMessage::Fail()
> *** @ 0x7f09e016d38d google::LogMessage::SendToLog()
> *** @ 0x7f09e016b2b3 google::LogMessage::Flush()
> *** @ 0x7f09e016de09 google::LogMessageFatal::~LogMessageFatal()
> *** @ 0x7f09df086228 _CheckFatal::~_CheckFatal()
> *** @ 0x7f09df3a403d mesos::internal::master::Framework::send<>()
> *** @ 0x7f09df2f4886 mesos::internal::master::Master::executorMessage()
> *** @ 0x7f09df3b06a4
> _ZN15ProtobufProcessIN5mesos8internal6master6MasterEE8handlerNINS1_26ExecutorToFrameworkMessageEJRKNS0\
>
> _7SlaveIDERKNS0_11FrameworkIDERKNS0_10ExecutorIDERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcJS9_SC_SF_SN_EEEvPS3_MS3\
> _FvRKN7process4UPIDEDpT1_ESS_SN_DpMT_KFT0_vE @ 0x7f09df345b43
> std::_Function_handler<>::_M_invoke()
> *** @ 0x7f09df36930f ProtobufProcess<>::consume()
> *** @ 0x7f09df2e0ff5 mesos::internal::master::Master::_consume()
> *** @ 0x7f09df2f5542 mesos::internal::master::Master::consume()
> *** @ 0x7f09e00d9c7a process::ProcessManager::resume()
> *** @ 0x7f09e00dd836
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> *** @ 0x7f09dd467ac8 execute_native_thread_routine
> *** @ 0x7f09dd6f6b50 start_thread
> *** @ 0x7f09dcc7030d (unknown)
> {code}
> This is because Framework::send proceeds if the framework is disconnected. In
> the case of a recovered framework, it will not have a pid or http connection
> yet:
> https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.hpp#L2590-L2610
> {code}
> // Sends a message to the connected framework.
> template <typename Message>
> void Framework::send(const Message& message)
> {
>   if (!conne
[jira] [Commented] (MESOS-9419) Executor to framework message crashes master if framework has not re-registered.
    [ https://issues.apache.org/jira/browse/MESOS-9419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701297#comment-16701297 ]

Chun-Hung Hsiao commented on MESOS-9419:

I can confirm that the unit test will crash the master on 1.2.0:
{noformat}
I1127 17:59:18.429213 35500 executor.cpp:192] Version: 1.2.0
...
W1127 17:59:18.488634 35478 master.hpp:2261] Master attempted to send message to disconnected framework e687230b-c9b1-471d-b8e2-e9b380229227- (default)
F1127 17:59:18.488682 35478 master.hpp:2271] CHECK_SOME(pid): is NONE
*** Check failure stack trace: ***
    @ 0x7f3e4023963d (unknown)
    @ 0x7f3e40238a79 (unknown)
    @ 0x7f3e4023931c (unknown)
    @ 0x7f3e4023c6a9 (unknown)
    @ 0x9830d4 _CheckFatal::~_CheckFatal()
    @ 0x7f3e3ead1389 (unknown)
    @ 0x7f3e3ea60b78 (unknown)
    @ 0x7f3e3eb47f5f (unknown)
    @ 0x7f3e3eb4961e (unknown)
    @ 0x7f3e3eb49356 (unknown)
    @ 0x7f3e3eb49167 (unknown)
    @ 0x1272d20 std::function<>::operator()()
    @ 0x7f3e3eac2ecb (unknown)
    @ 0x7f3e3ea6db29 (unknown)
    @ 0x7f3e3ea6d183 (unknown)
    @ 0x7f3e3eaf62ae (unknown)
    @ 0x970921 process::ProcessBase::serve()
    @ 0x7f3e40167945 (unknown)
    @ 0x7f3e40170b96 (unknown)
    @ 0x7f3e40170ab5 (unknown)
    @ 0x7f3e40170a85 (unknown)
    @ 0x7f3e40170969 (unknown)
    @ 0x7f3e40398fcf (unknown)
    @ 0x7f3e38c67e25 start_thread
    @ 0x7f3e37b5334d __clone
{noformat}
[jira] [Commented] (MESOS-9419) Executor to framework message crashes master if framework has not re-registered.
    [ https://issues.apache.org/jira/browse/MESOS-9419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700455#comment-16700455 ]

Tomas Barton commented on MESOS-9419:

[~alexr] we ran into this issue on Mesos 1.4.1 (see [MESOS-8623|https://issues.apache.org/jira/browse/MESOS-8623]). According to {{git blame}}, the bug [would appear in {{1.2.0}}|https://issues.apache.org/jira/browse/MESOS-8623?focusedCommentId=16380688&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16380688].

> Executor to framework message crashes master if framework has not
> re-registered.
>
>                 Key: MESOS-9419
>                 URL: https://issues.apache.org/jira/browse/MESOS-9419
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.5.1, 1.6.1, 1.7.0
>            Reporter: Benjamin Mahler
>            Assignee: Chun-Hung Hsiao
>            Priority: Blocker
>
> If the executor sends a framework message after a master failover, and the
> framework has not yet re-registered with the master, this will crash the
> master:
> {code}
> W20181105 22:02:48.782819 172709 master.hpp:2304] Master attempted to send
> message to disconnected framework 03dc2603-acd6-491e-\ 8717-3f03e5ee37f4-
> (Cook-1.24.0-9299b474217db499c9d28738050b359ac8dd55bb)
> F20181105 22:02:48.782830 172709 master.hpp:2314] CHECK_SOME(pid): is NONE
> *** Check failure stack trace: ***
> *** @ 0x7f09e016b6cd google::LogMessage::Fail()
> *** @ 0x7f09e016d38d google::LogMessage::SendToLog()
> *** @ 0x7f09e016b2b3 google::LogMessage::Flush()
> *** @ 0x7f09e016de09 google::LogMessageFatal::~LogMessageFatal()
> *** @ 0x7f09df086228 _CheckFatal::~_CheckFatal()
> *** @ 0x7f09df3a403d mesos::internal::master::Framework::send<>()
> *** @ 0x7f09df2f4886 mesos::internal::master::Master::executorMessage()
> *** @ 0x7f09df3b06a4
> _ZN15ProtobufProcessIN5mesos8internal6master6MasterEE8handlerNINS1_26ExecutorToFrameworkMessageEJRKNS0\
>
> _7SlaveIDERKNS0_11FrameworkIDERKNS0_10ExecutorIDERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcJS9_SC_SF_SN_EEEvPS3_MS3\
> _FvRKN7process4UPIDEDpT1_ESS_SN_DpMT_KFT0_vE @ 0x7f09df345b43
> std::_Function_handler<>::_M_invoke()
> *** @ 0x7f09df36930f ProtobufProcess<>::consume()
> *** @ 0x7f09df2e0ff5 mesos::internal::master::Master::_consume()
> *** @ 0x7f09df2f5542 mesos::internal::master::Master::consume()
> *** @ 0x7f09e00d9c7a process::ProcessManager::resume()
> *** @ 0x7f09e00dd836
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> *** @ 0x7f09dd467ac8 execute_native_thread_routine
> *** @ 0x7f09dd6f6b50 start_thread
> *** @ 0x7f09dcc7030d (unknown)
> {code}
> This is because Framework::send proceeds if the framework is disconnected. In
> the case of a recovered framework, it will not have a pid or http connection
> yet:
> https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.hpp#L2590-L2610
> {code}
> // Sends a message to the connected framework.
> template <typename Message>
> void Framework::send(const Message& message)
> {
>   if (!connected()) {
>     LOG(WARNING) << "Master attempted to send message to disconnected"
>                  << " framework " << *this;
>     // XXX proceeds!
>   }
>   metrics.incrementEvent(message);
>   if (http.isSome()) {
>     if (!http->send(message)) {
>       LOG(WARNING) << "Unable to send event to framework " << *this << ":"
>                    << " connection closed";
>     }
>   } else {
>     CHECK_SOME(pid); // XXX Will crash.
>     master->send(pid.get(), message);
>   }
> }
> {code}
> The executor to framework path does not guard against the framework being
> disconnected, unlike the status update path:
> https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L6472-L6495
> vs.
> https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L8371-L8373
> It was reported that this crash didn't occur for the user on 1.2.0; however,
> the issue appears to be present there as well, so we will try to backport a
> test to see if it's indeed not occurring in 1.2.0.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
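
The failure mode described in the issue can be modeled in a few lines. The following is a minimal C++17 sketch and not the patch from https://reviews.apache.org/r/69451: `FrameworkModel`, `recovered()`, and the `delivered` log are hypothetical names, and `std::optional` stands in for the `Option` type used in Mesos. It assumes one plausible remedy, namely dropping the message when neither `pid` nor `http` is set, so that the invariant behind `CHECK_SOME(pid)` already holds by the time it is checked.

```cpp
#include <cassert>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the master's per-framework state. This is a
// simplified model for illustration, not the real `Framework` class from
// src/master/master.hpp.
struct FrameworkModel {
  std::string id;
  std::optional<std::string> pid;      // endpoint of a PID-based scheduler
  std::optional<std::string> http;     // endpoint of an HTTP scheduler
  std::vector<std::string> delivered;  // records what was "sent"

  // A recovered framework is known only through a reregistered agent, so it
  // has neither endpoint until the framework itself reregisters.
  bool recovered() const { return !pid.has_value() && !http.has_value(); }

  // Returns true if the message reached an endpoint, false if it was dropped.
  bool send(const std::string& message) {
    if (recovered()) {
      // Assumed remedy: log and drop instead of falling through to the check.
      std::cerr << "Dropping message for recovered framework " << id << "\n";
      return false;
    }
    if (http.has_value()) {
      delivered.push_back("http:" + message);
      return true;
    }
    assert(pid.has_value());  // plays the role of CHECK_SOME(pid)
    delivered.push_back("pid:" + message);
    return true;
  }
};
```

With this guard, a message for a recovered framework is logged and dropped, while a framework that has a `pid` or `http` endpoint still receives it. The comparison with the status update path in the issue suggests an alternative: perform the same check in the caller before `send` is invoked.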
[jira] [Commented] (MESOS-9419) Executor to framework message crashes master if framework has not re-registered.
    [ https://issues.apache.org/jira/browse/MESOS-9419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700415#comment-16700415 ]

Alexander Rukletsov commented on MESOS-9419:

I'd like to understand, why the user has not observed the issue prior to {{1.5.x}}. [~chhsia0], when you say the issue "appears to be present there as well", does it mean you run your test against {{1.0.x}}?