[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973514#comment-15973514 ]
Benjamin Mahler commented on MESOS-7389:
----------------------------------------

As far as I can tell, fixing this to support pre-1.0 agents is complicated and is likely to produce its own subtle bugs. 1.2+ masters maintain an invariant that each task / executor has a known allocation role (the master can determine this because 1.0+ agents report their frameworks). If we were to support pre-1.0 agents against a 1.2+ master, the master would have to be updated to handle tasks that have an unknown allocation role (i.e. what used to be called "orphaned" tasks). A partial fix here would be to handle the case where the framework is already re-registered, leaving only the "orphaned" task case triggering this check.

[~neilc] [~vinodkone] The handling of pre-1.0 agents in the context of "orphaned" tasks already appears to have issues, e.g.:

* Master is upgraded to 1.2.x.
* A pre-1.0 agent re-registers with a task and the task's framework id, but doesn't send the FrameworkInfos.
* This task's framework hasn't re-registered yet, so this is what we used to call an "orphan" task.
* The re-registration handling drops the task, see [here|https://github.com/apache/mesos/blob/1.2.0/src/master/master.cpp#L5784-L5807].
* Later, when this framework re-registers, the task is absent in the master but known to the agent.

Is this broken or am I missing something? If broken, given that fixing this ticket requires a complicated solution, and we didn't originally intend to support pre-1.0 upgrades for > 1.0.x masters, I'd be inclined to not support it (and possibly cherry-pick safety checks like MESOS-6975).
> Mesos 1.2.0 crashes with pre-1.0 Mesos agents
> ---------------------------------------------
>
>                 Key: MESOS-7389
>                 URL: https://issues.apache.org/jira/browse/MESOS-7389
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>         Environment: Ubuntu 14.04
>            Reporter: Nicholas Studt
>            Assignee: Benjamin Mahler
>            Priority: Critical
>              Labels: mesosphere
>
> During an upgrade from 1.0.1 to 1.2.0, a single mesos-slave re-registering with
> the running leader caused the leader to terminate. All 3 of the masters
> suffered the same failure as the same slave node re-registered against the new
> leader; this continued across the entire cluster until the offending slave
> node was removed and fixed. The fix to the slave node was to remove the mesos
> directory and then start the slave node back up.
>
> F0412 17:24:42.736600  6317 master.cpp:5701] Check failed:
> frameworks_.contains(task.framework_id())
> *** Check failure stack trace: ***
> @ 0x7f59f944f94d google::LogMessage::Fail()
> @ 0x7f59f945177d google::LogMessage::SendToLog()
> @ 0x7f59f944f53c google::LogMessage::Flush()
> @ 0x7f59f9452079 google::LogMessageFatal::~LogMessageFatal()
> I0412 17:24:42.750300  6316 replica.cpp:693] Replica received learned notice
> for position 6896 from @0.0.0.0:0
> @ 0x7f59f88f2341 mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f59f88f488f
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f59f93c3eb1 process::ProcessManager::resume()
> @ 0x7f59f93ccd57
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f59f77cfa60 (unknown)
> @ 0x7f59f6fec184 start_thread
> @ 0x7f59f6d19bed (unknown)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)