James Peach created MESOS-3834:
----------------------------------

Summary: slave upgrade framework checkpoint incompatibility
Key: MESOS-3834
URL: https://issues.apache.org/jira/browse/MESOS-3834
Project: Mesos
Issue Type: Bug
Affects Versions: 0.24.1
Reporter: James Peach
Assignee: James Peach

We are upgrading from 0.22 to 0.25 and experienced the following crash in the 0.24 slave:

{code}
F1104 05:20:49.162701 1153 slave.cpp:4175] Check failed: frameworkInfo.has_id()
*** Check failure stack trace: ***
    @ 0x7fef9c294650 google::LogMessage::Fail()
    @ 0x7fef9c29459f google::LogMessage::SendToLog()
    @ 0x7fef9c293fb0 google::LogMessage::Flush()
    @ 0x7fef9c296ce4 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fef9b9a5492 mesos::internal::slave::Slave::recoverFramework()
    @ 0x7fef9b9a3314 mesos::internal::slave::Slave::recover()
    @ 0x7fef9b9d069c _ZZN7process8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS4_5state5StateEES9_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSG_FSE_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESP_
    @ 0x7fef9ba039f4 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS8_5state5StateEESD_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSK_FSI_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
{code}

As near as I can tell, this is what happened:

- 0.22 wrote {{framework.info}} without the FrameworkID.
- 0.23 had a compatibility check, so it tolerated the missing ID.
- 0.24 removed the compatibility check in MESOS-2259.
- The framework checkpoint doesn't get rewritten during recovery. When the 0.24 slave starts, it therefore reads the version written by 0.22.
- 0.24 then fails the {{frameworkInfo.has_id()}} check and aborts.