[ https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler updated MESOS-1821:
-----------------------------------
Description:
Looks like the recent CHECKs I've added exposed a bug in the framework re-registration logic by which we didn't keep the executors consistent between the Slave and Framework structs:
{noformat: title=Master Log}
I0919 18:05:06.915204 28914 master.cpp:3291] Executor aurora.gc of framework 201103282247-0000000019-0000 on slave 20140905-173231-1890854154-5050-31333-0 at slave(1)@IP:5051 (HOSTNAME) exited with status 0
I0919 18:05:06.915271 28914 master.cpp:4430] Removing executor 'aurora.gc' with resources cpus(*):0.19; disk(*):15; mem(*):127 of framework 201103282247-0000000019-0000 on slave 20140905-173231-1890854154-5050-31333-0 at slave(1)@IP:5051 (HOSTNAME)
F0919 18:05:06.915375 28914 master.hpp:1061] Check failed: hasExecutor(slaveId, executorId) Unknown executor aurora.gc of framework 201103282247-0000000019-0000 of slave 20140905-173231-1890854154-5050-31333-0
*** Check failure stack trace: ***
    @     0x7fd16c81737d  google::LogMessage::Fail()
    @     0x7fd16c8191c4  google::LogMessage::SendToLog()
    @     0x7fd16c816f6c  google::LogMessage::Flush()
    @     0x7fd16c819ab9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fd16c34e09b  mesos::internal::master::Framework::removeExecutor()
    @     0x7fd16c2da2e4  mesos::internal::master::Master::removeExecutor()
    @     0x7fd16c2e6255  mesos::internal::master::Master::exitedExecutor()
    @     0x7fd16c348269  ProtobufProcess<>::handler4<>()
    @     0x7fd16c2fc18e  std::_Function_handler<>::_M_invoke()
    @     0x7fd16c322132  ProtobufProcess<>::visit()
    @     0x7fd16c2cef7a  mesos::internal::master::Master::_visit()
    @     0x7fd16c2dc3d8  mesos::internal::master::Master::visit()
    @     0x7fd16c7c2502  process::ProcessManager::resume()
    @     0x7fd16c7c280c  process::schedule()
    @     0x7fd16b9c683d  start_thread
    @     0x7fd16a2b626d  clone
{noformat}
This occurs sometime after a failover and indicates that the Slave and Framework structs are not kept in sync.
The problem seems to be here: when re-registering a framework on a failed-over master, we only consider executors for which there are tasks stored in the master:
{code}
void Master::_reregisterFramework(
    const UPID& from,
    const FrameworkInfo& frameworkInfo,
    bool failover,
    const Future<Option<Error> >& validationError)
{
  ...
  if (frameworks.registered.count(frameworkInfo.id()) > 0) {
    ...
  } else {
    // We don't have a framework with this ID, so we must be a newly
    // elected Mesos master to which either an existing scheduler or a
    // failed-over one is connecting. Create a Framework object and add
    // any tasks it has that have been reported by reconnecting slaves.
    Framework* framework =
      new Framework(frameworkInfo, frameworkInfo.id(), from, Clock::now());
    framework->reregisteredTime = Clock::now();

    // TODO(benh): Check for root submissions like above!

    // Add any running tasks reported by slaves for this framework.
    foreachvalue (Slave* slave, slaves.registered) {
      foreachkey (const FrameworkID& frameworkId, slave->tasks) {
        foreachvalue (Task* task, slave->tasks[frameworkId]) {
          if (framework->id == task->framework_id()) {
            framework->addTask(task);

            // Also add the task's executor for resource accounting
            // if it's still alive on the slave and we've not yet
            // added it to the framework.
            if (task->has_executor_id() &&
                slave->hasExecutor(framework->id, task->executor_id()) &&
                !framework->hasExecutor(slave->id, task->executor_id())) {
              // XXX: If an executor has no tasks, the executor will not
              // XXX: be added to the Framework struct!
              const ExecutorInfo& executorInfo =
                slave->executors[framework->id][task->executor_id()];
              framework->addExecutor(slave->id, executorInfo);
            }
          }
        }
      }
    }

    // N.B. Need to add the framework _after_ we add its tasks
    // (above) so that we can properly determine the resources it's
    // currently using!
    addFramework(framework);
  }
}
{code}
> CHECK failure in master.
> ------------------------
>
>                 Key: MESOS-1821
>                 URL: https://issues.apache.org/jira/browse/MESOS-1821
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.21.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)