[ https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler updated MESOS-1821:
-----------------------------------
Description:
Looks like the recent CHECKs I've added exposed a bug in the framework re-registration logic by which we didn't keep the executors consistent between the Slave and Framework structs:
{noformat: title=Master Log}
I0919 18:05:06.915204 28914 master.cpp:3291] Executor aurora.gc of framework 201103282247-0000000019-0000 on slave 20140905-173231-1890854154-5050-31333-0 at slave(1)@IP:5051 (HOSTNAME) exited with status 0
I0919 18:05:06.915271 28914 master.cpp:4430] Removing executor 'aurora.gc' with resources cpus(*):0.19; disk(*):15; mem(*):127 of framework 201103282247-0000000019-0000 on slave 20140905-173231-1890854154-5050-31333-0 at slave(1)@IP:5051 (HOSTNAME)
F0919 18:05:06.915375 28914 master.hpp:1061] Check failed: hasExecutor(slaveId, executorId) Unknown executor aurora.gc of framework 201103282247-0000000019-0000 of slave 20140905-173231-1890854154-5050-31333-0
*** Check failure stack trace: ***
    @     0x7fd16c81737d  google::LogMessage::Fail()
    @     0x7fd16c8191c4  google::LogMessage::SendToLog()
    @     0x7fd16c816f6c  google::LogMessage::Flush()
    @     0x7fd16c819ab9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fd16c34e09b  mesos::internal::master::Framework::removeExecutor()
    @     0x7fd16c2da2e4  mesos::internal::master::Master::removeExecutor()
    @     0x7fd16c2e6255  mesos::internal::master::Master::exitedExecutor()
    @     0x7fd16c348269  ProtobufProcess<>::handler4<>()
    @     0x7fd16c2fc18e  std::_Function_handler<>::_M_invoke()
    @     0x7fd16c322132  ProtobufProcess<>::visit()
    @     0x7fd16c2cef7a  mesos::internal::master::Master::_visit()
    @     0x7fd16c2dc3d8  mesos::internal::master::Master::visit()
    @     0x7fd16c7c2502  process::ProcessManager::resume()
    @     0x7fd16c7c280c  process::schedule()
    @     0x7fd16b9c683d  start_thread
    @     0x7fd16a2b626d  clone
{noformat}
This occurs sometime after a failover and indicates that the Slave and Framework structs are not kept in sync.
The problem seems to be here: when re-registering a framework on a failed-over master, we only consider executors for which there are tasks stored in the master:
{code}
void Master::_reregisterFramework(
    const UPID& from,
    const FrameworkInfo& frameworkInfo,
    bool failover,
    const Future<Option<Error> >& validationError)
{
  ...
  if (frameworks.registered.count(frameworkInfo.id()) > 0) {
    ...
  } else {
    // We don't have a framework with this ID, so we must be a newly
    // elected Mesos master to which either an existing scheduler or a
    // failed-over one is connecting. Create a Framework object and add
    // any tasks it has that have been reported by reconnecting slaves.
    Framework* framework =
      new Framework(frameworkInfo, frameworkInfo.id(), from, Clock::now());
    framework->reregisteredTime = Clock::now();

    // TODO(benh): Check for root submissions like above!

    // Add any running tasks reported by slaves for this framework.
    foreachvalue (Slave* slave, slaves.registered) {
      foreachkey (const FrameworkID& frameworkId, slave->tasks) {
        foreachvalue (Task* task, slave->tasks[frameworkId]) {
          if (framework->id == task->framework_id()) {
            framework->addTask(task);

            // Also add the task's executor for resource accounting
            // if it's still alive on the slave and we've not yet
            // added it to the framework.
            if (task->has_executor_id() &&
                slave->hasExecutor(framework->id, task->executor_id()) &&
                !framework->hasExecutor(slave->id, task->executor_id())) {
              // XXX: If an executor has no tasks, the executor will not
              // XXX: be added to the Framework struct!
              const ExecutorInfo& executorInfo =
                slave->executors[framework->id][task->executor_id()];
              framework->addExecutor(slave->id, executorInfo);
            }
          }
        }
      }
    }

    // N.B. Need to add the framework _after_ we add its tasks
    // (above) so that we can properly determine the resources it's
    // currently using!
    addFramework(framework);
  }
}
{code}
> CHECK failure in master.
> ------------------------
>
>                 Key: MESOS-1821
>                 URL: https://issues.apache.org/jira/browse/MESOS-1821
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.21.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)