Hi,
We have a production pipeline running a series of jobs, with each job
creating a custom Mesos framework to execute all the tasks related to that
job. Both the scheduler and the executor are written using the Python Mesos
API.
Here's a snippet (modified for brevity) of the scheduler code:

class MyMesosScheduler(mesos.Scheduler):
<...>
  def run_jobs(self, jobs):
    for job in jobs:
      framework = mesos_pb2.FrameworkInfo()
      <...>
      driver = MesosSchedulerDriver(self, framework, Flags.mesos_master)
      driver.start()
      for task in job.generate_tasks():
        <...>
      # wait for all tasks to complete
      driver.stop()
<...>

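In case it helps, the "wait for all tasks to complete" step boils down to
counting terminal states in the statusUpdate callback and blocking on a
threading.Event; a simplified sketch (attribute names here are placeholders,
the real code has a bit more bookkeeping):

import threading

import mesos
import mesos_pb2

TERMINAL_STATES = (mesos_pb2.TASK_FINISHED, mesos_pb2.TASK_FAILED,
                   mesos_pb2.TASK_KILLED, mesos_pb2.TASK_LOST)

class MyMesosScheduler(mesos.Scheduler):
  def __init__(self):
    self._pending = set()               # task ids not yet terminal
    self._all_done = threading.Event()  # set once _pending empties

  def statusUpdate(self, driver, status):
    # called by the driver for every task status change
    if status.state in TERMINAL_STATES:
      self._pending.discard(status.task_id.value)
      if not self._pending:
        self._all_done.set()

  def wait_for_tasks(self, timeout=None):
    # run_jobs() blocks here right before calling driver.stop()
    self._all_done.wait(timeout)
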
This usually works just fine, but sometimes the pipeline gets "stuck" on
the second framework, and I can see on the Mesos dashboard that the first
framework is still "active".

I know driver.stop() for the first framework was called and returned (from
my logs, and from the fact that the next job started). I also see this in
the console where the scheduler is running:
I0514 07:05:16.819201 29486 sched.cpp:1286] Asked to stop the driver

*If I add time.sleep(0.1) after driver.stop(), the problem disappears!*
I tried adding driver.join() after driver.stop(), but the behavior was the
same (the join() returned immediately). Adding "del driver" after
driver.stop() made no difference either.
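
To make the comparison concrete, here is the tail end of the per-job loop
in the three variants I tried (each variant in a separate run, never
combined):

# variant 1: works, but feels like voodoo
driver.stop()
time.sleep(0.1)

# variant 2: join() returns immediately, problem persists
driver.stop()
driver.join()

# variant 3: makes no difference either
driver.stop()
del driver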

*So my questions are:*
- What does the Mesos API promise me on return from driver.stop()? (I
thought the promise is that the framework has stopped successfully,
including all of its executors.)
- Is it safe to "recycle" the same instance of MyMesosScheduler for
multiple (consecutive, not overlapping) frameworks? (Note that the driver
object is brand new for every framework.)
- Any thoughts on the problem I'm describing, and potential solutions that
are not based on a voodoo sleep? (One idea I'm considering is sketched
below.)
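
The idea would be to poll the master until the old framework actually
disappears from the list of active frameworks before starting the next
driver. A rough sketch, assuming the master's /master/state.json endpoint
and its "frameworks" field (untested, field names from memory):

import json
import time
import urllib2

def wait_until_unregistered(master, framework_id, timeout=30.0, poll=0.5):
  # Poll the master until framework_id no longer appears among the active
  # frameworks, or until the timeout expires. Returns True if it vanished.
  deadline = time.time() + timeout
  while time.time() < deadline:
    state = json.load(urllib2.urlopen("http://%s/master/state.json" % master))
    if framework_id not in [f["id"] for f in state.get("frameworks", [])]:
      return True
    time.sleep(poll)
  return False

run_jobs() would call this right after driver.stop(), passing the framework
id captured in the registered() callback. Is that a sane approach, or is
there a cleaner signal from the driver itself?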

Thanks!
- Itamar.
