Hi guys,

I asked Anand to send that email out, so let me chime in.

First off, looks like the motivation for this change was not properly
communicated to the list. I was under the impression that we had a public
discussion about this when I originally introduced this behavior (back in
2013). But I can't seem to find such a thread, so maybe the discussions
were only internal. Sorry about that. I suggest we use this thread for
discussion.

Some background. This behavior was originally introduced as an optimization
because back then the scheduler driver didn't have a way to inform the
scheduler about disconnections. So, for example, when the driver was
disconnected from the master for prolonged periods (e.g., ZK is down), the
scheduler had no way of knowing that all its calls (including
launchTasks()) were being dropped. Since then we have added two things to
the driver that reduce the need for this optimization: 1) the disconnected()
callback, which informs the scheduler of a disconnection, and 2)
reconciliation.
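
As a rough illustration of how a scheduler might combine these two additions,
here is a minimal Python sketch. The callback names (registered, disconnected,
reregistered) and reconcileTasks mirror the driver API, but the signatures are
simplified, and FakeDriver, try_launch and the plain-dict task statuses are
hypothetical stand-ins so the example runs without Mesos installed.

```python
class FakeDriver:
    """Hypothetical stand-in for the scheduler driver; it only records calls."""

    def __init__(self):
        self.launched = []
        self.reconciled = []

    def launchTasks(self, offer_id, task_ids):
        self.launched.append((offer_id, list(task_ids)))

    def reconcileTasks(self, statuses):
        self.reconciled.append(list(statuses))


class ConnectionAwareScheduler:
    """Tracks connectivity via callbacks and reconciles after reconnecting."""

    def __init__(self):
        self.connected = False
        self.known_tasks = set()  # task IDs this scheduler believes it launched

    def registered(self, driver):
        self.connected = True

    def disconnected(self, driver):
        # From here on, calls made through the driver may be dropped.
        self.connected = False

    def reregistered(self, driver):
        self.connected = True
        # Ask the master for the current state of every task we know about,
        # since launches sent around the disconnection may have been lost.
        statuses = [{"task_id": t} for t in sorted(self.known_tasks)]
        driver.reconcileTasks(statuses)

    def try_launch(self, driver, offer_id, task_ids):
        """Hypothetical helper: launch only while we believe we are connected."""
        if not self.connected:
            return False
        self.known_tasks.update(task_ids)
        driver.launchTasks(offer_id, task_ids)
        return True
```

The connected flag is only a hint: as discussed next, the callback can lag
behind the driver's real state, so reconciliation is still needed.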

Admittedly, there are still races where the driver knows that it's
disconnected but the scheduler itself doesn't know it yet (e.g., the
disconnected() callback is queued). But this is no different from a race
where the disconnected event is itself queued on the driver, or where the
launchTasks message is dropped after leaving the driver (e.g., the master
goes down right when the message leaves the driver).

We wanted to remove this optimization because it's a bit hacky. For example,
this is the only call that behaves differently when disconnected (e.g.,
killTask drops the message silently when disconnected). The TASK_LOST
update's 'source' is set to 'SOURCE_MASTER', even though it is generated by
the driver. We also had to add a bunch of code in the driver to deal with
this special-case status update. Finally, when we move to the HTTP API there
will be no driver for schedulers to depend on. Note that a scheduler
using an HTTP client to detect disconnections (in the HTTP API future) is very
similar to getting disconnected() from the driver. There are still races to
consider, and it's best for schedulers to be robust and consistent in
handling such cases.
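
One way a scheduler could be robust to these races, whether or not it ever
sees a disconnection, is to treat every launch as unconfirmed until some
status update arrives and to reconcile the ones that stay silent too long.
The sketch below is only an illustration: LaunchTracker, its method names and
the 60-second default are assumptions, not part of the driver or the HTTP API.

```python
import time


class LaunchTracker:
    """Hypothetical helper: remembers launches until a status update arrives."""

    def __init__(self, timeout_secs=60.0, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock  # injectable so the logic can be tested
        self.pending = {}   # task_id -> time the launch was sent

    def launched(self, task_id):
        self.pending[task_id] = self.clock()

    def status_update(self, task_id):
        # Any update (TASK_RUNNING, TASK_LOST, ...) resolves the uncertainty.
        self.pending.pop(task_id, None)

    def overdue(self):
        """Task IDs with no update for at least timeout_secs, sorted."""
        now = self.clock()
        return sorted(
            task_id
            for task_id, sent in self.pending.items()
            if now - sent >= self.timeout_secs
        )
```

A scheduler would call overdue() periodically and pass the returned task IDs
to reconciliation, which covers dropped launches regardless of where in the
pipeline the message was lost.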

Having said that, I guess it's OK for the schedulers to depend on this
crutch in the driver if they really want to. We can keep this hack in place
until we sunset the driver itself in favor of the HTTP API.

HTH,

On Tue, Jun 23, 2015 at 10:33 AM, Dave Lester <d...@davelester.org> wrote:

> Hi Marco and Anand,
>
> I see a difference between a brief conversation on the JIRA issue, and
> creating a separate thread to propose a breaking change -- particularly
> when it's one that affects framework writers who may not be active in
> the day-to-day changes of the core. Now that JIRA issue emails are sent
> to dev@ instead of issues@, I think it's even more important that
> separate threads are created on dev@ to discuss such changes prior to
> having them committed.
>
> It looks like there's at least one comment on the JIRA issue since this
> notice went to the list, which is discussion that should really be
> happening up front rather than after it's committed.
>
> Lastly, I think it's important that we break out of the practice of
> encouraging folks to communicate off-list for changes like this. I
> understand your comment to "talk to BenH for greater detail" was meant
> with good intent, but I think it's important that we as a community
> collectively engage on the mailing list rather than relying on
> out-of-band communication when making decisions.
>
> Thanks for understanding and hearing me out!
>
> Dave
>
> On Tue, Jun 23, 2015, at 10:01 AM, Marco Massenzio wrote:
> > Hey Dave,
> >
> > sorry about the confusion, but the "deprecation cycle" is happening: this
> > change won't take place until 0.24 is out (as the title of this email
> > states); this will obviously be captured in the update notes from 0.23 to
> > 0.24: as you correctly pointed out, we wanted to give folks very early
> > notice of the impending change.
> >
> > The conversation has actually taken place on the MESOS-1988 ticket
> > (https://issues.apache.org/jira/browse/MESOS-1988), which also gets
> > forwarded to the issues@ mailing list; this was also proposed and
> > shepherded by Vinod, so I would recommend you follow up with him if you
> > want to further clarify matters.
> >
> > In our limited understanding, this was an "undocumented" behavior so we
> > would expect the impact to be minor and the suggested solution to be a
> > more desirable behavior.
> >
> > Please also feel free to reach out to Ben H to discuss in greater depth.
> >
> > Thanks for being vigilant!
> >
> > *Marco Massenzio*
> > *Distributed Systems Engineer*
> >
> > On Mon, Jun 22, 2015 at 9:38 PM, Dave Lester <d...@davelester.org> wrote:
> >
> > > Hi Anand,
> > >
> > > Was there a discussion thread on this?
> > >
> > > Breaking changes should only be introduced when the community has had a
> > > chance to discuss its impact and any necessary deprecation cycle -- I
> > > didn't see a discussion on the relevant thread, but perhaps I missed
> > > something?
> > >
> > > Thanks,
> > > Dave
> > >
> > > On Mon, Jun 22, 2015, at 05:23 PM, Anand Mazumdar wrote:
> > > > Hi All,
> > > >
> > > > We intend to introduce a breaking change [1] in the driver to silently
> > > > ignore launchTasks/acceptOffers(…) calls when disconnected from the
> > > > master in 0.24. The previous behavior was to send out “TASK_LOST”
> > > > messages since there was no way to know that these task launches were
> > > > dropped. However, with the advent of Task Reconciliation, this feature
> > > > is redundant. Other calls like killTask/requestResource et al. already
> > > > had this behavior.
> > > >
> > > > If your existing framework relied on this behavior, I would encourage
> > > > you to use the Task Reconciliation API [2] in lieu of this feature/hack.
> > > > Let me know if you have any queries/concerns.
> > > >
> > > > Links:
> > > > [1] Tracking JIRA: https://issues.apache.org/jira/browse/MESOS-1988
> > > > [2] Task Reconciliation API:
> > > > http://mesos.apache.org/documentation/latest/reconciliation/
> > > >
> > > > -anand
> > >
> > >
>
