Re: Discussion: Scheduler API for Operation Reconciliation

Chun-Hung Hsiao Thu, 24 Jan 2019 19:19:57 -0800

I chatted with Jie and Gaston, and here is a brief summary:

1. The ordering issue between the synchronous response and the event stream
would lead to extra complication for a framework, and thus the benefit
doesn't seem to worth the complication.
2. However, we should consider not forwarding the reconciliation requests
to the agents. The status updates doesn't require a trigger, and if the
agent could report gone and unregistered RPs to the master, the master can
respond to the reconciliation request itself.
The only problem I see is that frameworks may see
`OPERATION_GONE_BY_OPERATOR` -> `OPERATION_UNREACHABLE` ->
`OPERATION_GONE_BY_OPERATOR`, since the master does not persist gone RPs.


To address the original problem of MESOS-9318, we could do the following:
(1) Agent is gone => `OPERATION_GONE_BY_OPERATOR`
(2) Agent is unreachable => `OPERATION_UNREACHABLE`
(3) Agent is not registered => `OPERATION_RECOVERING`
(4) Agent is unknown => `OPERATION_UNKNOWN`
(5) Agent is registered, RP is gone => `OPERATION_GONE_BY_OPERATOR`
(6) Agent is registered, RP is not registered => `OPERATION_UNREACHABLE` or
`OPERATION_RECOVERING`
(7) Agent is registered, RP is unknown => `OPERATION_UNKNOWN`
(8) Agent is registered, RP is registered => maybe `OPERATION_UNKNOWN`?

So it seems a number of people agree with going with the asynchronous
responses through the event stream. Please reply if you have other opinions!

On Thu, Jan 24, 2019 at 1:39 PM James DeFelice <[email protected]>
wrote:

> I've attempted to implement support for operation status reconciliation in
> a framework that I've been building. Option (III) seems most convenient
> from my perspective as well. A single source of updates:
>
> (a) Leads to a cleaner framework design; I've had to poke a few holes in
> the framework's initial design to deal with multiple event sources, leading
> to increased complexity.
>
> (b) Allows frameworks to consume events in the order they arrive (and
> pushes the responsibility for event ordering back to Mesos). Multiple event
> sources that the framework needs to (possibly) reorder based on a timestamp
> would add further complexity that we should avoid pushing onto framework
> writers.
>
> Some other thoughts:
>
> (c) I've implemented a background polling loop for exactly the reason that
> Benno pointed out. An asychronous API call for operation status
> reconciliation would be fine with me.
>
> (d) API consistency is important. Framework devs are used to the way that
> the task status reconciliation API works, and have come up with solutions
> for dealing with the lack of boundaries for streams of explicit
> reconciliation events. The synchronous response defined for the currently
> published operation status reconciliation call isn't consistent with the
> rest of the v1 scheduler API, which generated a bit of extra work (for me)
> in the low-level mesos v1 http client lib. Consistency should be a primary
> goal when extending existing API sets.
>
> (e) There are probably other ways to solve the problem of a "lack of
> boundaries within the event stream" for explicit reconciliation requests.
> If this is this a problem that other framework devs need solved then let's
> address it as a separate issue - and aim to resolve it in a consistent way
> for both task and operation status event streams.
>
> (f) It sounds like option (III) would let Mesos send back smarter
> operation statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN).
> Anything to limit the number of scenarios where UNKNOWN is returned to
> frameworks sounds good to me.
>
> -James
>
>
>
> On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier <
> [email protected]> wrote:
>
>> Hi,
>>
>> have we reached a conclusion here?
>>
>> From the Mesos side of things I would be strongly in favor of proposal
>> (III). This is not only consistent with what we do with task status
>> updates, but also would allow us to provide improved operation status
>> (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to
>> better distinguish non-terminal from terminal operation states. To
>> accomplish that we wouldn’t need to introduce extra information leakage
>> (e.g., explicitly keeping master up to date on local resource provider
>> state and associated internal consistency complications).
>>
>> This approach should also simplify framework development as a framework
>> would only need to watch a single channel to see operation status updates
>> (no need to reconcile different information sources). The benefits of
>> better status updates and simpler implementation IMO outweigh any benefits
>> of the current approach (disclaimer: I filed the slightly inflammatory
>> MESOS-9448).
>>
>> What is keeping us from moving forward with (III) at this point?
>>
>>
>> Cheers,
>>
>> Benjamin
>>
>> > On Jan 3, 2019, at 11:30 PM, Benno Evers <[email protected]> wrote:
>> >
>> > Hi Chun-Hung,
>> >
>> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per
>> node, then the master need to maintain 20k entries for LRPs.
>> >
>> > How big would the required additional storage be in this scenario? Even
>> if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad
>> for such a big custer.
>> >
>> > In general, it seems hard to discuss the trade-offs between your
>> proposals without looking at the users of that API - do you know if there
>> are ayn frameworks out there that already use
>> >  operation reconciliation, and if so what do they do based on the
>> reconciliation response?
>> >
>> > As far as I know, we don't have any formal guarantees on which
>> operations status changes the framework will receive without
>> reconciliation. So putting on my framework-implementer hat it seems like
>> I'd have no choice but to implement a continously polling background loop
>> anyways if I care about knowing the latest operation statuses. If this is
>> indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem to
>> have little additional benefit.
>> >
>> > Best regards,
>> > Benno
>> >
>> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <[email protected]>
>> wrote:
>> > Hi folks,
>> >
>> > Recently I've being discussing the problems of the current design of the
>> > experimental
>> > `RECONCILE_OPERATIONS` scheduler API with a couple people. The
>> discussion
>> > was started
>> > from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>:
>> when a
>> > framework receives an `OPERATION_UNKNOWN`, it doesn't know
>> > if it should retry the operation or not (further details described
>> below).
>> > As the discussion
>> > evolves, we realize there are more issues to consider, design-wise and
>> > implementation-wise, so
>> > I'd like to reach out to the community to get valuable opinions from you
>> > guys.
>> >
>> > Before I jump right into the issues I'd like to discuss, let me fill you
>> > guys in with some
>> > background of operation reconciliation. Since the design of this feature
>> > was informed by the
>> > pre-existing implementation of task reconciliation, I'll begin there.
>> >
>> > *Task Reconciliation: Design*
>> >
>> > The scheduler API has a `RECONCILE` call for a framework to query the
>> > current statuses
>> > of its tasks. This call supports the following modes:
>> >
>> >    - *Explicit reconciliation*: The framework specifies the list of
>> tasks
>> >    it wants to know
>> >    about, and expects status updates for these tasks.
>> >
>> >    - *Implicit reconciliation*: The framework does not specify a list of
>> >    tasks, and simply
>> >    expects status updates for all tasks the master knows about.
>> >
>> > In both cases, the master looks into its in-memory task bookkeeping and
>> > sends
>> > *one or more`UPDATE` events* to respond to the reconciliation request.
>> >
>> > *Task Reconciliation: Problems*
>> >
>> > This API design of task reconciliation has the following shortcomings:
>> >
>> >    - (1) There is no clear boundary of when the "reconciliation
>> response"
>> >    ends, and thus
>> >    there is
>> > *no 1-1 correspondence between the reconciliation request and the
>> response*.
>> >    For explicit reconciliation, the framework might wait for an
>> extended period
>> >    of time before it receives all status updates; for implicit
>> >    reconciliation, there is no way for
>> >    a framework to tell if it has learned about all of its tasks, which
>> >    could be inconvenient if
>> >    the framework has lost its task bookkeeping.
>> >
>> >    - (2) The "reconciliation response" may be outdated. If an agent
>> >    reregisters after a task
>> >    reconciliation has been responded,
>> > *the framework wouldn't learn about the tasks **from this recovered
>> agent*.
>> >    Mesos relies on the framework to call the `RECONCILE` call
>> >    *periodically* to get up-to-date task statuses.
>> >
>> >
>> >
>> > *Operation Reconciliation: Design & Problems*
>> >
>> > When designing operation reconciliation, we made the
>> `RECONCILE_OPERATIONS`
>> > call
>> > *asynchronous request-response style call* that returns a 200 OK with a
>> > list of operation status
>> > to avoid (1). However, this design does not resolve (2), and also
>> > introduces new problems:
>> >
>> >    - (3) *The synchronous response could race with the event stream* and
>> >    the framework
>> >    does not know which contains the latest operation status.
>> >
>> >    - (4) To ensure scalability, the master does not manage local
>> resource
>> >    providers (LRPs);
>> >    the agents do. So the master cannot tell if an LRP is temporarily
>> >    unreachable/recovering
>> >    or permanently gone. As a result, if the framework explicitly
>> reconciles
>> >    an LRP operation
>> >    that the master does not know about, it can only reply
>> >    `OPERATION_UNKNOWN`, but
>> >    then *the framework would not know if the operation would come back
>> in
>> >    the future*,
>> >    and thus cannot decide if it should reissue another operation, which
>> >    leads to MESOS-9318 <
>> https://issues.apache.org/jira/browse/MESOS-9318>.
>> >
>> >    Note that this is less of a problem for explicit task reconciliation,
>> >    because in most cases
>> >    the master can infer task statuses from agent statuses, and in the
>> rare
>> >    cases that it
>> >    replies `TASK_UNKNOWN`, it is generally safe for the framework to
>> >    relaunch another
>> >    task.
>> >
>> >
>> > *The Open Question*
>> >
>> > Now, the big question here is:
>> > *are the benefits of a synchronous request-responsestyle
>> > `RECONCILE_OPERATIONS` call worth the complexity it introduces* in
>> order to
>> > address (3) and (4) in the code? To explain what the complexity would
>> be,
>> > let me lay out a
>> > couple proposals we've been discussing:
>> >
>> > I. Keep `RECONCILE_OPERATIONS` synchronous
>> >
>> > To address (3), we could add a *timestamp* to every operation status as
>> > well as the
>> > reconciliation response, so the framework can infer which one is the
>> latest
>> > status, and if
>> > it receives a stale operation status update after the reconciliation
>> > response, it can just
>> > ack the status update without updating its bookkeeping. But, the
>> framework
>> > needs to
>> > deal with a corner case:
>> >
>> > *when it receives a reconciliation response containing aterminal
>> operation
>> > status, it may or may not receive one or more status updatesfor that
>> > operation later *because of the race.
>> >
>> >
>> > To address (4), we could either: (a) surface the unreachable/gone LRPs
>> to
>> > the master, or
>> > (b) forward the explicit reconciliation request to the corresponding
>> agent.
>> > The complexity
>> > of (a) is that
>> > *it might not be scalable for the master to maintain the list
>> ofunreachable
>> > and gone LRPs*: imagine that there are 1k nodes and 10 active + 10 gone
>> > LRPs per node, then the master need to maintain 20k entries for LRPs.
>> The
>> > complexity
>> > of (b) is that the response wouldn't be computed based on the master's
>> > state; instead,
>> > *the master needs to wait for the agent's reply to respond to the
>> framework*.
>> > Note
>> > that it's probably not scalable to forward implicit reconciliation
>> requests
>> > to all agents, so
>> > implicit reconciliation might have to still be responded based on the
>> > master's state.
>> >
>> >
>> > II. Make `RECONCILE_OPERATIONS` "semi-synchronous"
>> >
>> > Instead of returning a 200 OK, the master could return a 202 Accepted
>> with
>> > an empty
>> > body, and then
>> > *reply a single event containing the operation status of all
>> > requestedoperations in the event stream asynchronously*. Although the
>> > framework loses the
>> > 1-1 correspondence between the request and the response, there's still a
>> > clear boundary
>> > for a reconciliation response. The advantage of this approach compared
>> to
>> > proposal I is
>> > that we don't have a race between the reconciliation response and the
>> event
>> > stream, so
>> > no timestamp is required. Still, we have to address (4) through either
>> (a)
>> > or (b) described
>> > above, thus the complexity remains. That said, this approach fits with
>> (b)
>> > better since no
>> > synchronous response is needed.
>> >
>> >
>> > III. Make `RECONCILE_OPERATIONS` an asynchronous trigger
>> >
>> > This would be similar to what we have for task reconciliation. The
>> master
>> > would return a
>> > 202 Accepted, and then send
>> > *one or more `UPDATE_OPERATION_STATUS` events*based on its state for an
>> > implicit reconciliation, or
>> > *forward the request to some agent*for an explicit reconciliation. In
>> other
>> > words, this call plays the role of a trigger of the
>> > operation status updates. This approach is the simplest in terms of the
>> > implementation,
>> > but the trade-off is that the framework needs to live with (1).
>> >
>> >
>> > So far we haven't discussed much about (2) for operation
>> reconciliation, so
>> > let's also briefly talk
>> > about it. Potentially (2) can be addressed by making the agent *actively
>> > push *
>> > *operation statusupdates to the framework when an LRP is resubscribed*,
>> so
>> > the framework won't need to do
>> > periodic operation reconciliation. If we do this in the future, it would
>> > also be more aligned with
>> > proposal II or III.
>> >
>> > So the question again: is it worth the complexity to keep
>> > `RECONCILE_OPERATIONS`
>> > synchronous? I'd like to hear the opinions from the community so we can
>> > drive towards a better
>> > API design!
>> >
>> > Best,
>> > Chun-Hung
>> >
>> >
>> > --
>> > Benno Evers
>> > Software Engineer, Mesosphere
>>
>>
>
> --
> James DeFelice
> 585.241.9488 (voice)
> 650.649.6071 (fax)
>

Re: Discussion: Scheduler API for Operation Reconciliation

Reply via email to