I chatted with Jie and Gaston, and here is a brief summary: 1. The ordering issue between the synchronous response and the event stream would lead to extra complication for a framework, and thus the benefit doesn't seem to worth the complication. 2. However, we should consider not forwarding the reconciliation requests to the agents. The status updates doesn't require a trigger, and if the agent could report gone and unregistered RPs to the master, the master can respond to the reconciliation request itself. The only problem I see is that frameworks may see `OPERATION_GONE_BY_OPERATOR` -> `OPERATION_UNREACHABLE` -> `OPERATION_GONE_BY_OPERATOR`, since the master does not persist gone RPs.
To address the original problem of MESOS-9318, we could do the following: (1) Agent is gone => `OPERATION_GONE_BY_OPERATOR` (2) Agent is unreachable => `OPERATION_UNREACHABLE` (3) Agent is not registered => `OPERATION_RECOVERING` (4) Agent is unknown => `OPERATION_UNKNOWN` (5) Agent is registered, RP is gone => `OPERATION_GONE_BY_OPERATOR` (6) Agent is registered, RP is not registered => `OPERATION_UNREACHABLE` or `OPERATION_RECOVERING` (7) Agent is registered, RP is unknown => `OPERATION_UNKNOWN` (8) Agent is registered, RP is registered => maybe `OPERATION_UNKNOWN`? So it seems a number of people agree with going with the asynchronous responses through the event stream. Please reply if you have other opinions! On Thu, Jan 24, 2019 at 1:39 PM James DeFelice <james.defel...@gmail.com> wrote: > I've attempted to implement support for operation status reconciliation in > a framework that I've been building. Option (III) seems most convenient > from my perspective as well. A single source of updates: > > (a) Leads to a cleaner framework design; I've had to poke a few holes in > the framework's initial design to deal with multiple event sources, leading > to increased complexity. > > (b) Allows frameworks to consume events in the order they arrive (and > pushes the responsibility for event ordering back to Mesos). Multiple event > sources that the framework needs to (possibly) reorder based on a timestamp > would add further complexity that we should avoid pushing onto framework > writers. > > Some other thoughts: > > (c) I've implemented a background polling loop for exactly the reason that > Benno pointed out. An asychronous API call for operation status > reconciliation would be fine with me. > > (d) API consistency is important. Framework devs are used to the way that > the task status reconciliation API works, and have come up with solutions > for dealing with the lack of boundaries for streams of explicit > reconciliation events. The synchronous response defined for the currently > published operation status reconciliation call isn't consistent with the > rest of the v1 scheduler API, which generated a bit of extra work (for me) > in the low-level mesos v1 http client lib. Consistency should be a primary > goal when extending existing API sets. > > (e) There are probably other ways to solve the problem of a "lack of > boundaries within the event stream" for explicit reconciliation requests. > If this is this a problem that other framework devs need solved then let's > address it as a separate issue - and aim to resolve it in a consistent way > for both task and operation status event streams. > > (f) It sounds like option (III) would let Mesos send back smarter > operation statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN). > Anything to limit the number of scenarios where UNKNOWN is returned to > frameworks sounds good to me. > > -James > > > > On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier < > benjamin.bann...@mesosphere.io> wrote: > >> Hi, >> >> have we reached a conclusion here? >> >> From the Mesos side of things I would be strongly in favor of proposal >> (III). This is not only consistent with what we do with task status >> updates, but also would allow us to provide improved operation status >> (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to >> better distinguish non-terminal from terminal operation states. To >> accomplish that we wouldn’t need to introduce extra information leakage >> (e.g., explicitly keeping master up to date on local resource provider >> state and associated internal consistency complications). >> >> This approach should also simplify framework development as a framework >> would only need to watch a single channel to see operation status updates >> (no need to reconcile different information sources). The benefits of >> better status updates and simpler implementation IMO outweigh any benefits >> of the current approach (disclaimer: I filed the slightly inflammatory >> MESOS-9448). >> >> What is keeping us from moving forward with (III) at this point? >> >> >> Cheers, >> >> Benjamin >> >> > On Jan 3, 2019, at 11:30 PM, Benno Evers <bev...@mesosphere.com> wrote: >> > >> > Hi Chun-Hung, >> > >> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per >> node, then the master need to maintain 20k entries for LRPs. >> > >> > How big would the required additional storage be in this scenario? Even >> if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad >> for such a big custer. >> > >> > In general, it seems hard to discuss the trade-offs between your >> proposals without looking at the users of that API - do you know if there >> are ayn frameworks out there that already use >> > operation reconciliation, and if so what do they do based on the >> reconciliation response? >> > >> > As far as I know, we don't have any formal guarantees on which >> operations status changes the framework will receive without >> reconciliation. So putting on my framework-implementer hat it seems like >> I'd have no choice but to implement a continously polling background loop >> anyways if I care about knowing the latest operation statuses. If this is >> indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem to >> have little additional benefit. >> > >> > Best regards, >> > Benno >> > >> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <chhs...@apache.org> >> wrote: >> > Hi folks, >> > >> > Recently I've being discussing the problems of the current design of the >> > experimental >> > `RECONCILE_OPERATIONS` scheduler API with a couple people. The >> discussion >> > was started >> > from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>: >> when a >> > framework receives an `OPERATION_UNKNOWN`, it doesn't know >> > if it should retry the operation or not (further details described >> below). >> > As the discussion >> > evolves, we realize there are more issues to consider, design-wise and >> > implementation-wise, so >> > I'd like to reach out to the community to get valuable opinions from you >> > guys. >> > >> > Before I jump right into the issues I'd like to discuss, let me fill you >> > guys in with some >> > background of operation reconciliation. Since the design of this feature >> > was informed by the >> > pre-existing implementation of task reconciliation, I'll begin there. >> > >> > *Task Reconciliation: Design* >> > >> > The scheduler API has a `RECONCILE` call for a framework to query the >> > current statuses >> > of its tasks. This call supports the following modes: >> > >> > - *Explicit reconciliation*: The framework specifies the list of >> tasks >> > it wants to know >> > about, and expects status updates for these tasks. >> > >> > - *Implicit reconciliation*: The framework does not specify a list of >> > tasks, and simply >> > expects status updates for all tasks the master knows about. >> > >> > In both cases, the master looks into its in-memory task bookkeeping and >> > sends >> > *one or more`UPDATE` events* to respond to the reconciliation request. >> > >> > *Task Reconciliation: Problems* >> > >> > This API design of task reconciliation has the following shortcomings: >> > >> > - (1) There is no clear boundary of when the "reconciliation >> response" >> > ends, and thus >> > there is >> > *no 1-1 correspondence between the reconciliation request and the >> response*. >> > For explicit reconciliation, the framework might wait for an >> extended period >> > of time before it receives all status updates; for implicit >> > reconciliation, there is no way for >> > a framework to tell if it has learned about all of its tasks, which >> > could be inconvenient if >> > the framework has lost its task bookkeeping. >> > >> > - (2) The "reconciliation response" may be outdated. If an agent >> > reregisters after a task >> > reconciliation has been responded, >> > *the framework wouldn't learn about the tasks **from this recovered >> agent*. >> > Mesos relies on the framework to call the `RECONCILE` call >> > *periodically* to get up-to-date task statuses. >> > >> > >> > >> > *Operation Reconciliation: Design & Problems* >> > >> > When designing operation reconciliation, we made the >> `RECONCILE_OPERATIONS` >> > call >> > *asynchronous request-response style call* that returns a 200 OK with a >> > list of operation status >> > to avoid (1). However, this design does not resolve (2), and also >> > introduces new problems: >> > >> > - (3) *The synchronous response could race with the event stream* and >> > the framework >> > does not know which contains the latest operation status. >> > >> > - (4) To ensure scalability, the master does not manage local >> resource >> > providers (LRPs); >> > the agents do. So the master cannot tell if an LRP is temporarily >> > unreachable/recovering >> > or permanently gone. As a result, if the framework explicitly >> reconciles >> > an LRP operation >> > that the master does not know about, it can only reply >> > `OPERATION_UNKNOWN`, but >> > then *the framework would not know if the operation would come back >> in >> > the future*, >> > and thus cannot decide if it should reissue another operation, which >> > leads to MESOS-9318 < >> https://issues.apache.org/jira/browse/MESOS-9318>. >> > >> > Note that this is less of a problem for explicit task reconciliation, >> > because in most cases >> > the master can infer task statuses from agent statuses, and in the >> rare >> > cases that it >> > replies `TASK_UNKNOWN`, it is generally safe for the framework to >> > relaunch another >> > task. >> > >> > >> > *The Open Question* >> > >> > Now, the big question here is: >> > *are the benefits of a synchronous request-responsestyle >> > `RECONCILE_OPERATIONS` call worth the complexity it introduces* in >> order to >> > address (3) and (4) in the code? To explain what the complexity would >> be, >> > let me lay out a >> > couple proposals we've been discussing: >> > >> > I. Keep `RECONCILE_OPERATIONS` synchronous >> > >> > To address (3), we could add a *timestamp* to every operation status as >> > well as the >> > reconciliation response, so the framework can infer which one is the >> latest >> > status, and if >> > it receives a stale operation status update after the reconciliation >> > response, it can just >> > ack the status update without updating its bookkeeping. But, the >> framework >> > needs to >> > deal with a corner case: >> > >> > *when it receives a reconciliation response containing aterminal >> operation >> > status, it may or may not receive one or more status updatesfor that >> > operation later *because of the race. >> > >> > >> > To address (4), we could either: (a) surface the unreachable/gone LRPs >> to >> > the master, or >> > (b) forward the explicit reconciliation request to the corresponding >> agent. >> > The complexity >> > of (a) is that >> > *it might not be scalable for the master to maintain the list >> ofunreachable >> > and gone LRPs*: imagine that there are 1k nodes and 10 active + 10 gone >> > LRPs per node, then the master need to maintain 20k entries for LRPs. >> The >> > complexity >> > of (b) is that the response wouldn't be computed based on the master's >> > state; instead, >> > *the master needs to wait for the agent's reply to respond to the >> framework*. >> > Note >> > that it's probably not scalable to forward implicit reconciliation >> requests >> > to all agents, so >> > implicit reconciliation might have to still be responded based on the >> > master's state. >> > >> > >> > II. Make `RECONCILE_OPERATIONS` "semi-synchronous" >> > >> > Instead of returning a 200 OK, the master could return a 202 Accepted >> with >> > an empty >> > body, and then >> > *reply a single event containing the operation status of all >> > requestedoperations in the event stream asynchronously*. Although the >> > framework loses the >> > 1-1 correspondence between the request and the response, there's still a >> > clear boundary >> > for a reconciliation response. The advantage of this approach compared >> to >> > proposal I is >> > that we don't have a race between the reconciliation response and the >> event >> > stream, so >> > no timestamp is required. Still, we have to address (4) through either >> (a) >> > or (b) described >> > above, thus the complexity remains. That said, this approach fits with >> (b) >> > better since no >> > synchronous response is needed. >> > >> > >> > III. Make `RECONCILE_OPERATIONS` an asynchronous trigger >> > >> > This would be similar to what we have for task reconciliation. The >> master >> > would return a >> > 202 Accepted, and then send >> > *one or more `UPDATE_OPERATION_STATUS` events*based on its state for an >> > implicit reconciliation, or >> > *forward the request to some agent*for an explicit reconciliation. In >> other >> > words, this call plays the role of a trigger of the >> > operation status updates. This approach is the simplest in terms of the >> > implementation, >> > but the trade-off is that the framework needs to live with (1). >> > >> > >> > So far we haven't discussed much about (2) for operation >> reconciliation, so >> > let's also briefly talk >> > about it. Potentially (2) can be addressed by making the agent *actively >> > push * >> > *operation statusupdates to the framework when an LRP is resubscribed*, >> so >> > the framework won't need to do >> > periodic operation reconciliation. If we do this in the future, it would >> > also be more aligned with >> > proposal II or III. >> > >> > So the question again: is it worth the complexity to keep >> > `RECONCILE_OPERATIONS` >> > synchronous? I'd like to hear the opinions from the community so we can >> > drive towards a better >> > API design! >> > >> > Best, >> > Chun-Hung >> > >> > >> > -- >> > Benno Evers >> > Software Engineer, Mesosphere >> >> > > -- > James DeFelice > 585.241.9488 (voice) > 650.649.6071 (fax) >