Maxim and I discussed offline after the Aurora dev sync today and we think the best way forward is improving the `getJobUpdateDetails` RPC to take a `JobUpdateQuery`.
I'll be filing a ticket and moving forward on this. On Fri, Sep 2, 2016 at 3:36 PM, Maxim Khutornenko <ma...@apache.org> wrote: > The pulseJobUpdate RPC was added as an entirely new feature. This > proposal suggests a read-only "convenience" method to pull data that > is already available via existing means. > > As for "taking down the scheduler", this argument is moot. It's > possible to DDOS scheduler via existing RPCs today and the suggested > alternative has the same inherent risk of crafting an unscoped query > that will pull all updates from the store. Also, unless there are a > LOT of those queries (read dos-attack that OOMs scheduler) it's > unlikely that the scheduler will sustain any damage as read-only > queries don't acquire any global or table locks. > > On Fri, Sep 2, 2016 at 3:24 PM, Zameer Manji <zma...@apache.org> wrote: > > I'm not convinced by your argument about adding a read only RPC that's > not > > covered by the traditional integration test is going to cause bit rot. We > > already have an RPC that is not covered by the traditional integration > test > > `pulseJobUpdate` and it hasn't bit rotted AFAIK. > > > > Further some members of the community have been writing alternative > clients > > around our API and I think adding this RPC will better support their use > > cases. `JobUpdate` is a first class struct in our API and I think it > makes > > sense to expose a query interface for it. > > > > If integration tests are your primary concern, I can also add an > > integration test for this. > > > > Regarding adding `JobUpdateQuery` to `getJobUpdateDetails`, I'm no longer > > convinced that it's the best way to go because of the risk it contains. > > It's all to easy to craft a query that can pull a lot of data that can > take > > down the scheduler. In my case, I sometimes see it being useful for > getting > > all of the `JobUpdates` of a role (completed and active), but that risks > > pulling a lot of unnecessary data. > > > > I will also add that since `JobUpdate` is a first class struct in our > > storage and APIs, this RPC contains very minimal code. I think it's > highly > > unlikely that it will bit rot. > > > > On Fri, Sep 2, 2016 at 2:52 PM, Maxim Khutornenko <ma...@apache.org> > wrote: > > > >> As I mentioned in Slack, I am ok with the new RPC as long as there is > >> a use for it elsewhere in the client or UI. Adding a read-only RPC > >> that isn't going to be called by our traditional integration test > >> clients sets a fertile ground for bit rot. > >> > >> I am actually warming up to your original proposal of adding > >> JobUpdateQuery into the existing getJobUpdateDetails RPC. While it may > >> be more expensive to pull multiple updates, we don't necessarily risk > >> much after we migrated to MVStore on the H2 side. There are no table > >> locks acquired and the only downside would be pulling events along > >> with what you need. Provided the query is narrowly scoped, that should > >> deliver acceptable performance. > >> > >> On Thu, Sep 1, 2016 at 2:24 PM, Zameer Manji <zma...@apache.org> wrote: > >> > Hey, > >> > > >> > I've noticed a hole in our current API which makes it difficult to > write > >> > external clients and other tooling around fetching the state of > updates. > >> > > >> > Currently, to fetch updates we are given two RPCs: > >> > ```` > >> > /** Gets job update summaries. */ > >> > Response getJobUpdateSummaries(1: JobUpdateQuery jobUpdateQuery) > >> > > >> > /** Gets job update details. */ > >> > Response getJobUpdateDetails(1: JobUpdateKey key) > >> > > >> > ```` > >> > > >> > The `getJobUpdateSummaries` RPC is not scoped to a single update and > >> > returns a > >> > set of `JobUpdateSummary` structs. The struct is defined: > >> > ```` > >> > /** Summary of the job update including job key, user and current > state. > >> */ > >> > struct JobUpdateSummary { > >> > /** Unique identifier for the update. */ > >> > 5: JobUpdateKey key > >> > > >> > /** User initiated an update. */ > >> > 3: string user > >> > > >> > /** Current job update state. */ > >> > 4: JobUpdateState state > >> > } > >> > ```` > >> > > >> > The `getJobUpdateDetails` RPC is scoped to a single update and returns > >> the > >> > following struct: > >> > > >> > ```` > >> > struct JobUpdateDetails { > >> > /** Update definition. */ > >> > 1: JobUpdate update > >> > > >> > /** History for this update. */ > >> > 2: list<JobUpdateEvent> updateEvents > >> > > >> > /** History for the individual instances updated. */ > >> > 3: list<JobInstanceUpdateEvent> instanceEvents > >> > } > >> > > >> > ```` > >> > > >> > Maxim mentioned to me yesterday that this RPC is scoped to a single > >> update > >> > because assembling the `instanceEvents` can be extremely expensive. A > >> query > >> > that > >> > could span more than a single update risks taking down the scheduler > in a > >> > large > >> > cluster. > >> > > >> > > >> > The problem I discovered is that there is no batch API to get the > >> > inexpensive > >> > information inside the `JobUpdate` struct. For reference this struct > >> > contains: > >> > > >> > ```` > >> > /** Full definition of the job update. */ > >> > struct JobUpdate { > >> > /** Update summary. */ > >> > 1: JobUpdateSummary summary > >> > > >> > /** Update configuration. */ > >> > 2: JobUpdateInstructions instructions > >> > } > >> > ```` > >> > > >> > Consumers are forced to make several `getJobUpdateDetails` calls to > get > >> > multiple > >> > `JobUpdate` structs. Since the `JobUpdate` struct is not expensive to > >> > assemble, > >> > I'm proposing a new RPC that will allow consumers to get several > >> `JobUpdate` > >> > structs in a single call. > >> > > >> > ```` > >> > /** Gets job updates. */ > >> > Response getJobUpdates(1: JobUpdateQuery jobUpdateQuery) > >> > ```` > >> > > >> > If there are no objections, I will file tickets and put up a patch to > >> > implement > >> > this. > >> > > >> > -- > >> > Zameer Manji > >> >