Summary of IRC Meeting in #aurora at Mon Jul 13 18:01:44 2015

Attendees: thalin, dnorris, wickman, jcohen, wfarner, Yasumoto, rafik, rdelvalle, bbrazil, zmanji, dlester
- Preface
- 0.9.0 release update
- in-memory H2 store progress
- react.js experiment for scheduler UI
- Ubiquitous services (AURORA-1075)
- Resource allocation for cron jobs

IRC log follows:

## Preface ##
[Mon Jul 13 18:01:57 2015] <wfarner>: let's start with a roll call
[Mon Jul 13 18:01:59 2015] <wfarner>: here
[Mon Jul 13 18:02:08 2015] <jcohen>: afternoon!
[Mon Jul 13 18:02:09 2015] <dnorris>: here
[Mon Jul 13 18:02:15 2015] <rafik>: here
[Mon Jul 13 18:02:20 2015] <wickman>: ahoy
[Mon Jul 13 18:02:56 2015] <zmanji>: here
[Mon Jul 13 18:03:04 2015] <dlester>: here
[Mon Jul 13 18:03:34 2015] <Yasumoto>: howdy
[Mon Jul 13 18:03:55 2015] <thalin>: here
[Mon Jul 13 18:05:07 2015] <rdelvalle>: here

## 0.9.0 release update ##
[Mon Jul 13 18:05:40 2015] <wfarner>: AURORA-1078
[Mon Jul 13 18:06:13 2015] <wfarner>: we are in roughly the same state as last week w.r.t. readiness to cut a release candidate
[Mon Jul 13 18:06:45 2015] <wfarner>: kts added AURORA-1352 as a blocker, and IIRC is in-flight on fixing, but is currently on vacation
[Mon Jul 13 18:07:32 2015] <wfarner>: i assume the fix will be out for review tomorrow, completing all tickets to cut a release candidate

## in-memory H2 store progress ##
[Mon Jul 13 18:09:57 2015] <wfarner>: for those who haven't been paying close attention, we/i have been on an effort to migrate the scheduler's in-memory storage to use H2 and SQL
[Mon Jul 13 18:10:29 2015] <rafik>: What was the storage previously?
[Mon Jul 13 18:10:44 2015] <jcohen>: it currently uses the mesos replicated log
[Mon Jul 13 18:10:55 2015] <wfarner>: the in-memory layer has been hand-rolled, mostly hash maps
[Mon Jul 13 18:11:02 2015] <rafik>: K
[Mon Jul 13 18:11:18 2015] <jcohen>: oh, derp, memory storage ;)
[Mon Jul 13 18:11:48 2015] <wfarner>: there's more background in this thread from last year: http://mail-archives.apache.org/mod_mbox/aurora-dev/201404.mbox/%3ccagra8upkh81f3ecp+f7fe84+2m+yzountecbhnptaufwqft...@mail.gmail.com%3E
[Mon Jul 13 18:12:46 2015] <wfarner>: most of the functional changes are now complete, and i've done work to bring the performance up to suitable levels
[Mon Jul 13 18:13:11 2015] <wfarner>: the last bit of work is working out some concurrency kinks, e.g. AURORA-1395
[Mon Jul 13 18:13:42 2015] <bbrazil>: wfarner: it uses the -hostname for web redirects, and the local IP for cli tools
[Mon Jul 13 18:14:01 2015] <wfarner>: note that these issues don't apply to those using default scheduler command line settings - the most critical stores will only be switched to this system with a command line arg

## react.js experiment for scheduler UI ##
[Mon Jul 13 18:15:08 2015] <wfarner>: jcohen: the floor is yours
[Mon Jul 13 18:15:58 2015] <jcohen>: I've been doing some planning work on the future of the Scheduler UI. There's a lot that can be done to clean things up, both from a tech debt perspective (no tests) and from a usability perspective.
[Mon Jul 13 18:16:19 2015] <jcohen>: the first step in that process has been to evaluate alternatives to Angular.
[Mon Jul 13 18:16:36 2015] <jcohen>: As such I've been putting together a very simple proof of concept demo using React
[Mon Jul 13 18:17:01 2015] <jcohen>: if anyone has any experience w/ React and Angular and would like to discuss the pros/cons, feel free to let me know.
[Mon Jul 13 18:17:09 2015] <jcohen>: I can start a thread on dev@
[Mon Jul 13 18:17:25 2015] <rafik>: I'd be very interested in helping out with a React.js port
[Mon Jul 13 18:17:27 2015] <jcohen>: Otherwise, I hope to push the demo some time this week.
[Mon Jul 13 18:17:44 2015] <rafik>: I've found the UI to be rather limited and non-performant so far
[Mon Jul 13 18:18:02 2015] <jcohen>: I don't have a ton of React experience, so it's a bit of a learning curve, but overall I'm liking what I'm seeing so far.
[Mon Jul 13 18:18:30 2015] <jcohen>: The main concern is whether a full rewrite is really warranted to solve the underlying tech debt issues, or whether time is better spent simply improving the existing Angular app.
[Mon Jul 13 18:19:21 2015] <jcohen>: I'll send something out to dev@ when the demo is available, and we can have that discussion then.
[Mon Jul 13 18:20:02 2015] <jcohen>: ACTION <eom>
[Mon Jul 13 18:20:26 2015] <wfarner>: rafik: awesome! most committers are systems developers, so we've been historically scant for web dev chops. if you can help fill a void there it would be awesome!
[Mon Jul 13 18:20:43 2015] <jcohen>: +1
[Mon Jul 13 18:20:58 2015] <rafik>: One question regarding the UI: Is it a requirement that the UI currently be served by the Aurora scheduler?
[Mon Jul 13 18:21:11 2015] <jcohen>: That's another topic I've broached internally
[Mon Jul 13 18:21:20 2015] <rafik>: Or are you open to having the UI run a separate server using the Thrift (or future HTTP) api?
[Mon Jul 13 18:21:30 2015] <jcohen>: There's certainly a benefit to doing so, as it's one less component to deploy.
[Mon Jul 13 18:21:53 2015] <jcohen>: But being able to use, e.g., node.js to serve up the UI as an isomorphic app also has its benefits.
[Mon Jul 13 18:22:17 2015] <jcohen>: In any event, I don't think there are any sacred cows as far as the UI is concerned.
[Mon Jul 13 18:22:22 2015] <bbrazil>: the last time it came up the general consensus was that separate was fine for prototyping, but most people wanted it as part of the scheduler
[Mon Jul 13 18:22:28 2015] <wfarner>: rafik: it's definitely not a requirement. even today, that slice of the scheduler is really just asset serving + API calls, so it is technically splittable
[Mon Jul 13 18:22:48 2015] <rafik>: Maybe an in-between where you package the UI as a directory of assets but allow the directory to be overridden
[Mon Jul 13 18:22:53 2015] <jcohen>: one alternative might be to allow optionally serving up a separate UI from the default scheduler UI (or a command line flag to entirely suppress the scheduler serving a UI at all)
[Mon Jul 13 18:22:55 2015] <wfarner>: but +1 to fewer parts for default deployment
[Mon Jul 13 18:23:05 2015] <rafik>: Consul supports a `-ui-dir` argument to update/deploy the UI portions separately
[Mon Jul 13 18:23:34 2015] <rafik>: And I've found that to be quite useful in testing out changes
[Mon Jul 13 18:24:01 2015] <bbrazil>: we do similarly for prometheus itself for development
[Mon Jul 13 18:24:02 2015] <rafik>: See "Self-hosted Dashboard": https://www.consul.io/intro/getting-started/ui.html
[Mon Jul 13 18:24:13 2015] <jcohen>: seems reasonable
[Mon Jul 13 18:24:42 2015] <wfarner>: ah, yes - i would like that as well. i did some work a while back to get us close to that - all assets now live in a single path on the classpath
[Mon Jul 13 18:25:02 2015] <zmanji>: I would also like that. It would allow cluster operators to make zone-specific UI changes if required
[Mon Jul 13 18:25:17 2015] <wfarner>: the only kink is that they don't live update, but that shouldn't be too hard to resolve
[Mon Jul 13 18:26:34 2015] <wfarner>: sounds like there's continued interest here. i suggest that jcohen and rafik continue offline, and jcohen starts a dev@ thread shortly after
[Mon Jul 13 18:26:48 2015] <jcohen>: sounds good to me
[Mon Jul 13 18:26:59 2015] <rafik>: Same
[Mon Jul 13 18:27:24 2015] <wfarner>: last call for topics, i'll otherwise close in 3 mins

## Ubiquitous services (AURORA-1075) ##
[Mon Jul 13 18:28:26 2015] <wfarner>: rafik: floor is yours
[Mon Jul 13 18:29:10 2015] <rafik>: I'm wondering if there's been any update on the proposal that Anindya has opened
[Mon Jul 13 18:29:29 2015] <jcohen>: AURORA-1075
[Mon Jul 13 18:29:31 2015] <rafik>: Link to proposal here: https://docs.google.com/document/d/12hr6GnUZU3mc7xsWRzMi3nQILGB-3vyUxvbG-6YmvdE/edit
[Mon Jul 13 18:30:10 2015] <rafik>: We're interested in supporting this for our installation
[Mon Jul 13 18:30:26 2015] <rafik>: But don't want to invest too much time in exploring it if there's independent momentum
[Mon Jul 13 18:30:53 2015] <rafik>: Also not sure that we have resources that could contribute it yet, especially as it seems to require quite a few changes across existing pieces
[Mon Jul 13 18:31:06 2015] <wfarner>: i have not heard of anyone writing code related to that proposal
[Mon Jul 13 18:31:33 2015] <zmanji>: the proposal also has some (unresolved?) issues in the comments
[Mon Jul 13 18:32:10 2015] <wfarner>: i will take a pass through the document today to see if i can help drive any of those to resolution.
[Mon Jul 13 18:32:24 2015] <rafik>: Thanks wfarner, that would be great
[Mon Jul 13 18:32:41 2015] <rafik>: <eom>
[Mon Jul 13 18:32:46 2015] <rafik>: One more topic:

## Resource allocation for cron jobs ##
[Mon Jul 13 18:33:11 2015] <rafik>: This came up on Thursday evening, but we didn't seem to get a complete answer
[Mon Jul 13 18:33:23 2015] <rafik>: What's the story re: reserving quota for cron jobs?
[Mon Jul 13 18:33:47 2015] <rafik>: We have quite a few cron jobs that run very infrequently and are currently reserving useful quota
[Mon Jul 13 18:34:03 2015] <wickman>: i'm with rafik: why? imho i think quota should be deducted at runtime for all tasks, including those spawned by crons.
[Mon Jul 13 18:34:04 2015] <rafik>: Is there any interest in rethinking the quota reservation for cron jobs?
[Mon Jul 13 18:34:12 2015] <zmanji>: It used to not be the case
[Mon Jul 13 18:34:15 2015] <zmanji>: but maxim reverted it
[Mon Jul 13 18:34:21 2015] <wickman>: bring back PENDING: insufficient quota
[Mon Jul 13 18:34:34 2015] <wfarner>: that was never a thing
[Mon Jul 13 18:34:37 2015] <jcohen>: Cron jobs used to run regardless of quota, right?
[Mon Jul 13 18:34:50 2015] <wfarner>: quota checks have always been at job submission time
[Mon Jul 13 18:34:54 2015] <zmanji>: jcohen: yes and that was a hole because then a role could consume more production resources than its quota
[Mon Jul 13 18:34:59 2015] <jcohen>: right
[Mon Jul 13 18:35:01 2015] <rafik>: Right, but cron scheduling isn't job submission
[Mon Jul 13 18:35:12 2015] <rafik>: My understanding is that crons are scheduled and then the Aurora scheduler submits jobs for them later
[Mon Jul 13 18:35:23 2015] <rafik>: Shouldn't the quota check happen then? as if it were an ad-hoc job?
[Mon Jul 13 18:35:48 2015] <wfarner>: rafik: i would support that
[Mon Jul 13 18:36:15 2015] <wfarner>: though we don't really have a means to give feedback about that right now
[Mon Jul 13 18:36:16 2015] <rafik>: It sounds like that used to be existing behavior?
[Mon Jul 13 18:36:23 2015] <rafik>: Is it just a matter of reverting a change?
[Mon Jul 13 18:36:26 2015] <wfarner>: nope, that behavior never existed
[Mon Jul 13 18:36:48 2015] <rafik>: Re: feedback. You mean give feedback to operators that their jobs aren't running?
[Mon Jul 13 18:37:15 2015] <wfarner>: correct, we currently lack a "pending job" concept, only pending tasks
[Mon Jul 13 18:37:45 2015] <rafik>: Is that true? I see "Pending: Insufficient CPU", etc. for jobs all the time in the UI
[Mon Jul 13 18:37:53 2015] <bbrazil>: I'd argue that sort of feedback should be from your monitoring system in the first instance, though the UI should expose something about it
[Mon Jul 13 18:37:54 2015] <wfarner>: right - that's for tasks
[Mon Jul 13 18:38:26 2015] <rafik>: wfarner: not sure I understand the difference
[Mon Jul 13 18:38:30 2015] <wfarner>: say we were to launch a cron job and hold its tasks in PENDING due to insufficient quota. what does that mean for service tasks when they restart?
[Mon Jul 13 18:38:46 2015] <wfarner>: do service tasks also wait because a cron iteration is holding up resources?
[Mon Jul 13 18:39:13 2015] <rafik>: I would prefer they do, yes
[Mon Jul 13 18:39:19 2015] <bbrazil>: I'd expect quota to be at a job level, so tasks restarting/updating wouldn't be affected
[Mon Jul 13 18:39:20 2015] <rafik>: Assuming the crons are marked `production`, etc.
[Mon Jul 13 18:39:38 2015] <bbrazil>: a new job may not be accepted in that case, or updates changing resource usage
[Mon Jul 13 18:39:45 2015] <wickman>: wfarner: FIFO queue
[Mon Jul 13 18:39:48 2015] <wickman>: wfarner: of pending tasks
[Mon Jul 13 18:39:57 2015] <rafik>: This really depends on how people are using cron, but in our particular use case, we have cron jobs that are related to long-running services
[Mon Jul 13 18:40:11 2015] <rafik>: It's been brought up before, but some concept of job "groups" may actually go towards resolving this
[Mon Jul 13 18:40:23 2015] <rafik>: E.g. you can say my offline payments service has these 5 cron jobs that need to run
[Mon Jul 13 18:40:41 2015] <rafik>: In that case, Aurora could do something like reserve the maximum of the cron job resources
[Mon Jul 13 18:40:49 2015] <rafik>: And only allow one job to run at a time for instance
[Mon Jul 13 18:40:59 2015] <rafik>: Obviously not applicable in all circumstances
[Mon Jul 13 18:41:12 2015] <bbrazil>: I could see that for a dependency setup, not sure about the more general case you're proposing
[Mon Jul 13 18:41:14 2015] <wfarner>: yeah, and could make things even harder to reason about
[Mon Jul 13 18:41:18 2015] <rafik>: But some concept of pooling together crons so that they would share resources might be useful
[Mon Jul 13 18:41:47 2015] <wfarner>: rafik: i definitely agree that as a user i should be able to deliberately stagger my cron jobs to time-share quota
[Mon Jul 13 18:41:49 2015] <rafik>: Okay, perhaps better to table the job group discussion for now then
[Mon Jul 13 18:42:10 2015] <rafik>: Yeah, for background ~60% of our quota is reserved by cron jobs right now
[Mon Jul 13 18:42:18 2015] <rafik>: Most of which only run on the order of once a day, or once a week
[Mon Jul 13 18:42:47 2015] <rafik>: Aurora could make some attempt to reserve cron quota based on the job schedules
[Mon Jul 13 18:43:04 2015] <rafik>: I.e. recognize crons as being non-overlapping
[Mon Jul 13 18:43:13 2015] <rafik>: But that assumes some knowledge of their run-time, I suppose
[Mon Jul 13 18:43:21 2015] <wfarner>: right
[Mon Jul 13 18:43:38 2015] <bbrazil>: and gets more complicated if something else is using the resources it wants
[Mon Jul 13 18:43:46 2015] <rafik>: Right
[Mon Jul 13 18:43:53 2015] <wfarner>: IMHO a pending job submission is the easiest to think about from an operator and user perspective
[Mon Jul 13 18:44:04 2015] <bbrazil>: +1
[Mon Jul 13 18:44:05 2015] <wfarner>: (not to be confused with a pending task)
[Mon Jul 13 18:44:06 2015] <rafik>: +1
[Mon Jul 13 18:44:26 2015] <wickman>: really, why pending job?
[Mon Jul 13 18:44:46 2015] <wickman>: and not just PENDING: insufficient quota + a FIFO queue
[Mon Jul 13 18:45:05 2015] <wickman>: are you worried about reduced availability of flapping service tasks?
[Mon Jul 13 18:45:10 2015] <wickman>: that's what priority is for
[Mon Jul 13 18:45:23 2015] <wfarner>: wickman: good point w.r.t. priority
[Mon Jul 13 18:46:04 2015] <bbrazil>: I think that job admission control should be separate from task scheduling and restart handling
[Mon Jul 13 18:46:12 2015] <wfarner>: checking the code, the scheduler does appropriately use priority within a role
[Mon Jul 13 18:47:46 2015] <wfarner>: another behavior supporting bbrazil's statement - the current approach is immune to a cluster administrator fat-fingering a user's quota
[Mon Jul 13 18:48:24 2015] <wfarner>: if quota is considered during task scheduling, there's a larger potential impact
[Mon Jul 13 18:48:47 2015] <wfarner>: though this could also be argued for checking quota while kicking off a cron run
[Mon Jul 13 18:48:48 2015] <bbrazil>: it'd also require providing more quota to be safe to handle task scheduling
[Mon Jul 13 18:49:24 2015] <wickman>: i'm of the opposite belief -- we should go even further and evaluate quota for running tasks every time quota changes
[Mon Jul 13 18:49:33 2015] <wickman>: in other words, reducing quota can actually preempt tasks and make them go PENDING: insufficient quota
[Mon Jul 13 18:49:51 2015] <bbrazil>: you want the quota given to directly be what you want the user to be able to use; if the administrator has to add safety/fudge factors, that makes resources less manageable
[Mon Jul 13 18:51:00 2015] <wfarner>: we're running long on the meeting. rafik: i suggest you carry this discussion to dev@ so that we may continue offline
[Mon Jul 13 18:51:12 2015] <rafik>: Sure
[Mon Jul 13 18:51:26 2015] <wfarner>: closing up now, thanks for the interesting discussions, everyone!
[Mon Jul 13 18:51:29 2015] <wfarner>: ASFBot: meeting stop

Meeting ended at Mon Jul 13 18:51:29 2015
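The check-quota-at-trigger-time idea debated in the final topic can be sketched as follows. This is a minimal illustrative model only, not Aurora's actual implementation or API (all class and function names here are hypothetical): quota is deducted only while a cron run is active, and a run that cannot be admitted is held as a "pending job" rather than permanently reserving resources.

```python
from dataclasses import dataclass


@dataclass
class QuotaPool:
    """Hypothetical quota model: CPU is deducted when a cron run actually
    launches, not when the cron job is registered with the scheduler."""
    total_cpu: float
    used_cpu: float = 0.0

    def try_reserve(self, cpu: float) -> bool:
        # Admit the run only if quota is available right now.
        if self.used_cpu + cpu <= self.total_cpu:
            self.used_cpu += cpu
            return True
        return False

    def release(self, cpu: float) -> None:
        # Return quota when the run finishes.
        self.used_cpu = max(0.0, self.used_cpu - cpu)


@dataclass
class CronJob:
    name: str
    cpu: float


def trigger_cron(job: CronJob, pool: QuotaPool, pending: list) -> bool:
    """Check quota at trigger time, as if the run were an ad-hoc job.
    On insufficient quota, hold the run as a 'pending job' (the concept
    discussed above) so operators get feedback instead of silence."""
    if pool.try_reserve(job.cpu):
        return True  # the run launches immediately
    pending.append(job)  # surfaced to users/monitoring as a pending job
    return False
```

Under a model like this, quota tied up by infrequent crons (the ~60% figure rafik mentions) would only be held while a run is actually active; the open questions from the discussion (whether pending crons should delay restarting service tasks, and whether priority plus a FIFO queue is a better mechanism than a distinct pending-job state) are deliberately left out of the sketch.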