Summary of IRC Meeting in #aurora at Mon Jul 13 18:01:44 2015:

Attendees: thalin, dnorris, wickman, jcohen, wfarner, Yasumoto, rafik, 
rdelvalle, bbrazil, zmanji, dlester

- Preface
- 0.9.0 release update
- in-memory H2 store progress
- react.js experiment for scheduler UI
- Ubiquitous services (AURORA-1075)
- Resource allocation for cron jobs


IRC log follows:

## Preface ##
[Mon Jul 13 18:01:57 2015] <wfarner>: let's start with a roll call
[Mon Jul 13 18:01:59 2015] <wfarner>: here
[Mon Jul 13 18:02:08 2015] <jcohen>: afternoon!
[Mon Jul 13 18:02:09 2015] <dnorris>: here
[Mon Jul 13 18:02:15 2015] <rafik>: here
[Mon Jul 13 18:02:20 2015] <wickman>: ahoy
[Mon Jul 13 18:02:56 2015] <zmanji>: here
[Mon Jul 13 18:03:04 2015] <dlester>: here
[Mon Jul 13 18:03:34 2015] <Yasumoto>: howdy
[Mon Jul 13 18:03:55 2015] <thalin>: here
[Mon Jul 13 18:05:07 2015] <rdelvalle>: here
## 0.9.0 release update ##
[Mon Jul 13 18:05:40 2015] <wfarner>: AURORA-1078
[Mon Jul 13 18:06:13 2015] <wfarner>: we are in roughly the same state as last 
week w.r.t. readiness to cut a release candidate
[Mon Jul 13 18:06:45 2015] <wfarner>: kts added AURORA-1352 as a blocker, and 
IIRC is in-flight on fixing, but is currently on vacation
[Mon Jul 13 18:07:32 2015] <wfarner>: i assume the fix will be out for review 
tomorrow, completing all tickets to cut a release candidate
## in-memory H2 store progress ##
[Mon Jul 13 18:09:57 2015] <wfarner>: for those who haven't been paying close 
attention, we/i have been on an effort to migrate the scheduler's in-memory 
storage to use H2 and SQL
[Mon Jul 13 18:10:29 2015] <rafik>: What was the storage previously?
[Mon Jul 13 18:10:44 2015] <jcohen>: it currently uses the mesos replicated log
[Mon Jul 13 18:10:55 2015] <wfarner>: the in-memory layer has been hand-rolled, 
mostly hash maps
[Mon Jul 13 18:11:02 2015] <rafik>: K
[Mon Jul 13 18:11:18 2015] <jcohen>: oh, derp, memory storage ;)
[Mon Jul 13 18:11:48 2015] <wfarner>: there's more background in this thread 
from last year: 
http://mail-archives.apache.org/mod_mbox/aurora-dev/201404.mbox/%3ccagra8upkh81f3ecp+f7fe84+2m+yzountecbhnptaufwqft...@mail.gmail.com%3E
[Mon Jul 13 18:12:46 2015] <wfarner>: most of the functional changes are now 
complete, and i've done work to bring the performance up to suitable levels
[Mon Jul 13 18:13:11 2015] <wfarner>: the last bit of work is working out some 
concurrency kinks, e.g. AURORA-1395
[Mon Jul 13 18:13:42 2015] <bbrazil>: wfarner: it uses the -hostname for web 
redirects, and the local IP for cli tools
[Mon Jul 13 18:14:01 2015] <wfarner>: note that these issues don't apply to 
those using default scheduler command line settings - the most critical stores 
will only be switched to this system with a command line arg
## react.js experiment for scheduler UI ##
[Mon Jul 13 18:15:08 2015] <wfarner>: jcohen: the floor is yours
[Mon Jul 13 18:15:58 2015] <jcohen>: I’ve been doing some planning work on 
the future of the Scheduler UI. There’s a lot that can be done to clean 
things up, both from a tech debt perspective (no tests) and from a usability 
perspective.
[Mon Jul 13 18:16:19 2015] <jcohen>: the first step in that process has been to 
evaluate alternatives to Angular.
[Mon Jul 13 18:16:36 2015] <jcohen>: As such I’ve been putting together a 
very simple proof of concept demo using React
[Mon Jul 13 18:17:01 2015] <jcohen>: if anyone has any experience w/ React and 
Angular and would like to discuss the pros/cons, feel free to let me know.
[Mon Jul 13 18:17:09 2015] <jcohen>: I can start a thread on dev@
[Mon Jul 13 18:17:25 2015] <rafik>: I'd be very interested in helping out with 
a React.js port
[Mon Jul 13 18:17:27 2015] <jcohen>: Otherwise, I hope to push the demo some 
time this week.
[Mon Jul 13 18:17:44 2015] <rafik>: I've found the UI to be rather limited and 
non-performant so far
[Mon Jul 13 18:18:02 2015] <jcohen>: I don’t have a ton of React experience, 
so it’s a bit of a learning curve, but overall I’m liking what I’m seeing 
so far.
[Mon Jul 13 18:18:30 2015] <jcohen>: The main concern is whether a full rewrite 
is really warranted to solve the underlying tech debt issues, or whether time 
is better spent simply improving the existing Angular app.
[Mon Jul 13 18:19:21 2015] <jcohen>: I’ll send something out to dev@ when the 
demo is available, and we can have that discussion then.
[Mon Jul 13 18:20:02 2015] <jcohen>: ACTION <eom>
[Mon Jul 13 18:20:26 2015] <wfarner>: rafik: awesome!  most committers are 
systems developers, so we've been historically scant for web dev chops.  if you 
can help fill a void there it would be awesome!
[Mon Jul 13 18:20:43 2015] <jcohen>: +1
[Mon Jul 13 18:20:58 2015] <rafik>: One question regarding the UI: Is it a 
requirement that the UI currently be served by the Aurora scheduler?
[Mon Jul 13 18:21:11 2015] <jcohen>: That’s another topic I’ve broached 
internally
[Mon Jul 13 18:21:20 2015] <rafik>: Or are you open to having the UI run a 
separate server using the Thrift (or future HTTP) api?
[Mon Jul 13 18:21:30 2015] <jcohen>: There’s certainly a benefit to doing so, 
as it’s one less component to deploy.
[Mon Jul 13 18:21:53 2015] <jcohen>: But being able to use, e.g., node.js to 
serve up the UI as an isomorphic app also has its benefits.
[Mon Jul 13 18:22:17 2015] <jcohen>: In any event, I don’t think there are 
any sacred cows as far as the UI is concerned.
[Mon Jul 13 18:22:22 2015] <bbrazil>: the last time it came up the general 
consensus was that separate was fine for prototyping, but most people wanted it 
as part of the scheduler
[Mon Jul 13 18:22:28 2015] <wfarner>: rafik: it's definitely not a requirement. 
 even today, that slice of the scheduler is really  just asset serving + API 
calls, so it is technically splittable
[Mon Jul 13 18:22:48 2015] <rafik>: Maybe an in-between where you package the 
UI as a directory of assets but allow the directory to be overridden
[Mon Jul 13 18:22:53 2015] <jcohen>: one alternative might be to allow 
optionally serving up a separate UI from the default scheduler UI (or a command 
line flag to entirely suppress the scheduler serving a UI at all)
[Mon Jul 13 18:22:55 2015] <wfarner>: but +1 to fewer parts for default 
deployment
[Mon Jul 13 18:23:05 2015] <rafik>: Consul supports a `-ui-dir` argument to 
update/deploy the UI portions separately
[Mon Jul 13 18:23:34 2015] <rafik>: And I've found that to be quite useful in 
testing out changes
[Mon Jul 13 18:24:01 2015] <bbrazil>: we do similarly for prometheus itself for 
development
[Mon Jul 13 18:24:02 2015] <rafik>: See "Self-hosted Dashboard": 
https://www.consul.io/intro/getting-started/ui.html
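[For context, the Consul pattern rafik references looks roughly like the 
sketch below, based on the "Self-hosted Dashboard" docs linked above. The 
Aurora flag shown is hypothetical; no such option exists today.]

```shell
# Consul's -ui-dir flag points the agent at a locally unpacked UI build,
# so operators can swap or iterate on UI assets without rebuilding the binary.
consul agent -server -ui-dir=/opt/consul/ui -data-dir=/tmp/consul

# A hypothetical Aurora analogue (NOT an existing flag) might look like:
#   aurora-scheduler -ui_asset_dir=/opt/aurora/ui ...
# overriding the assets currently bundled on the scheduler's classpath.
```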
[Mon Jul 13 18:24:13 2015] <jcohen>: seems reasonable
[Mon Jul 13 18:24:42 2015] <wfarner>: ah, yes - i would like that as well.  i 
did some work a while back to get us close to that - all assets now live in a 
single path on the classpath
[Mon Jul 13 18:25:02 2015] <zmanji>: I also would like that as well. It would 
allow for cluster operators to make zone-specific UI changes if required
[Mon Jul 13 18:25:17 2015] <wfarner>: the only kink is that they don't live 
update, but that shouldn't be too hard to resolve
[Mon Jul 13 18:26:34 2015] <wfarner>: sounds like there's continued interest 
here.  i suggest that jcohen and rafik continue offline, and jcohen starts a 
dev@ thread shortly after
[Mon Jul 13 18:26:48 2015] <jcohen>: sounds good to me
[Mon Jul 13 18:26:59 2015] <rafik>: Same
[Mon Jul 13 18:27:24 2015] <wfarner>: last call for topics, i'll otherwise 
close in 3 mins
## Ubiquitous services (AURORA-1075) ##
[Mon Jul 13 18:28:26 2015] <wfarner>: rafik: floor is yours
[Mon Jul 13 18:29:10 2015] <rafik>: I'm wondering if there's been any update on 
the proposal that Anindya has opened
[Mon Jul 13 18:29:29 2015] <jcohen>: AURORA-1075
[Mon Jul 13 18:29:31 2015] <rafik>: Link to proposal here: 
https://docs.google.com/document/d/12hr6GnUZU3mc7xsWRzMi3nQILGB-3vyUxvbG-6YmvdE/edit
[Mon Jul 13 18:30:10 2015] <rafik>: We're interested in supporting this for our 
installation
[Mon Jul 13 18:30:26 2015] <rafik>: But don't want to invest too much time in 
exploring it if there's independent momentum
[Mon Jul 13 18:30:53 2015] <rafik>: Also not sure that we have resources that 
could contribute it yet, especially as it seems to require quite a few changes 
across existing pieces
[Mon Jul 13 18:31:06 2015] <wfarner>: i have not heard of anyone writing code 
related to that proposal
[Mon Jul 13 18:31:33 2015] <zmanji>: the proposal also has some (unresolved?) 
issues in the comments
[Mon Jul 13 18:32:10 2015] <wfarner>: i will take a pass through the document 
today to see if i can help drive any of those to resolution.
[Mon Jul 13 18:32:24 2015] <rafik>: Thanks wfarner, that would be great
[Mon Jul 13 18:32:41 2015] <rafik>: <eom>
[Mon Jul 13 18:32:46 2015] <rafik>: One more topic:
## Resource allocation for cron jobs ##
[Mon Jul 13 18:33:11 2015] <rafik>: This came up on Thursday evening, but we 
didn't seem to get a complete answer
[Mon Jul 13 18:33:23 2015] <rafik>: What's the story re: reserving quota for 
cron jobs?
[Mon Jul 13 18:33:47 2015] <rafik>: We have quite a few cron jobs that run very 
infrequently and are currently reserving useful quota
[Mon Jul 13 18:34:03 2015] <wickman>: i'm with rafik: why?  imho i think quota 
should be deducted at runtime for all tasks, including those spawned by crons.
[Mon Jul 13 18:34:04 2015] <rafik>: Is there any interest in rethinking the 
quota reservation for cron jobs?
[Mon Jul 13 18:34:12 2015] <zmanji>: It used to not be the case
[Mon Jul 13 18:34:15 2015] <zmanji>: but maxim reverted it
[Mon Jul 13 18:34:21 2015] <wickman>: bring back PENDING: insufficient quota
[Mon Jul 13 18:34:34 2015] <wfarner>: that was never a thing
[Mon Jul 13 18:34:37 2015] <jcohen>: Cron jobs used to run regardless of quota, 
right?
[Mon Jul 13 18:34:50 2015] <wfarner>: quota checks have always been at job 
submission time
[Mon Jul 13 18:34:54 2015] <zmanji>: jcohen: yes and that was a hole because 
then a role could consume more production resources than its quota
[Mon Jul 13 18:34:59 2015] <jcohen>: right
[Mon Jul 13 18:35:01 2015] <rafik>: Right, but cron scheduling isn't job 
submission
[Mon Jul 13 18:35:12 2015] <rafik>: My understanding is that crons are 
scheduled and then the Aurora scheduler submits jobs for them later
[Mon Jul 13 18:35:23 2015] <rafik>: Shouldn't the quota check happen then? as 
if it were an ad-hoc job?
[Mon Jul 13 18:35:48 2015] <wfarner>: rafik: i would support that
[Mon Jul 13 18:36:15 2015] <wfarner>: though we don't really have a means to 
give feedback about that right now
[Mon Jul 13 18:36:16 2015] <rafik>: It sounds like that used to be existing 
behavior?
[Mon Jul 13 18:36:23 2015] <rafik>: Is it just a matter of reverting a change?
[Mon Jul 13 18:36:26 2015] <wfarner>: nope, that behavior never existed
[Mon Jul 13 18:36:48 2015] <rafik>: Re: feedback. You mean give feedback to 
operators that their jobs aren't running?
[Mon Jul 13 18:37:15 2015] <wfarner>: correct, we currently lack a "pending 
job" concept, only pending tasks
[Mon Jul 13 18:37:45 2015] <rafik>: Is that true? I see "Pending: Insufficient 
CPU", etc. for jobs all the time in the UI
[Mon Jul 13 18:37:53 2015] <bbrazil>: I'd argue that sort of feedback should be 
from your monitoring system in the first instance, though the UI should expose 
something about it
[Mon Jul 13 18:37:54 2015] <wfarner>: right - that's for tasks
[Mon Jul 13 18:38:26 2015] <rafik>: wfarner: not sure I understand the 
difference
[Mon Jul 13 18:38:30 2015] <wfarner>: say we were to launch cron jobs and hold 
its tasks in PENDING due to insufficient quota.  what does that mean for 
service tasks when they restart?
[Mon Jul 13 18:38:46 2015] <wfarner>: do service tasks also wait because a cron 
iteration is holding up resources?
[Mon Jul 13 18:39:13 2015] <rafik>: I would prefer they do, yes
[Mon Jul 13 18:39:19 2015] <bbrazil>: I'd expect quota to be at a job level, so 
tasks restarting/updating wouldn't be affected
[Mon Jul 13 18:39:20 2015] <rafik>: Assuming the crons are marked `production`, 
etc.
[Mon Jul 13 18:39:38 2015] <bbrazil>: a new job may not be accepted in that 
case, or updates changing resource usage
[Mon Jul 13 18:39:45 2015] <wickman>: wfarner: FIFO queue
[Mon Jul 13 18:39:48 2015] <wickman>: wfarner: of pending tasks
[Mon Jul 13 18:39:57 2015] <rafik>: This really depends on how people are using 
cron, but in our particular use case, we have cron jobs that are related to 
long-running services
[Mon Jul 13 18:40:11 2015] <rafik>: It's been brought up before, but some 
concept of job "groups" may actually go towards resolving this
[Mon Jul 13 18:40:23 2015] <rafik>: E.g. you can say my offline payments 
service has these 5 cron jobs that need to run
[Mon Jul 13 18:40:41 2015] <rafik>: In that case, Aurora could do something 
like reserve the maximum of the cron job resources
[Mon Jul 13 18:40:49 2015] <rafik>: And only allow one job to run at a time for 
instance
[Mon Jul 13 18:40:59 2015] <rafik>: Obviously not applicable in all 
circumstances
[Mon Jul 13 18:41:12 2015] <bbrazil>: I could see that for a dependency setup, 
not sure about the more general case you're proposing
[Mon Jul 13 18:41:14 2015] <wfarner>: yeah, and could make things even harder 
to reason about
[Mon Jul 13 18:41:18 2015] <rafik>: But some concept of pooling together crons 
so that they would share resources might be useful
[Mon Jul 13 18:41:47 2015] <wfarner>: rafik: i definitely agree that as a user 
i should be able to deliberately stagger my cron jobs to time-share quota
[Mon Jul 13 18:41:49 2015] <rafik>: Okay, perhaps better to table the job group 
discussion for now then
[Mon Jul 13 18:42:10 2015] <rafik>: Yeah, for background ~60% of our quota is 
reserved by cron jobs right now
[Mon Jul 13 18:42:18 2015] <rafik>: Most of which only run on the order of once 
a day, or once a week
[Mon Jul 13 18:42:47 2015] <rafik>: Aurora could make some attempt to reserve 
cron based on the job schedules
[Mon Jul 13 18:43:04 2015] <rafik>: I.e. recognize crons as being 
non-overlapping
[Mon Jul 13 18:43:13 2015] <rafik>: But that assumes some knowledge of their 
run-time, I suppose
[Mon Jul 13 18:43:21 2015] <wfarner>: right
[Mon Jul 13 18:43:38 2015] <bbrazil>: and gets more complicated if something 
else is using the resources it wants
[Mon Jul 13 18:43:46 2015] <rafik>: Right
[Mon Jul 13 18:43:53 2015] <wfarner>: IMHO a pending job submission is the 
easiest to think about from an operator and user perspective
[Mon Jul 13 18:44:04 2015] <bbrazil>: +1
[Mon Jul 13 18:44:05 2015] <wfarner>: (not to be confused with a pending task)
[Mon Jul 13 18:44:06 2015] <rafik>: +1
[Mon Jul 13 18:44:26 2015] <wickman>: really, why pending job?
[Mon Jul 13 18:44:46 2015] <wickman>: and not just PENDING: insufficient quota 
+ a FIFO queue
[Mon Jul 13 18:45:05 2015] <wickman>: are you worried about reduced 
availability of flapping service tasks?
[Mon Jul 13 18:45:10 2015] <wickman>: that's what priority is for
[Mon Jul 13 18:45:23 2015] <wfarner>: wickman: good point w.r.t. priority
[Mon Jul 13 18:46:04 2015] <bbrazil>: I think that job admission control should 
be separate from task scheduling and restart handling
[Mon Jul 13 18:46:12 2015] <wfarner>: checking the code, the scheduler does 
appropriately use priority within a role
[Mon Jul 13 18:47:46 2015] <wfarner>: another behavior supporting bbrazil's 
statement - the current approach is immune to a cluster administrator 
fat-fingering a user's quota
[Mon Jul 13 18:48:24 2015] <wfarner>: if quota is considered during task 
scheduling, there's a larger potential impact
[Mon Jul 13 18:48:47 2015] <wfarner>: though this could also be argued for 
checking quota while kicking off a cron run
[Mon Jul 13 18:48:48 2015] <bbrazil>: it'd also require providing more quota to 
be safe to handle task scheduling
[Mon Jul 13 18:49:24 2015] <wickman>: i'm of the opposite belief -- we should 
go even further and evaluate quota for running tasks every time quota changes
[Mon Jul 13 18:49:33 2015] <wickman>: in other words, reducing quota can 
actually preempt tasks and make them go PENDING: insufficient quota
[Mon Jul 13 18:49:51 2015] <bbrazil>: you want the quota given to directly be 
what you want the user to be able to use, if the administrator has to add 
safety/fudge factors that makes resource less manageable
[Mon Jul 13 18:51:00 2015] <wfarner>: we're running long on the meeting.  
rafik: i suggest you carry this discussion to dev@ so that we may continue 
offline
[Mon Jul 13 18:51:12 2015] <rafik>: Sure
[Mon Jul 13 18:51:26 2015] <wfarner>: closing up now, thanks for the 
interesting discussions, everyone!
[Mon Jul 13 18:51:29 2015] <wfarner>: ASFBot: meeting stop


Meeting ended at Mon Jul 13 18:51:29 2015
