A quick update: James and I think the name `max_completion_time` is a bit better than `max_duration`. Semantic should remain the same.
On Mon, Mar 26, 2018 at 9:52 AM, Zhitao Li <[email protected]> wrote: > Hi Benjamin, > > James and I did some quick search about some existing systems. We can dig > deep into their semantic. > > Kubernetes has a feature called activeDeadlineSeconds > <https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/>, > although it seems to be total scheduling time rather than container run > time (which makes sense since K8s itself is end to end scheduler). > > The BMC server Automation system has something similar to above called > JOB_TIMEOUT > <https://docs.bmc.com/docs/ServerAutomation/86/using/managing-jobs/defining-timeouts-for-jobs> > . > > YARN/Hadoop defined a couple of configurations suffixed with timeout > (`mapreduce.task.timeout` and related ones in this doc > <https://hadoop.apache.org/docs/r2.4.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml>) > although they also seem to model some aspects of "health check". There is > also some research work about deadline scheduler > <https://www.researchgate.net/publication/267752723_Hadoop_Scheduler_with_Deadline_Constraint> > in Hadoop but I have not realized whether the work is translated to open > source implementation. > > Wrapping a `timeout` command is definitely one possibility, but it seems a > bit hacky to me and also lacked proper reporting and tracking. If we could > support this feature w/o too much complexity I think it's still attractive. > > Please let me know your opinion. Thanks. > > On Fri, Mar 23, 2018 at 3:33 PM, Benjamin Mahler <[email protected]> > wrote: > >> In the interest of doing our due diligence, have you studied any prior >> art? >> >> For example, I was surprised to notice that htcondor doesn't really >> provide >> this as a first class thing: >> https://lists.cs.wisc.edu/archive/htcondor-users/2006- >> November/msg00024.shtml >> >> I didn't see it in any other systems I looked at either, with people >> suggesting wrapping commands with the 'timeout' command. I suspect most >> systems have the user do this on their own with a simple timeout wrapper >> script? >> >> On Fri, Mar 23, 2018 at 2:21 PM, Zhitao Li <[email protected]> wrote: >> >> > Hi everyone, >> > >> > I'd like to do an API review for MESOS-8725 >> > <https://issues.apache.org/jira/browse/MESOS-8725>. We are adding an >> > optional `max_duration` to `TaskInfo` field. If a task does not >> terminate >> > within this duration, built-in executors will kill the task with a new >> > reason `REASON_MAX_DURATION_REACHED`. >> > >> > Proof of concept patch: >> > https://reviews.apache.org/r/66258/ >> > >> > Reference implementation in command executor: >> > https://reviews.apache.org/r/66259/ >> > >> > A design choice we made is to make this relative duration rather than an >> > absolute timestamp of deadline. Our rationales: >> > >> > - Cluster could suffer from clock skews, so same absolute deadline >> would >> > result in inconsistent behavior; >> > - Framework can just trivially translate its own clock as source of >> > truth to translate absolute deadline to current time + max_duration. >> > >> > Please let me know what you think. Thanks. >> > >> > -- >> > Cheers, >> > >> > Zhitao Li >> > >> > > > > -- > Cheers, > > Zhitao Li > -- Cheers, Zhitao Li
