Re: API review: max_duration on TaskInfo

Zhitao Li Mon, 26 Mar 2018 09:52:58 -0700

Hi Benjamin,

James and I did a quick search of some existing systems. We can dig deeper
into their semantics if needed.

Kubernetes has a feature called activeDeadlineSeconds
<https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/>,
although it seems to bound total scheduling time rather than container run
time (which makes sense, since K8s itself is an end-to-end scheduler).

BMC Server Automation has something similar called JOB_TIMEOUT
<https://docs.bmc.com/docs/ServerAutomation/86/using/managing-jobs/defining-timeouts-for-jobs>.

YARN/Hadoop defines a couple of configuration properties suffixed with
`timeout` (`mapreduce.task.timeout` and related ones in this doc
<https://hadoop.apache.org/docs/r2.4.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml>),
although they also seem to model some aspects of a "health check". There is
also some research work on a deadline scheduler
<https://www.researchgate.net/publication/267752723_Hadoop_Scheduler_with_Deadline_Constraint>
for Hadoop, but I have not found out whether that work has made it into an
open source implementation.

Wrapping the command with `timeout` is definitely one possibility, but it
seems a bit hacky to me and also lacks proper reporting and tracking. If we
can support this feature without too much complexity, I think it is still
attractive.
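
To make the reporting gap concrete, here is a minimal Python sketch of what
such a wrapper amounts to. The function name and the status strings are
hypothetical and are not part of Mesos or of either patch. The wrapper can
tell a timeout from a normal exit, but only the wrapper knows it; the
scheduler sees whatever exit code the wrapper chooses to return, not a
dedicated reason.

```python
import subprocess
import sys


def run_with_max_duration(argv, max_duration_secs):
    """Run argv and report whether it finished or hit the time limit.

    Returns a (status, exit_code) tuple. Both status strings are
    illustrative names chosen for this sketch only.
    """
    try:
        # subprocess.run kills the child and waits for it if the timeout
        # expires, then raises TimeoutExpired.
        completed = subprocess.run(argv, timeout=max_duration_secs)
    except subprocess.TimeoutExpired:
        return ("max_duration_reached", None)
    return ("finished", completed.returncode)


if __name__ == "__main__":
    # A child that sleeps well past the limit is killed by the wrapper.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_max_duration(slow, 0.5))
```

With a first-class `max_duration`, the same distinction would be carried in
the task status instead of being inferred from an exit code.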

Please let me know your opinion. Thanks.

On Fri, Mar 23, 2018 at 3:33 PM, Benjamin Mahler <bmah...@apache.org> wrote:

> In the interest of doing our due diligence, have you studied any prior art?
>
> For example, I was surprised to notice that htcondor doesn't really provide
> this as a first class thing:
> https://lists.cs.wisc.edu/archive/htcondor-users/2006-November/msg00024.shtml
>
> I didn't see it in any other systems I looked at either, with people
> suggesting wrapping commands with the 'timeout' command. I suspect most
> systems have the user do this on their own with a simple timeout wrapper
> script?
>
> On Fri, Mar 23, 2018 at 2:21 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > I'd like to do an API review for MESOS-8725
> > <https://issues.apache.org/jira/browse/MESOS-8725>. We are adding an
> > optional `max_duration` field to `TaskInfo`. If a task does not terminate
> > within this duration, built-in executors will kill the task with a new
> > reason `REASON_MAX_DURATION_REACHED`.
> >
> > Proof of concept patch:
> > https://reviews.apache.org/r/66258/
> >
> > Reference implementation in command executor:
> > https://reviews.apache.org/r/66259/
> >
> > A design choice we made is to make this a relative duration rather than an
> > absolute deadline timestamp. Our rationale:
> >
> >    - The cluster could suffer from clock skew, so the same absolute deadline
> >    would result in inconsistent behavior;
> >    - A framework can trivially use its own clock as the source of truth and
> >    translate an absolute deadline into current time + max_duration.
> >
> > Please let me know what you think. Thanks.
> >
> > --
> > Cheers,
> >
> > Zhitao Li
> >
>

-- 
Cheers,

Zhitao Li
