Re: Adding the limited resource to TaskStatus messages

2017-10-10 Thread James Peach

> On Oct 9, 2017, at 7:15 PM, Wil Yegelwel  wrote:
> 
> Is it correct to say that the limited resource field is *only* meant to 
> provide machine readable information about what resources limits were 
> exceeded?

Yes,

> If so, does it make sense to provide richer reporting fields for all failure 
> reasons? I imagine other failure reasons could benefit from being able to 
> report details of the failure that are machine readable.

Some other reasons already have their own structured information, eg. the 
TASK_UNREACHABLE state populates the `unreachable_time` field. I'm not planning 
to add structured information to any other failure reasons, but I'd support 
doing it if you have a specific suggestion.

> On Mon, Oct 9, 2017, 3:50 PM James Peach  wrote:
> 
> > On Oct 9, 2017, at 1:27 PM, Vinod Kone  wrote:
> >
> >> In the case that a task is killed because it violated a resource
> >> constraint (ie. the reason field is REASON_CONTAINER_LIMITATION,
> >> REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY),
> >> this field may be populated with the resource that triggered the
> >> limitation. This is intended to give better information to schedulers about
> >> task resource failures, in the expectation that it will help them bubble
> >> useful information up to the user or a monitoring system.
> >>
> >
> > Can you elaborate what schedulers are expected to do with this information?
> > Looking for some concrete use cases if you can.
> 
> There's no concrete use case here; it's just a matter of propagating 
> information we know in a structured way.
> 
> If we assume that the scheduler knows about some sort of monitoring system or 
> has a UI, we can present this to the user or a system that can take action on 
> it. The status quo is that the raw message string is dumped to logs, and has 
> to be manually interpreted.
> 
> Additionally, this can pave the way to getting rid of 
> REASON_CONTAINER_LIMITATION_DISK and REASON_CONTAINER_LIMITATION_MEMORY. All 
> you really need is REASON_CONTAINER_LIMITATION plus the resource information.
> 
> J
> 



Re: Adding the limited resource to TaskStatus messages

2017-10-09 Thread Wil Yegelwel
Is it correct to say that the limited resource field is *only* meant to
provide machine readable information about what resources limits were
exceeded?

If so, does it make sense to provide richer reporting fields for all
failure reasons? I imagine other failure reasons could benefit from being
able to report details of the failure that are machine readable.

On Mon, Oct 9, 2017, 3:50 PM James Peach  wrote:

>
> > On Oct 9, 2017, at 1:27 PM, Vinod Kone  wrote:
> >
> >> In the case that a task is killed because it violated a resource
> >> constraint (ie. the reason field is REASON_CONTAINER_LIMITATION,
> >> REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY),
> >> this field may be populated with the resource that triggered the
> >> limitation. This is intended to give better information to schedulers
> about
> >> task resource failures, in the expectation that it will help them bubble
> >> useful information up to the user or a monitoring system.
> >>
> >
> > Can you elaborate what schedulers are expected to do with this
> information?
> > Looking for some concrete use cases if you can.
>
> There's no concrete use case here; it's just a matter of propagating
> information we know in a structured way.
>
> If we assume that the scheduler knows about some sort of monitoring system
> or has a UI, we can present this to the user or a system that can take
> action on it. The status quo is that the raw message string is dumped to
> logs, and has to be manually interpreted.
>
> Additionally, this can pave the way to getting rid of
> REASON_CONTAINER_LIMITATION_DISK and REASON_CONTAINER_LIMITATION_MEMORY.
> All you really need is REASON_CONTAINER_LIMITATION plus the resource
> information.
>
> J
>
>


Re: Adding the limited resource to TaskStatus messages

2017-10-09 Thread James Peach

> On Oct 9, 2017, at 1:27 PM, Vinod Kone  wrote:
> 
>> In the case that a task is killed because it violated a resource
>> constraint (ie. the reason field is REASON_CONTAINER_LIMITATION,
>> REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY),
>> this field may be populated with the resource that triggered the
>> limitation. This is intended to give better information to schedulers about
>> task resource failures, in the expectation that it will help them bubble
>> useful information up to the user or a monitoring system.
>> 
> 
> Can you elaborate what schedulers are expected to do with this information?
> Looking for some concrete use cases if you can.

There's no concrete use case here; it's just a matter of propagating 
information we know in a structured way.

If we assume that the scheduler knows about some sort of monitoring system or 
has a UI, we can present this to the user or a system that can take action on 
it. The status quo is that the raw message string is dumped to logs, and has to 
be manually interpreted. 

Additionally, this can pave the way to getting rid of 
REASON_CONTAINER_LIMITATION_DISK and REASON_CONTAINER_LIMITATION_MEMORY. All 
you really need is REASON_CONTAINER_LIMITATION plus the resource information.

J



Re: Adding the limited resource to TaskStatus messages

2017-10-09 Thread Vinod Kone
> In the case that a task is killed because it violated a resource
> constraint (ie. the reason field is REASON_CONTAINER_LIMITATION,
> REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY),
> this field may be populated with the resource that triggered the
> limitation. This is intended to give better information to schedulers about
> task resource failures, in the expectation that it will help them bubble
> useful information up to the user or a monitoring system.
>

Can you elaborate what schedulers are expected to do with this information?
Looking for some concrete use cases if you can.


Re: Adding the limited resource to TaskStatus messages

2017-10-09 Thread Jie Yu
+1

On Mon, Oct 9, 2017 at 10:56 AM, James Peach  wrote:

> Hi all,
>
> In https://reviews.apache.org/r/62644/, I am proposing to add an optional
> Resources field to the TaskStatus message named `limited_resources`.
>
> In the case that a task is killed because it violated a resource
> constraint (ie. the reason field is REASON_CONTAINER_LIMITATION,
> REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY),
> this field may be populated with the resource that triggered the
> limitation. This is intended to give better information to schedulers about
> task resource failures, in the expectation that it will help them bubble
> useful information up to the user or a monitoring system.
>
> diff --git a/include/mesos/v1/mesos.proto b/include/mesos/v1/mesos.proto
> index d742adbbf..559d09e37 100644
> --- a/include/mesos/v1/mesos.proto
> +++ b/include/mesos/v1/mesos.proto
> @@ -2252,6 +2252,13 @@ message TaskStatus {
>// status updates for tasks running on agents that are unreachable
>// (e.g., partitioned away from the master).
>optional TimeInfo unreachable_time = 14;
> +
> +  // If the reason field indicates a container resource limitation,
> +  // this field contains the resource whose limits were violated.
> +  //
> +  // NOTE: 'Resources' is used here because the resource may span
> +  // multiple roles (e.g. `"mem(*):1;mem(role):2"`).
> +  repeated Resource limited_resources = 16;
>  }
>
>
>
> cheers,
> James
>
>
>


Re: Adding the limited resource to TaskStatus messages

2017-10-09 Thread Yan Xu
Does it make sense to wrap the resources in a `Limitation` message in case
we add new fields for it?

---
Jiang Yan Xu  | @xujyan 

On Mon, Oct 9, 2017 at 10:56 AM, James Peach  wrote:

> Hi all,
>
> In https://reviews.apache.org/r/62644/, I am proposing to add an optional
> Resources field to the TaskStatus message named `limited_resources`.
>
> In the case that a task is killed because it violated a resource
> constraint (ie. the reason field is REASON_CONTAINER_LIMITATION,
> REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY),
> this field may be populated with the resource that triggered the
> limitation. This is intended to give better information to schedulers about
> task resource failures, in the expectation that it will help them bubble
> useful information up to the user or a monitoring system.
>
> diff --git a/include/mesos/v1/mesos.proto b/include/mesos/v1/mesos.proto
> index d742adbbf..559d09e37 100644
> --- a/include/mesos/v1/mesos.proto
> +++ b/include/mesos/v1/mesos.proto
> @@ -2252,6 +2252,13 @@ message TaskStatus {
>// status updates for tasks running on agents that are unreachable
>// (e.g., partitioned away from the master).
>optional TimeInfo unreachable_time = 14;
> +
> +  // If the reason field indicates a container resource limitation,
> +  // this field contains the resource whose limits were violated.
> +  //
> +  // NOTE: 'Resources' is used here because the resource may span
> +  // multiple roles (e.g. `"mem(*):1;mem(role):2"`).
> +  repeated Resource limited_resources = 16;
>  }
>
>
>
> cheers,
> James
>
>
>


Adding the limited resource to TaskStatus messages

2017-10-09 Thread James Peach
Hi all,

In https://reviews.apache.org/r/62644/, I am proposing to add an optional 
Resources field to the TaskStatus message named `limited_resources`.

In the case that a task is killed because it violated a resource constraint 
(ie. the reason field is REASON_CONTAINER_LIMITATION, 
REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY), this 
field may be populated with the resource that triggered the limitation. This is 
intended to give better information to schedulers about task resource failures, 
in the expectation that it will help them bubble useful information up to the 
user or a monitoring system.

diff --git a/include/mesos/v1/mesos.proto b/include/mesos/v1/mesos.proto
index d742adbbf..559d09e37 100644
--- a/include/mesos/v1/mesos.proto
+++ b/include/mesos/v1/mesos.proto
@@ -2252,6 +2252,13 @@ message TaskStatus {
   // status updates for tasks running on agents that are unreachable
   // (e.g., partitioned away from the master).
   optional TimeInfo unreachable_time = 14;
+
+  // If the reason field indicates a container resource limitation,
+  // this field contains the resource whose limits were violated.
+  //
+  // NOTE: 'Resources' is used here because the resource may span
+  // multiple roles (e.g. `"mem(*):1;mem(role):2"`).
+  repeated Resource limited_resources = 16;
 }



cheers,
James