[ 
https://issues.apache.org/jira/browse/TEZ-4518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780167#comment-17780167
 ] 

Mudit Sharma edited comment on TEZ-4518 at 10/27/23 4:47 AM:
-------------------------------------------------------------

[~ayushtkn], thanks for the comments. Our requirement is simply this: there 
are some buggy Tez apps where, in some of the containers, a lot of data is 
spilled to the nodes of the cluster and the nodes go down, thereby impacting 
all other apps running on those nodes. We just want to kill those rogue 
containers so that they don't cause issues for other apps running on the 
same node.

 

[~rajesh.balamohan] suggested app-level monitoring using counters, which we 
have started implementing, and it is working fine. However, there could be 
some apps where the overall spill is much higher but well distributed across 
tasks, so we do not want to label them as rogue. On the other hand, there 
could be some apps where the overall spill is very low but a few rogue 
containers have a very high spill skew.
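
To make the skew point concrete, here is a minimal sketch. It is not Tez 
code and not what the PR does. The class name, method names, skew factor and 
byte floor are all hypothetical and only for illustration. It flags a task 
as rogue when its spill is far above the median spill of its app, so an app 
with a large but even spill is not flagged, while a small app with a single 
skewed task is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Tez: flags tasks whose spill is far above the app's median. */
public class SpillSkewCheck {

    /** Upper median of the per-task spilled bytes (0 for an empty input). */
    static long median(Map<String, Long> spilledBytesByTask) {
        long[] sorted = spilledBytesByTask.values().stream()
                .mapToLong(Long::longValue).sorted().toArray();
        return sorted.length == 0 ? 0L : sorted[sorted.length / 2];
    }

    /**
     * Returns the tasks whose spill exceeds both an absolute floor (minBytes)
     * and skewFactor times the median spill across all tasks of the app.
     */
    static List<String> rogueTasks(Map<String, Long> spilledBytesByTask,
                                   double skewFactor, long minBytes) {
        long median = median(spilledBytesByTask);
        List<String> rogue = new ArrayList<>();
        for (Map.Entry<String, Long> e : spilledBytesByTask.entrySet()) {
            long spilled = e.getValue();
            if (spilled > minBytes && spilled > skewFactor * median) {
                rogue.add(e.getKey());
            }
        }
        return rogue;
    }

    public static void main(String[] args) {
        final long GB = 1L << 30;

        // High overall spill (200 GB) but evenly distributed: nothing is flagged.
        Map<String, Long> evenApp = new TreeMap<>();
        for (int i = 0; i < 4; i++) {
            evenApp.put("task_" + i, 50 * GB);
        }

        // Low overall spill (43 GB) but one task holds almost all of it.
        Map<String, Long> skewedApp = new TreeMap<>();
        skewedApp.put("task_0", 1 * GB);
        skewedApp.put("task_1", 1 * GB);
        skewedApp.put("task_2", 1 * GB);
        skewedApp.put("task_3", 40 * GB);

        System.out.println(rogueTasks(evenApp, 10.0, 5 * GB));   // prints []
        System.out.println(rogueTasks(skewedApp, 10.0, 5 * GB)); // prints [task_3]
    }
}
```

An app-level total would rank the first app as worse (200 GB vs 43 GB), even 
though only the second one contains a task we would want to kill.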

 

And container-level monitoring from an external watcher does not scale, as 
we have 1000s of tasks in an app and 1000s of apps running at a time.

 

So, as I also discussed with [~rajesh.balamohan] above, since in most cases 
the spill ratios are the same and most of the spill happens in the sorter, 
this can control the rogue disk-spill tasks. I am open to any other 
approaches that might be beneficial here if this does not look quite right.

Or if there is any other way I can kill rogue disk-spill tasks in Tez, that 
would also help.


> Limit number of spill files getting created
> -------------------------------------------
>
>                 Key: TEZ-4518
>                 URL: https://issues.apache.org/jira/browse/TEZ-4518
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Mudit Sharma
>            Priority: Critical
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Hi,
>  
> We have been facing issues where the disks on many of our cluster nodes 
> fill up because some rogue applications create a lot of spill data.
> We want to fail the app if more than a threshold number of spill files are 
> written.
> Please let us know if any such capability is supported.
>  
> If the capability is not there, we propose supporting it via a config. We 
> have added a PR for the same: https://github.com/apache/tez/pull/312. 
> Please let us know your thoughts on it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)