[ 
https://issues.apache.org/jira/browse/TEZ-4518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773448#comment-17773448
 ] 

Mudit Sharma commented on TEZ-4518:
-----------------------------------

[~rajesh.balamohan] we evaluated running a watcher service that reads spill 
counters and kills apps, and found some concerns with that approach:
 # The service needs to pull task-level counters instead of overall app-level 
counters. In most of our cases we have thousands of tasks, so polling all of 
those tasks will become a bottleneck.
 # I went through the code, and apart from the sorters, the only other place 
where I saw the spill count being incremented was UnorderedPartitionedKVWriter, 
during the merge operation, as you suggested. Going by the code, it also looks 
like sort spills will outnumber any other spills; please correct me if that is 
wrong. So if we can do this for sort tasks, it will at least give us task-level 
monitoring from Tez, while we already do monitoring based on overall app 
counters via our watcher service.
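
To make the task-level idea concrete, here is a minimal sketch of the kind of 
check a sorter could run each time it writes a spill file. The class name, the 
method names and the "0 disables the check" convention are assumptions made 
for illustration only; they are not taken from Tez or from the PR, and the 
sketch throws an unchecked exception purely for brevity.

```java
/**
 * Hypothetical per-task guard: counts spills and fails the task once a
 * configured limit is exceeded. None of these names exist in Tez.
 */
final class SpillLimitGuard {
    // Assumed convention: a limit <= 0 means the check is disabled.
    private final int maxSpills;
    private int numSpills;

    SpillLimitGuard(int maxSpills) {
        this.maxSpills = maxSpills;
    }

    /** Call once per spill file written; throws when the limit is exceeded. */
    void onSpill() {
        numSpills++;
        if (maxSpills > 0 && numSpills > maxSpills) {
            throw new IllegalStateException(
                "Spill limit exceeded: " + numSpills
                    + " spills, limit is " + maxSpills);
        }
    }

    int getNumSpills() {
        return numSpills;
    }
}
```

Because the check runs inside the task at the point where the spill happens, 
nothing has to poll counters across thousands of tasks.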

Please let us know your thoughts on this.

> Limit number of spill files getting created
> -------------------------------------------
>
>                 Key: TEZ-4518
>                 URL: https://issues.apache.org/jira/browse/TEZ-4518
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Mudit Sharma
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Hi,
>  
> We have been facing issues where the disks on many of our cluster nodes fill 
> up because some rogue applications create a lot of spill data.
> We want to fail the app if more than a threshold number of spill files are 
> written.
> Please let us know if any such capability is supported.
>  
> If the capability is not there, we propose supporting it via a config. We 
> have added a PR for this: https://github.com/apache/tez/pull/312. Please let 
> us know your thoughts on it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
