[
https://issues.apache.org/jira/browse/TEZ-4518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773448#comment-17773448
]
Mudit Sharma commented on TEZ-4518:
-----------------------------------
[~rajesh.balamohan] we evaluated running a watcher service that reads spill
counters and kills apps, and found two concerns with that approach:
# The service would need to pull task-level counters rather than overall
app-level counters. In most of our cases an app has thousands of tasks, so
polling all of those tasks would become a bottleneck.
# I went through the code, and apart from the sorters, the only other place
where I saw the spill count being incremented was UnorderedPartitionedKVWriter,
during the merge operation, as you suggested. From the code it also looks like
sort spills will outnumber any other kind of spill; please correct me if I am
wrong here. So if we can do this for sort tasks, it will at least give us
task-level monitoring from Tez, while we continue to do monitoring based on
overall app counters via our watcher service.
Please let us know your thoughts on this.
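To make the task-level idea concrete, here is a minimal sketch of the kind of check we mean. It is not the code from the PR and does not use any Tez API: the class SpillGuard, the config key name, and the choice of exception are all hypothetical, chosen only for illustration. It assumes the sorter would call onSpill() once for each spill file it writes.

```java
/**
 * Hypothetical per-task guard that fails a task once it has written
 * more spill files than a configured limit. Illustration only; this
 * class and the config key below are not part of Tez.
 */
final class SpillGuard {
    // Hypothetical config key, named here only for the error message.
    static final String MAX_SPILLS_KEY = "example.task.max.spill.files";

    private final int maxSpills; // a value <= 0 disables the check
    private int spills;

    SpillGuard(int maxSpills) {
        this.maxSpills = maxSpills;
    }

    /** Call once per spill file written; throws when the limit is exceeded. */
    void onSpill() {
        spills++;
        if (maxSpills > 0 && spills > maxSpills) {
            throw new IllegalStateException(
                "Task wrote " + spills + " spill files, limit is "
                    + maxSpills + " (" + MAX_SPILLS_KEY + ")");
        }
    }

    int spillCount() {
        return spills;
    }

    public static void main(String[] args) {
        SpillGuard guard = new SpillGuard(2);
        try {
            for (int i = 0; i < 3; i++) {
                guard.onSpill(); // the third call exceeds the limit of 2
            }
        } catch (IllegalStateException e) {
            System.out.println("failing task: " + e.getMessage());
        }
    }
}
```

Because the check runs inside the task on every spill, nothing has to poll thousands of tasks from outside, which is the bottleneck described in the first point.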
> Limit number of spill files getting created
> -------------------------------------------
>
> Key: TEZ-4518
> URL: https://issues.apache.org/jira/browse/TEZ-4518
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Mudit Sharma
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Hi,
>
> We have been facing issues where many of our cluster node disks fill up
> because some rogue applications create a lot of spill data.
> We want to fail the app if more than a threshold number of spill files are
> written.
> Please let us know if any such capability is supported.
>
> If the capability does not exist, we propose supporting it via a
> config. We have added a PR for this:
> https://github.com/apache/tez/pull/312
> Please let us know your thoughts on it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)