[ https://issues.apache.org/jira/browse/TEZ-4518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780167#comment-17780167 ]
Mudit Sharma edited comment on TEZ-4518 at 10/27/23 4:47 AM:
-------------------------------------------------------------

[~ayushtkn], thanks for the comments.

Our requirement is this: some buggy Tez apps have containers that spill a lot of data to the disks of the cluster nodes, and those nodes go down, thereby impacting all other apps running on them. We just want to kill those rogue containers so that they don't cause issues for other apps running on the same node.

[~rajesh.balamohan] suggested app-level monitoring using counters. We have started implementing that and it is working fine, but it does not cover every case. Some apps have a much higher overall spill that is well distributed across tasks, and we do not want to label them as rogue. Other apps have a very low overall spill but a few rogue containers with a very high spill skew.

Container-level monitoring from an external watcher does not scale, as we have 1000s of tasks in an app and 1000s of apps running at a time.

So, as I also discussed with [~rajesh.balamohan] above, since in most cases the spill ratios are the same and most of the spill was happening in the sorter, this approach can control the rogue disk-spill tasks.

I am open to any other approach that might be beneficial here if this one does not look right. If there is any other way to kill rogue disk-spill tasks in Tez, that would also help.

> Limit number of spill files getting created
> -------------------------------------------
>
>                 Key: TEZ-4518
>                 URL: https://issues.apache.org/jira/browse/TEZ-4518
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Mudit Sharma
>            Priority: Critical
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Hi,
>
> We have been facing issues where many of our cluster node disks fill up
> because some rogue applications create a lot of spill data.
> We want to fail the app if more than a threshold number of spill files is
> written.
> Please let us know if any such capability is supported.
>
> If the capability is not there, we propose supporting it via a config.
> We have added a PR for this: https://github.com/apache/tez/pull/312.
> Please let us know your thoughts on it.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)