[ https://issues.apache.org/jira/browse/OOZIE-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139486#comment-14139486 ]
Purshotam Shah commented on OOZIE-1813:
---------------------------------------

{quote}
Tests failed: 1
. Tests errors: 0
. The patch failed the following testcases:
. testMessage_withMixedStatus(org.apache.oozie.command.coord.TestAbandonedCoordChecker)
{quote}

Not sure why this failed in pre-commit. I ran the whole test case multiple times on my local box with no failures.
Uploaded the patch again to re-trigger the pre-commit build.

> Add service to report/kill rogue bundles and coordinator jobs
> -------------------------------------------------------------
>
>                 Key: OOZIE-1813
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1813
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Purshotam Shah
>            Assignee: Purshotam Shah
>             Fix For: trunk
>
>         Attachments: OOZIE-1813-Amendment-V1.patch,
> OOZIE-1813-Amendment-V1.patch, OOZIE-1813-V2.patch, OOZIE-1813-V3.patch,
> OOZIE-1813-V4.patch, OOZIE-1813-V5.patch, OOZIE-1813-V6.patch,
> OOZIE-1813-V7.patch, OOZIE-1813-V8.patch
>
>
> People leave their test coordinator and bundle jobs running without ever
> killing them, and these jobs eat up resources heavily. We should have a
> service which periodically checks for abandoned coords and reports/kills them.
> We can add multiple criteria to this, such as the number of consecutive
> failed/timedout actions or the total number of failed/timedout actions.
> To start with, if the number of coord actions with failed/timedout status
> exceeds a defined value, the coord is considered to be rogue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)