[ https://issues.apache.org/jira/browse/BEAM-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090667#comment-17090667 ]
Kamil Wasilewski commented on BEAM-6081:
----------------------------------------

We have a Jenkins job running periodically that cancels stale Dataflow jobs (older than 3 hours): https://builds.apache.org/job/beam_CancelStaleDataflowJobs/

I think we can consider it done.

> Create "Dataflow Reaper" infrastructure to periodically clean up stuck
> Dataflow jobs
> ------------------------------------------------------------------------------------
>
>                 Key: BEAM-6081
>                 URL: https://issues.apache.org/jira/browse/BEAM-6081
>             Project: Beam
>          Issue Type: New Feature
>          Components: build-system, testing
>            Reporter: Scott Wegner
>            Assignee: Alan Myrvold
>            Priority: Minor
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Our Jenkins infrastructure continuously runs many Dataflow jobs as part of
> pre- and post-commit tests. These are scheduled against our shared
> {{apache-beam-testing}} project, which has some amount of GCP quota for these
> jobs.
>
> Some bugs can cause Dataflow jobs to get stuck and hang indefinitely. This
> causes many test jobs to stack up, which eats up our GCP quota and then
> causes all subsequent jobs to fail for quota issues. For an example, see
> [[BEAM-6080]|https://issues.apache.org/jira/browse/BEAM-6080].
>
> We should harden the Dataflow runner and test framework to prevent Dataflow
> jobs getting stuck indefinitely, but in reality: bugs happen.
>
> We should add some "reaper" process to periodically query for long-running
> jobs on our Dataflow project and cancel them. This would be fairly
> straight-forward using the [Dataflow REST
> API|https://cloud.google.com/dataflow/docs/reference/rest/], and scheduled on
> Jenkins.
>
> If we build such a mechanism, we should also document the imposed policy
> (i.e. the threshold for "long running jobs"), and perhaps some mechanism for
> opting out. For example, performance benchmarking jobs might be long-running
> by design.
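
For illustration only, the selection step of such a reaper might look roughly like the sketch below. It is not the code behind the Jenkins job mentioned above. It assumes the job list has already been fetched as dictionaries carrying the Dataflow job fields `id`, `name`, `currentState` and `createTime`; the function names and the `perf-` opt-out prefix are hypothetical, and only the 3-hour threshold comes from the comment.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy values: the 3-hour threshold is taken from the comment,
# the opt-out name prefix is purely hypothetical.
MAX_AGE = timedelta(hours=3)
OPT_OUT_PREFIXES = ("perf-",)


def parse_create_time(value):
    """Parse a UTC timestamp such as '2020-04-23T08:00:00.123456Z'.

    Fractional seconds are dropped, which is precise enough for a
    threshold measured in hours.
    """
    seconds_part = value.rstrip("Z").split(".")[0]
    parsed = datetime.strptime(seconds_part, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def select_stale_jobs(jobs, now, max_age=MAX_AGE,
                      opt_out_prefixes=OPT_OUT_PREFIXES):
    """Return IDs of running jobs older than max_age that did not opt out."""
    stale = []
    for job in jobs:
        # Only jobs that are still running can be stuck.
        if job.get("currentState") != "JOB_STATE_RUNNING":
            continue
        # Hypothetical opt-out: skip jobs whose name has a reserved prefix.
        if job.get("name", "").startswith(tuple(opt_out_prefixes)):
            continue
        if now - parse_create_time(job["createTime"]) > max_age:
            stale.append(job["id"])
    return stale
```

Each returned ID would then be cancelled through the Dataflow API, for example by updating the job with a requested state of `JOB_STATE_CANCELLED`. That call is left out here so the sketch stays free of credentials and network access.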