[ https://issues.apache.org/jira/browse/BEAM-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kamil Wasilewski closed BEAM-6081.
----------------------------------
    Fix Version/s: Not applicable
       Resolution: Fixed

> Create "Dataflow Reaper" infrastructure to periodically clean up stuck Dataflow jobs
> ------------------------------------------------------------------------------------
>
>                 Key: BEAM-6081
>                 URL: https://issues.apache.org/jira/browse/BEAM-6081
>             Project: Beam
>          Issue Type: New Feature
>          Components: build-system, testing
>            Reporter: Scott Wegner
>            Assignee: Alan Myrvold
>            Priority: Minor
>             Fix For: Not applicable
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Our Jenkins infrastructure continuously runs many Dataflow jobs as part of
> pre- and post-commit tests. These are scheduled against our shared
> {{apache-beam-testing}} project, which has some amount of GCP quota for these
> jobs.
> Some bugs can cause Dataflow jobs to get stuck and hang indefinitely. This
> causes many test jobs to stack up, which eats up our GCP quota and then
> causes all subsequent jobs to fail for quota issues. For an example, see
> [[BEAM-6080]|https://issues.apache.org/jira/browse/BEAM-6080].
> We should harden the Dataflow runner and test framework to prevent Dataflow
> jobs getting stuck indefinitely, but in reality: bugs happen.
> We should add some "reaper" process to periodically query for long-running
> jobs on our Dataflow project and cancel them. This would be fairly
> straight-forward using the [Dataflow REST
> API|https://cloud.google.com/dataflow/docs/reference/rest/], and scheduled on
> Jenkins.
> If we build such a mechanism, we should also document the imposed policy
> (i.e. the threshold for "long running jobs"), and perhaps some mechanism for
> opting out. For example, performance benchmarking jobs might be long-running
> by design.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
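
For illustration only, the selection step of the "reaper" described in the issue could be sketched roughly as follows. This is not the implementation that resolved BEAM-6081. It assumes job records shaped like the Dataflow REST API's Job resource (`id`, `name`, `currentState`, `createTime` as a UTC timestamp ending in `Z`). The 4-hour threshold, the `perf-` name prefix used for opting out, and the function names are hypothetical. Actually cancelling the selected jobs is left out; with the REST API that would presumably be a job update requesting the state `JOB_STATE_CANCELLED`.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=4)       # hypothetical threshold for "long-running"
OPT_OUT_PREFIXES = ("perf-",)      # hypothetical opt-out: job-name prefixes to skip


def parse_create_time(value):
    """Parse a UTC timestamp such as '2018-11-15T18:24:03.123456Z'."""
    # Assumes a trailing 'Z'; fractional seconds are dropped because
    # second precision is enough for an age check measured in hours.
    seconds = value.rstrip("Z").split(".")[0]
    parsed = datetime.strptime(seconds, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def select_jobs_to_cancel(jobs, now, max_age=MAX_AGE,
                          opt_out_prefixes=OPT_OUT_PREFIXES):
    """Return the ids of running jobs older than max_age that did not opt out."""
    stale = []
    for job in jobs:
        # Only jobs that are still running can be stuck.
        if job.get("currentState") != "JOB_STATE_RUNNING":
            continue
        # Jobs that are long-running by design are skipped.
        if job.get("name", "").startswith(tuple(opt_out_prefixes)):
            continue
        if now - parse_create_time(job["createTime"]) > max_age:
            stale.append(job["id"])
    return stale


if __name__ == "__main__":
    # Made-up job records for demonstration.
    now = datetime(2018, 11, 16, 0, 0, tzinfo=timezone.utc)
    jobs = [
        {"id": "a", "name": "postcommit-wordcount",
         "currentState": "JOB_STATE_RUNNING",
         "createTime": "2018-11-15T12:00:00.500Z"},
        {"id": "b", "name": "perf-benchmark",
         "currentState": "JOB_STATE_RUNNING",
         "createTime": "2018-11-14T00:00:00Z"},
    ]
    print(select_jobs_to_cancel(jobs, now))
```

Keeping the selection logic free of network calls means the policy (threshold and opt-out rule) can be documented and unit-tested separately from whatever Jenkins job lists and cancels the jobs.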