[ 
https://issues.apache.org/jira/browse/BEAM-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Wasilewski closed BEAM-6081.
----------------------------------
    Fix Version/s: Not applicable
       Resolution: Fixed

> Create "Dataflow Reaper" infrastructure to periodically clean up stuck 
> Dataflow jobs
> ------------------------------------------------------------------------------------
>
>                 Key: BEAM-6081
>                 URL: https://issues.apache.org/jira/browse/BEAM-6081
>             Project: Beam
>          Issue Type: New Feature
>          Components: build-system, testing
>            Reporter: Scott Wegner
>            Assignee: Alan Myrvold
>            Priority: Minor
>             Fix For: Not applicable
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Our Jenkins infrastructure continuously runs many Dataflow jobs as part of 
> pre- and post-commit tests. These are scheduled against our shared 
> {{apache-beam-testing}} project, which has some amount of GCP quota for these 
> jobs.
> Some bugs can cause Dataflow jobs to get stuck and hang indefinitely. Stuck 
> test jobs then pile up and consume our GCP quota, which causes all 
> subsequent jobs to fail with quota errors. For an example, see 
> [[BEAM-6080]|https://issues.apache.org/jira/browse/BEAM-6080].
> We should harden the Dataflow runner and test framework to prevent Dataflow 
> jobs from getting stuck indefinitely, but in reality, bugs happen.
> We should add a "reaper" process that periodically queries our Dataflow 
> project for long-running jobs and cancels them. This would be fairly 
> straightforward to implement with the [Dataflow REST 
> API|https://cloud.google.com/dataflow/docs/reference/rest/] and to schedule 
> on Jenkins.
> If we build such a mechanism, we should also document the imposed policy 
> (i.e. the threshold for "long-running" jobs), and perhaps provide some 
> mechanism for opting out. For example, performance benchmarking jobs might 
> be long-running by design.
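
The selection step of the reaper described above could be sketched roughly as follows. This is a minimal illustration and not the implementation that was merged for this issue. It assumes the job records are dictionaries shaped like the Job resource returned by the Dataflow REST API (fields `id`, `currentState`, `createTime`, `labels`), with `createTime` given in UTC with a trailing `Z`. The 4-hour threshold, the `reaper-opt-out` label, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy values; the issue does not specify them.
MAX_AGE = timedelta(hours=4)
OPT_OUT_LABEL = "reaper-opt-out"


def parse_create_time(ts):
    """Parse a UTC timestamp such as '2018-11-15T18:23:45.123Z'.

    Assumes a trailing 'Z'; fractional seconds are dropped.
    """
    base = ts.rstrip("Z").split(".")[0]
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def find_stale_jobs(jobs, now, max_age=MAX_AGE):
    """Return the ids of running jobs older than max_age.

    Jobs carrying the (hypothetical) opt-out label are skipped, so that
    long-running benchmarks can be excluded from reaping.
    """
    stale = []
    for job in jobs:
        if job.get("currentState") != "JOB_STATE_RUNNING":
            continue
        if OPT_OUT_LABEL in job.get("labels", {}):
            continue
        if now - parse_create_time(job["createTime"]) > max_age:
            stale.append(job["id"])
    return stale
```

A scheduled Jenkins job could pass the listed jobs of the project to `find_stale_jobs` and then request cancellation for each returned id, which in the REST API is done by updating the job with `requestedState` set to `JOB_STATE_CANCELLED`. Keeping the selection logic free of network calls makes the policy easy to test in isolation.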



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
