[ https://issues.apache.org/jira/browse/BEAM-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Liu updated BEAM-5108:
---------------------------
Description:
Recently, a few Python streaming pipelines in the Dataflow apache-beam-testing
project ran for more than 5 days. This looks like a leak from the Jenkins job
that runs e2e integration tests.
The test framework has a pipeline resource cleanup that applies to all
integration tests, which is defined in
[TestDataflowRunner|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py#L67].
However, the cancellation may fail in special cases, such as the following (from
[this Jenkins
run|https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/5636/consoleFull]):
{quote}
Workflow modification failed. Causes: (c53cc746f7bc7f49): Operation cancel not
allowed for job 2018-08-01_13_10_24-5019826606522054507. Job is not yet ready
for canceling. Please retry in a few minutes.
{quote}
Two possible approaches to improve this:
1. Add a retry to the framework's cancellation.
2. Instead of waiting only until the pipeline is in the RUNNING state
([here|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py#L57]),
wait longer to make sure the worker pool has started successfully.
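Approach 1 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the TestDataflowRunner code: {{cancel_with_retry}}, {{CancelNotReadyError}} and all parameter names are hypothetical, and {{cancel_fn}} stands in for whatever call the framework uses to cancel the job. The real code would need to recognise the "Job is not yet ready for canceling" failure and map it to a retryable error.

```python
import time


class CancelNotReadyError(Exception):
    """Hypothetical stand-in for the 'not yet ready for canceling' failure."""


def cancel_with_retry(cancel_fn, max_attempts=5, initial_delay_secs=30.0,
                      backoff=2.0, sleep=time.sleep):
    """Call cancel_fn until it succeeds or max_attempts is exhausted.

    cancel_fn is any zero-argument callable that raises CancelNotReadyError
    while the job cannot be cancelled yet. Returns the number of attempts used.
    """
    delay = initial_delay_secs
    for attempt in range(1, max_attempts + 1):
        try:
            cancel_fn()
            return attempt
        except CancelNotReadyError:
            if attempt == max_attempts:
                # Give up and re-raise so the leak is visible in the test log.
                raise
            sleep(delay)
            delay *= backoff  # wait longer before each further attempt
```

The {{sleep}} parameter is injectable so the retry loop can be unit-tested without real waiting. The default delays are placeholders; the quoted error only says to retry "in a few minutes".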
was:
Recently, a few Python streaming pipelines in the Dataflow apache-beam-testing
project ran for more than 5 days. This looks like a leak from the Jenkins job
that runs e2e integration tests.
The test framework has a pipeline resource cleanup that applies to all
integration tests, which is defined in
[TestDataflowRunner|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py#L67].
However, the cancellation may fail in special cases, such as the following (from
[this Jenkins
run|https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/5636/consoleFull]):
{quote}
Workflow modification failed. Causes: (c53cc746f7bc7f49): Operation cancel not
allowed for job 2018-08-01_13_10_24-5019826606522054507. Job is not yet ready
for canceling. Please retry in a few minutes.
{quote}
Two possible approaches to improve the test infra:
1. Add a retry to the framework's cancellation.
2. Instead of waiting only until the pipeline is in the RUNNING state
([here|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py#L57]),
wait longer to make sure the worker pool has started successfully.
> Improve Python test framework to prevent streaming pipeline leaks
> -----------------------------------------------------------------
>
> Key: BEAM-5108
> URL: https://issues.apache.org/jira/browse/BEAM-5108
> Project: Beam
> Issue Type: Task
> Components: testing
> Reporter: Mark Liu
> Priority: Major