durgaprasadml opened a new pull request, #38753:
URL: https://github.com/apache/beam/pull/38753

   ## Summary
   
   This PR stabilizes the PostCommit Java ValidatesRunner Dataflow Streaming 
workflow, which is currently failing more than 50% of the time due to 
infrastructure contention, legacy streaming runner limitations, and overly 
aggressive streaming job cancellation behavior.
   
   Fixes #38710
   
   ---
   
   ## Root Causes Identified
   
   ### 1. Unbounded Parallelism / Resource Exhaustion
   
   The validatesRunner tasks were configured with:
   
   groovy id="n9b5m2" maxParallelForks Integer.MAX_VALUE 
   
   Combined with GitHub Actions max-workers: 12, this could launch up to 12 
concurrent Dataflow streaming jobs simultaneously.
   
   This frequently exhausted:
   - Compute Engine IP quotas
   - CPUs
   - concurrent Dataflow job quotas
   - self-hosted runner resources
   
   leading to worker startup starvation and test timeouts.
   
   ---
   
   ### 2. Legacy Streaming Worker Non-Termination
   
   The workflow previously used the legacy VM-based streaming execution path:
   
   bash id="ewl1pu" 
:runners:google-cloud-dataflow-java:validatesRunnerStreaming 
   
   Bounded streaming pipelines under the legacy runner often failed to 
terminate automatically, remaining in RUNNING state until the 15-minute timeout 
cancelled them.
   
   ---
   
   ### 3. Aggressive Failure Cancellation
   
   TestDataflowRunner immediately cancelled jobs upon encountering any 
JOB_MESSAGE_ERROR, even for transient worker/network issues that Dataflow could 
automatically recover from.
   
   This caused false-negative failures in CI.
   
   ---
   
   ## Changes Implemented
   
   ### Throttle validatesRunner concurrency
   
   Reduced validatesRunner concurrency to:
   
   groovy id="pvb0u9" maxParallelForks = 4 
   
   with support for overriding via:
   
   bash id="6a4s1z" -PmaxParallelForks=<n> 
   
   This reduces quota pressure and runner overload.
   
   ---
   
   ### Migrate workflow to Streaming Engine
   
   Updated the workflow to run:
   
   bash id="0kcg4q" 
:runners:google-cloud-dataflow-java:validatesRunnerStreamingEngine 
   
   Benefits:
   - faster startup
   - improved bounded-source termination
   - reduced infrastructure overhead
   - improved stability
   
   ---
   
   ### Add metrics-driven early termination
   
   Enhanced TestDataflowRunner to continuously poll:
   - PAssertSuccess
   - PAssertFailure
   
   during streaming execution.
   
   Behavior:
   - early cancel on assertion success
   - early cancel on assertion failure
   
   This reduces successful test runtime from ~15 minutes to ~2–3 minutes.
   
   ---
   
   ### Delay cancellation on transient worker errors
   
   Added a recovery window before cancelling jobs due to transient 
JOB_MESSAGE_ERROR entries, allowing Dataflow retries and self-healing to 
stabilize the pipeline.
   
   ---
   
   ### Add Gradle test retry support
   
   Integrated the org.gradle.test-retry plugin for CI integration tests to 
reduce transient infrastructure-related failures.
   
   ---
   
   ## Validation
   
   Added/updated tests covering:
   - streaming early-success termination
   - streaming early-failure termination
   - metrics polling behavior
   
   Verification command:
   
   bash id="j7bw3w" ./gradlew :runners:google-cloud-dataflow-java:test \   
--tests "org.apache.beam.runners.dataflow.TestDataflowRunnerTest" 
   
   Streaming validation command:
   
   bash id="o3llzt" ./gradlew 
:runners:google-cloud-dataflow-java:validatesRunnerStreamingEngine \   
-PtestFilter="org.apache.beam.sdk.transforms.GroupByKeyTest" 
   
   ---
   
   ## Expected Impact
   
   These changes are expected to:
   - significantly reduce CI flakiness
   - reduce GCP quota pressure
   - improve workflow runtime stability
   - shorten streaming test execution time
   - improve overall validatesRunner reliability


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to