dkranchii opened a new pull request, #55554:
URL: https://github.com/apache/spark/pull/55554

   ### What changes were proposed in this pull request?
   
   `StronglyConnectedComponents.run` now gives `sccWorkGraph` the same unpersist
lifecycle that `prevSccGraph` already provides for `sccGraph`. Each of the two
`sccWorkGraph` reassignments inside the inner `do/while` loop (the
`outerJoinVertices` step and the `subgraph` step) now materializes the new
generation via `vertices.count()` / `edges.count()` and only then unpersists
the previous generation, using the same idiom already in place for `sccGraph`.
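
The materialize-then-unpersist idiom described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `swapGeneration` is a hypothetical helper name, and the sketch assumes `spark-graphx` is on the classpath and that the default storage level is in use.

```scala
import org.apache.spark.graphx.Graph

// Hypothetical helper (not part of GraphX or of this PR): promote `next`
// to be the live generation and release the cache held by `prev`.
def swapGeneration[VD, ED](prev: Graph[VD, ED], next: Graph[VD, ED]): Graph[VD, ED] = {
  // Mark the new generation for caching at the default storage level.
  next.cache()
  // Force both RDDs to materialize while `prev` is still cached, so the
  // new generation is computed from cached blocks rather than from lineage.
  next.vertices.count()
  next.edges.count()
  // Only now drop the previous generation. The no-arg form is the one this
  // PR uses; it is non-blocking by default since SPARK-26771.
  prev.unpersist()
  next
}
```

Under these assumptions, each reassignment in the loop would take the shape `sccWorkGraph = swapGeneration(sccWorkGraph, newGraph)`, where `newGraph` stands for the result of the `outerJoinVertices` or `subgraph` step.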
   
   Tracking issue: https://github.com/apache/spark/issues/55553
   
   ### Why are the changes needed?
   
   The two `sccWorkGraph = ... .cache()` reassignments in the inner loop each 
leak a cached `Graph` generation per iteration. SPARK-16478 (2016) added the 
equivalent unpersist lifecycle for `sccGraph` but did not extend it to 
`sccWorkGraph`, leaving an asymmetric cache leak in the public SCC 
implementation:
   
   ```
   grep -n unpersist graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala
   # Only references prevSccGraph; no sccWorkGraph.unpersist() exists prior to this PR.
   ```
   
   On graphs that take many inner iterations to converge (large SCCs, high
diameter), the accumulated cached state evicts useful blocks under
`MEMORY_AND_DISK`, triggers recomputation cascades through the join lineage,
and ultimately causes executor memory pressure. Users see this as SCC slowing
down progressively, or as OOMs on long runs. `ContextCleaner` eventually
unpersists orphaned RDDs once their Scala references are garbage-collected, but
the timing is non-deterministic and lags the loop body, which is exactly the
failure mode the existing `prevSccGraph.unpersist()` was added to prevent.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Algorithm and output of `Graph.stronglyConnectedComponents` are 
unchanged. Only cache lifecycle is adjusted, mirroring the pattern already used 
for `sccGraph` in the same function.
   
   ### How was this patch tested?
   
   - Existing `StronglyConnectedComponentsSuite` continues to pass.
   - Manual verification on a synthetic graph requiring multiple inner 
iterations to converge: the Spark UI **Storage** tab confirms that the count of 
cached `Graph` RDDs remains bounded across iterations after the fix, whereas it 
grew monotonically before.
   - No new tests added — the change preserves observable behavior; correctness 
is covered by the existing suite, and the leak is resource-level (cached block 
count) rather than functional.
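
The Storage-tab observation above could also be approximated programmatically. The sketch below is an illustration under stated assumptions, not a test from this PR: `newlyPersistedAfterScc` is a hypothetical helper, the small graph is made-up input, and `SparkContext.getPersistentRDDs` only reports RDDs that the driver still has marked as persisted.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical helper: runs SCC on a tiny made-up graph and returns how many
// RDDs are still marked persistent that were not persistent before the run.
def newlyPersistedAfterScc(sc: SparkContext, numIter: Int): Int = {
  // A 3-cycle plus one tail vertex; purely illustrative input.
  val edges = sc.parallelize(Seq(
    Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(3L, 4L, 1)))
  val graph = Graph.fromEdges(edges, defaultValue = 0)

  // Snapshot the ids of RDDs that were already persisted.
  val before = sc.getPersistentRDDs.keySet.toSet

  val scc = graph.stronglyConnectedComponents(numIter)
  scc.vertices.count() // force the result to materialize

  // Count RDD ids persisted now that were not persisted before the run.
  sc.getPersistentRDDs.keys.count(id => !before.contains(id))
}
```

On a graph that needs many inner iterations, comparing this count with and without the fix is one way to look for growth; the absolute numbers will depend on the Spark version and the input.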
   
   ### Related references
   
   - GitHub issue: https://github.com/apache/spark/issues/55553 (tracking issue for this PR)
   - [SPARK-26771](https://issues.apache.org/jira/browse/SPARK-26771) — made 
`unpersist()` non-blocking by default; this PR uses the no-arg `unpersist()` 
form, consistent with that change.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
