dkranchii opened a new pull request, #55554: URL: https://github.com/apache/spark/pull/55554
### What changes were proposed in this pull request?

`StronglyConnectedComponents.run` mirrors the existing `prevSccGraph` unpersist lifecycle for `sccWorkGraph`. Each of the two `sccWorkGraph` reassignments inside the inner `do/while` loop (the `outerJoinVertices` step and the `subgraph` step) now materializes the new generation via `vertices.count()` / `edges.count()` and then unpersists the previous generation, using the same idiom already in place for `sccGraph`.

Tracking issue: https://github.com/apache/spark/issues/55553

### Why are the changes needed?

The two `sccWorkGraph = ... .cache()` reassignments in the inner loop each leak a cached `Graph` generation per iteration. SPARK-16478 (2016) added the equivalent unpersist lifecycle for `sccGraph` but did not extend it to `sccWorkGraph`, leaving an asymmetric cache leak in the public SCC implementation:

```
grep -n unpersist graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala
# Only references prevSccGraph; no sccWorkGraph.unpersist() exists prior to this PR.
```

On graphs that take many inner iterations to converge (large SCCs, high diameter), the accumulated cached state evicts useful blocks under `MEMORY_AND_DISK`, triggers recomputation cascades through the join lineage, and ultimately produces executor memory pressure that presents as "SCC slows down progressively" or OOMs on long runs. `ContextCleaner` eventually unpersists orphaned RDDs once Scala references are GC'd, but the timing is non-deterministic and lags the loop body, which is exactly the failure mode the existing `prevSccGraph.unpersist()` was added to prevent.

### Does this PR introduce _any_ user-facing change?

No. Algorithm and output of `Graph.stronglyConnectedComponents` are unchanged. Only cache lifecycle is adjusted, mirroring the pattern already used for `sccGraph` in the same function.

### How was this patch tested?

- Existing `StronglyConnectedComponentsSuite` continues to pass.
- Manual verification on a synthetic graph requiring multiple inner iterations to converge: the Spark UI **Storage** tab confirms that the count of cached `Graph` RDDs remains bounded across iterations after the fix, whereas it grew monotonically before.
- No new tests added: the change preserves observable behavior; correctness is covered by the existing suite, and the leak is resource-level (cached block count) rather than functional.

### Related references

- GitHub issue: https://github.com/apache/spark/issues/55553 (the tracking issue for this PR)
- [SPARK-26771](https://issues.apache.org/jira/browse/SPARK-26771): made `unpersist()` non-blocking by default; this PR uses the no-arg `unpersist()` form, consistent with that change.
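
### Illustrative sketch of the cache lifecycle

The "materialize the new generation, then unpersist the previous one" idiom described in this PR can be sketched roughly as follows. This is not the PR's diff: `CacheLifecycleSketch`, `advance`, and `iterate` are hypothetical names, and `mapVertices` stands in for the real `outerJoinVertices` and `subgraph` steps. It assumes Spark with GraphX on the classpath and a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object CacheLifecycleSketch {

  // Hypothetical helper: cache and materialize `next`, then release `prev`.
  // Counting vertices and edges forces both RDDs of the new generation to be
  // computed while the previous generation's cached blocks are still there.
  def advance[VD, ED](prev: Graph[VD, ED], next: Graph[VD, ED]): Graph[VD, ED] = {
    next.cache()
    next.vertices.count()
    next.edges.count()
    prev.unpersist() // no-arg form, as in the PR description
    next
  }

  // Hypothetical loop: each iteration derives a new graph generation and
  // swaps it in through `advance`, so older generations do not stay cached.
  def iterate(sc: SparkContext, iterations: Int): Graph[Int, Int] = {
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(3L, 1L, 0)))
    var work: Graph[Int, Int] = Graph.fromEdges(edges, defaultValue = 0).cache()
    for (_ <- 1 to iterations) {
      // Stand-in for the real per-iteration transformation.
      val next = work.mapVertices((_, v) => v + 1)
      work = advance(work, next)
    }
    work
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-lifecycle-sketch")
    val sc = new SparkContext(conf)
    try {
      val result = iterate(sc, iterations = 5)
      println(s"vertices: ${result.vertices.collect().sortBy(_._1).mkString(", ")}")
      println(s"persistent RDDs still registered: ${sc.getPersistentRDDs.size}")
    } finally {
      sc.stop()
    }
  }
}
```

Materializing before unpersisting matters because the new generation's lineage reads from the previous one. Dropping the old cache first would force a recomputation of that lineage instead of a read from cached blocks.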
