dkranchii opened a new issue, #55553: URL: https://github.com/apache/spark/issues/55553
## Summary

`GraphX.StronglyConnectedComponents.run` reassigns the cached `sccWorkGraph` twice per inner iteration without unpersisting the previous generation. The companion variable `sccGraph` *is* handled correctly via `prevSccGraph.unpersist()`, but the equivalent treatment for `sccWorkGraph` was never added, leaving an asymmetric, real cache leak in the public SCC implementation.

## Affected code

https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala#L45-L80

```scala
var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()
var prevSccGraph = sccGraph // tracks sccGraph generations
// no equivalent for sccWorkGraph

while (sccWorkGraph.numVertices > 0 && iter < numIter) {
  iter += 1
  do {
    numVertices = sccWorkGraph.numVertices
    sccWorkGraph = sccWorkGraph.outerJoinVertices(sccWorkGraph.outDegrees) { ... }
      .outerJoinVertices(sccWorkGraph.inDegrees) { ... }
      .cache() // (1) prior generation never unpersisted

    // ... derive finalVertices, update sccGraph (correctly tracked) ...

    sccWorkGraph = sccWorkGraph.subgraph(vpred = ...).cache() // (2) prior generation never unpersisted
  } while (sccWorkGraph.numVertices < numVertices)
}
```

`grep -n unpersist` on this file returns only references to `prevSccGraph.unpersist()`; there is no `sccWorkGraph.unpersist()` anywhere.

## Why it matters

- This is the canonical GraphX SCC implementation, surfaced publicly via `Graph[VD,ED].stronglyConnectedComponents(numIter)`.
- Each inner do/while iteration pins **two** additional `sccWorkGraph` generations in storage memory.
- On graphs that take many inner steps to converge (large SCCs, high diameter), the accumulated cached state evicts useful blocks under `MEMORY_AND_DISK`, triggers recomputation cascades through the join lineage, and ultimately causes executor memory pressure / OOMs that present as "SCC slows down progressively".
- `ContextCleaner` will eventually unpersist orphaned RDDs when the JVM GCs Scala references, but timing is non-deterministic and lags the loop body, which is exactly the failure mode the existing `prevSccGraph.unpersist()` was added to prevent.

## Scope

- [x] File touched: `graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala` only
- [x] No public API change
- [x] No algorithmic / output change
- [x] Mirrors the existing `prevSccGraph` lifecycle in the same function
- [x] Backportable to all maintained branches

## Verification

1. Run `Graph.stronglyConnectedComponents(numIter = 20)` on any graph that requires multiple inner iterations to converge.
2. Spark UI **Storage** tab:
   - **Before fix:** the count of cached `Graph` RDDs grows monotonically per inner do/while iteration.
   - **After fix:** stays bounded at the small constant set already managed by `prevSccGraph` plus its `sccWorkGraph` counterpart.
3. Existing `StronglyConnectedComponentsSuite` continues to pass; output is unchanged.

## Environment

- Spark version: master (`4.2.0-dev`); reproducible on `4.1.x`, `4.0.x`, and prior 3.x branches (the bug is unchanged since 2014).
- Component: GraphX
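
To make the "two additional generations per inner iteration" claim concrete, here is a small Spark-free model of the loop's cache lifecycle. It is only a sketch: `Storage`, `leakyLoop`, `trackedLoop`, and `prevWork` are hypothetical names invented for this illustration, not Spark APIs and not the proposed patch. In real GraphX code the new generation would presumably also need to be materialized before its predecessor is released, which this model does not capture.

```scala
import scala.collection.mutable

// Hypothetical stand-in for storage memory: tracks which generations are cached.
final class Storage {
  private val live = mutable.Set.empty[Int]
  private var nextId = 0

  // "Cache" a new generation and return its id.
  def cache(): Int = { val id = nextId; nextId += 1; live += id; id }

  // Release a previously cached generation.
  def unpersist(id: Int): Unit = live -= id

  def cachedCount: Int = live.size
}

// Mirrors the reported loop shape: two reassignments per iteration, nothing released.
def leakyLoop(iterations: Int): Int = {
  val storage = new Storage
  var work = storage.cache()   // initial work graph
  for (_ <- 1 to iterations) {
    work = storage.cache()     // (1) degree join: old generation stays cached
    work = storage.cache()     // (2) subgraph: old generation stays cached
  }
  storage.cachedCount
}

// Same loop, releasing the previous generation once its successor exists.
def trackedLoop(iterations: Int): Int = {
  val storage = new Storage
  var work = storage.cache()
  var prevWork = work          // hypothetical counterpart of prevSccGraph
  for (_ <- 1 to iterations) {
    work = storage.cache()
    storage.unpersist(prevWork); prevWork = work
    work = storage.cache()
    storage.unpersist(prevWork); prevWork = work
  }
  storage.cachedCount
}
```

In this model `leakyLoop(n)` ends with `2 * n + 1` cached generations, while `trackedLoop(n)` ends with exactly one, matching the before/after behavior described under Verification.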
