dkranchii opened a new issue, #55553:
URL: https://github.com/apache/spark/issues/55553

   ## Summary
   
   `StronglyConnectedComponents.run` in GraphX reassigns the cached 
`sccWorkGraph` twice per inner iteration without unpersisting the previous 
generation. The companion variable `sccGraph` *is* handled correctly via 
`prevSccGraph.unpersist()`, but the equivalent treatment for `sccWorkGraph` was 
never added, which leaves an asymmetric cache leak in the public SCC 
implementation.
   
   ## Affected code
   
   
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala#L45-L80
   
   ```scala
   var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()
   var prevSccGraph = sccGraph    // tracks sccGraph generations
   // no equivalent for sccWorkGraph
   
   while (sccWorkGraph.numVertices > 0 && iter < numIter) {
     iter += 1
     do {
       numVertices = sccWorkGraph.numVertices
       sccWorkGraph = sccWorkGraph
         .outerJoinVertices(sccWorkGraph.outDegrees) { ... }
         .outerJoinVertices(sccWorkGraph.inDegrees) { ... }
         .cache()        // (1) prior generation never unpersisted
   
       // ... derive finalVertices, update sccGraph (correctly tracked) ...
   
       // (2) prior generation never unpersisted
       sccWorkGraph = sccWorkGraph.subgraph(vpred = ...).cache()
     } while (sccWorkGraph.numVertices < numVertices)
   }
   ```
   
   `grep -n unpersist` on this file returns only references to 
`prevSccGraph.unpersist()` — there is no `sccWorkGraph.unpersist()` anywhere.
   
   ## Why it matters
   
   - This is the canonical GraphX SCC implementation, surfaced publicly via 
`Graph[VD,ED].stronglyConnectedComponents(numIter)`.
   - Each inner do/while iteration pins **two** additional `sccWorkGraph` 
generations in storage memory.
   - On graphs that take many inner steps to converge (large SCCs, high 
diameter), the accumulated cached state competes with useful blocks for 
storage memory. Under `MEMORY_ONLY`, the default for `Graph.cache()`, evicted 
blocks are recomputed through the join lineage; under `MEMORY_AND_DISK` they 
spill to disk. Either way the result is executor memory pressure, and in the 
worst case OOMs, typically reported as "SCC slows down progressively".
   - `ContextCleaner` will eventually unpersist orphaned RDDs when the JVM GCs 
Scala references, but timing is non-deterministic and lags the loop body — 
exactly the failure mode the existing `prevSccGraph.unpersist()` was added to 
prevent.
   
   ## Scope
   
   - [x] File touched: `graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala` only
   - [x] No public API change
   - [x] No algorithmic / output change
   - [x] Mirrors the existing `prevSccGraph` lifecycle in the same function
   - [x] Backportable to all maintained branches
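   
   A minimal sketch of what mirroring the `prevSccGraph` lifecycle could look 
like, written against a toy chain graph rather than the real function body. 
The `rotate` helper, the object name, and the input graph are hypothetical and 
exist only for illustration; the GraphX calls (`cache`, `unpersist`, 
`outerJoinVertices`, `subgraph`, `outDegrees`, `inDegrees`) are the ones the 
existing code already uses. The sketch assumes it is safe to release the old 
generation once the new one has been materialized, which is the same assumption 
the existing `prevSccGraph.unpersist()` call makes.
   
   ```scala
   import org.apache.spark.graphx.{Edge, Graph, VertexId}
   import org.apache.spark.sql.SparkSession
   
   object SccWorkGraphRotation {
     // Hypothetical helper (not part of GraphX): cache and materialize `next`,
     // then release `prev`, so at most one stale generation is ever pinned.
     def rotate[VD, ED](prev: Graph[VD, ED], next: Graph[VD, ED]): Graph[VD, ED] = {
       next.cache()
       next.vertices.count()   // materialize vertices before dropping `prev`
       next.edges.count()      // materialize edges before dropping `prev`
       if (prev ne next) prev.unpersist(blocking = false)
       next
     }
   
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[2]").appName("scc-work-graph-rotation").getOrCreate()
       val sc = spark.sparkContext
   
       // Toy input (assumption): the chain 1 -> 2 -> 3 -> 4.
       val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
       var work: Graph[(VertexId, Boolean), Int] =
         Graph.fromEdges(edges, 0).mapVertices { case (vid, _) => (vid, false) }.cache()
   
       var numVertices = work.numVertices
       do {
         numVertices = work.numVertices
         // Same shape as reassignment (1): mark vertices with no out- or in-edges.
         val marked = work
           .outerJoinVertices(work.outDegrees) {
             (vid, data, degreeOpt) => if (degreeOpt.isDefined) data else (vid, true)
           }
           .outerJoinVertices(work.inDegrees) {
             (vid, data, degreeOpt) => if (degreeOpt.isDefined) data else (vid, true)
           }
         work = rotate(work, marked)
         // Same shape as reassignment (2): keep only vertices not marked final.
         work = rotate(work, work.subgraph(vpred = (_, data) => !data._2))
       } while (work.numVertices < numVertices)
   
       println(s"persistent RDDs after loop: ${sc.getPersistentRDDs.size}")
       spark.stop()
     }
   }
   ```
   
   In the real function the first materialization may already happen as a side 
effect of updating `sccGraph`, so an actual patch could be smaller than this 
helper suggests; the explicit `count()` calls here only make the ordering 
obvious.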
   
   ## Verification
   
   1. Run `graph.stronglyConnectedComponents(numIter = 20)` on any `Graph` 
instance that requires multiple inner iterations to converge.
   2. Spark UI **Storage** tab:
      - **Before fix:** the count of cached `Graph` RDDs grows monotonically 
per inner do/while iteration.
      - **After fix:** stays bounded at the small constant set already managed 
by `prevSccGraph` plus its `sccWorkGraph` counterpart.
   3. Existing `StronglyConnectedComponentsSuite` continues to pass — output is 
unchanged.
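   
   The Storage-tab check in step 2 can also be approximated programmatically. 
The following is a rough sketch, not a proposed test: it counts entries in 
`SparkContext.getPersistentRDDs` before and after an SCC run on a small chain 
graph. The object name and the input graph are assumptions made for 
illustration, and because `ContextCleaner` may release orphaned RDDs 
asynchronously, the "after" number is only indicative.
   
   ```scala
   import org.apache.spark.graphx.{Edge, Graph}
   import org.apache.spark.sql.SparkSession
   
   object SccCacheCount {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[2]").appName("scc-cache-count").getOrCreate()
       val sc = spark.sparkContext
   
       // Chain 0 -> 1 -> ... -> 9 (assumed input): each inner pass can only
       // strip the current endpoints, so several inner iterations are needed.
       val edges = sc.parallelize((0L until 9L).map(i => Edge(i, i + 1L, 1)))
       val graph = Graph.fromEdges(edges, 0).cache()
       graph.vertices.count()
       graph.edges.count()
   
       val before = sc.getPersistentRDDs.size
       val scc = graph.stronglyConnectedComponents(numIter = 20)
       scc.vertices.count()   // force the SCC computation to run
       val after = sc.getPersistentRDDs.size
   
       // Compare the growth (after - before) with and without the fix applied.
       println(s"persistent RDDs: before=$before after=$after")
       spark.stop()
     }
   }
   ```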
   
   ## Environment
   
   - Spark version: master (`4.2.0-dev`); reproducible on `4.1.x`, `4.0.x`, and 
prior 3.x branches (the bug is unchanged since 2014).
   - Component: GraphX
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

