Dear Spark developers,

I would like to understand GraphX caching behavior with regard to PageRank in Spark, in particular in the following implementation of PageRank: https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
On each iteration a new graph is created and cached, and the old graph is un-cached:

1) Create the new graph and cache it:

    rankGraph = rankGraph.joinVertices(rankUpdates) {
      (id, oldRank, msgSum) => rPrb(src, id) + (1.0 - resetProb) * msgSum
    }.cache()

2) Unpersist the old one:

    prevRankGraph.vertices.unpersist(false)
    prevRankGraph.edges.unpersist(false)

According to the code, at the end of each iteration only one graph should be in memory, i.e. one EdgeRDD and one VertexRDD. During the iteration, exactly between the two lines of code above, there are two graphs, old and new, i.e. two pairs of Edge and Vertex RDDs. However, when I run the example provided in the Spark examples folder, I observe different behavior.

Run the example (I checked that it executes the code mentioned above):

    $SPARK_HOME/bin/spark-submit --class "org.apache.spark.examples.graphx.SynthBenchmark" --master spark://mynode.net:7077 $SPARK_HOME/examples/target/spark-examples.jar

According to the "Storage" tab and the RDD DAG in the Spark UI, 3 VertexRDDs and 3 EdgeRDDs are cached, even after all iterations have finished, although the code above suggests caching at most 2 (and only during a particular stage of the iteration):
https://drive.google.com/file/d/0BzYMzvDiCep5WFpnQjFzNy0zYlU/view?usp=sharing

Edges (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S2JtYnhVTlV1Sms/view?usp=sharing

Vertices (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S1k4N2NFb05RZDA/view?usp=sharing

Could you explain why 3 VertexRDDs and 3 EdgeRDDs are cached? And is it OK that there is double caching in the code, given that joinVertices implicitly caches the vertices and then the whole graph is cached again in the PageRank code?

Best regards,
Alexander
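For reference, here is a minimal sketch of the per-iteration cache/unpersist pattern I am describing. This is not the actual PageRank source: `numIter` and `computeRankUpdates` are hypothetical placeholders (the real code produces `rankUpdates` via `aggregateMessages`), and the join function is simplified to the non-personalized form.

```scala
import org.apache.spark.graphx._

// Sketch only: assumes an initial cached graph and a hypothetical
// helper that stands in for the aggregateMessages step.
def iterate(initial: Graph[Double, Double],
            numIter: Int,
            resetProb: Double,
            computeRankUpdates: Graph[Double, Double] => VertexRDD[Double])
  : Graph[Double, Double] = {

  var rankGraph = initial.cache()
  var iteration = 0
  while (iteration < numIter) {
    // Messages for this iteration (placeholder for aggregateMessages).
    val rankUpdates = computeRankUpdates(rankGraph)

    val prevRankGraph = rankGraph
    // (1) Build the new graph and cache it. Between this point and
    //     step (2), BOTH graphs are marked as cached: two VertexRDDs
    //     and two EdgeRDDs.
    rankGraph = rankGraph.joinVertices(rankUpdates) {
      (id, oldRank, msgSum) => resetProb + (1.0 - resetProb) * msgSum
    }.cache()

    // Materialize the new graph before dropping the old one.
    rankGraph.edges.foreachPartition(_ => {})

    // (2) Unpersist the old graph; afterwards only one graph should
    //     remain cached, i.e. one VertexRDD and one EdgeRDD.
    prevRankGraph.vertices.unpersist(false)
    prevRankGraph.edges.unpersist(false)

    iteration += 1
  }
  rankGraph
}
```

My expectation, based on this pattern, is that the Storage tab would never show more than two cached graphs at a time.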