Hi all, I am using GraphX in spark-0.9.0-incubating. Our graph can have 100 million vertices and 1 billion edges, so I have to use my limited memory carefully. I have some questions about the GraphX module.
1. Why do some transformations, like partitionBy and mapVertices, cache the new graph, while others, like outerJoinVertices, do not?

2. I use the Pregel API and read only edgeTriplet.srcAttr in sendMsg. On the resulting graph I then call graph.mapReduceTriplets, reading both edgeTriplet.srcAttr and edgeTriplet.dstAttr in the map function. I found that, because of the ReplicatedVertexView implementation, Spark recomputes the whole graph even though it should already have been computed. Can anyone explain the implementation here?

3. Why does VertexPartition not extend Serializable? It is used by an RDD.

4. Can you provide a "spark.default.cache.useDisk" option so that caching uses DISK_ONLY by default?

- Wu Zeming
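To make the Pregel/mapReduceTriplets question concrete, here is a minimal sketch of what my job does. The graph data, attribute types, and operator arguments below are illustrative stand-ins, not my actual job: a Pregel run whose sendMsg reads only srcAttr, followed by a mapReduceTriplets that reads both srcAttr and dstAttr, which is where I see the earlier stages recomputed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object PregelThenMRT {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pregel-then-mrt").setMaster("local"))

    // Tiny stand-in graph; the real one has ~100M vertices / ~1B edges.
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    // Step 1: Pregel whose sendMsg reads only triplet.srcAttr.
    val result = Pregel(graph, initialMsg = 0, maxIterations = 3)(
      vprog = (id, attr, msg) => math.max(attr, msg),
      sendMsg = t => Iterator((t.dstId, t.srcAttr + 1)),
      mergeMsg = (a, b) => math.max(a, b))

    // Step 2: mapReduceTriplets reading both srcAttr and dstAttr.
    // Here ReplicatedVertexView also has to ship dst attributes, and
    // this is the point where the recomputation shows up for me.
    val sums = result.mapReduceTriplets[Int](
      t => Iterator((t.dstId, t.srcAttr + t.dstAttr)),
      (a, b) => a + b)

    sums.collect().foreach(println)
    sc.stop()
  }
}
```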
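For the DISK_ONLY question, the workaround I have been considering is to persist the graph explicitly on disk. A sketch, assuming Graph exposes a persist(StorageLevel) method the way plain RDDs do; even so, as far as I can tell this would not change the MEMORY_ONLY cache() calls that GraphX operators issue internally on intermediate RDDs, which is exactly why a global default would help:

```scala
import org.apache.spark.graphx._
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: persist a graph's vertices and edges on disk only.
// Intermediate RDDs cached inside GraphX operators are not affected.
def persistOnDisk[VD, ED](g: Graph[VD, ED]): Graph[VD, ED] =
  g.persist(StorageLevel.DISK_ONLY)
```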