Dear developers,

I am running some tests using the Pregel API.
It seems to me that more than 90% of the volume of a graph object consists of index structures that do not change during the execution of Pregel. When a graph is too large to fit in memory, Pregel persists the intermediate graph to disk on every iteration, which involves a lot of repeated disk writes.

In my test (shortest path), I save only one copy of the initial graph and maintain only a var of RDD[(VertexId, VD)]. To create new messages, on each iteration I build a new graph from the updated RDD[(VertexId, VD)] and the fixed data in the initial graph. On a slow NTFS hard drive, I observed around a 40% overall improvement. Note that my updateVertices (corresponding to joinVertices) and edges.upgrade are not yet optimized (they could be optimized following the approach GraphX already uses), so the improvement should come from reduced I/O.

So my question is: do you think the current Pregel flow could be improved by persisting only the small, changing portion of a large Graph object? If there are other concerns, could you explain them?

Best regards,
Fang

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Saving-less-data-to-improve-Pregel-performance-in-GraphX-tp18762.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
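[Editor's note: the scheme described above — a fixed edge structure plus a small, changing vertex-attribute dataset — can be sketched as follows. This is a plain-Scala illustration, not the GraphX API: the edge list stands in for the immutable EdgeRDD, and the `dist` map stands in for the RDD[(VertexId, VD)] that is the only state carried between iterations.]

```scala
// Illustrative sketch (no Spark): shortest paths where only the vertex
// attributes change between iterations, while the edge structure stays fixed.
// This mirrors the idea of persisting just RDD[(VertexId, VD)] and rebuilding
// the graph from it plus the unchanged initial-graph data each round.

// Fixed "edge" data: (srcId, dstId, weight). In GraphX this would be the
// immutable EdgeRDD of the initial graph, saved once.
val edges = List((1L, 2L, 1.0), (2L, 3L, 2.0), (1L, 3L, 5.0))
val source = 1L

// Mutable part: current shortest distance per vertex, the analogue of the
// small RDD[(VertexId, VD)] kept across iterations.
var dist: Map[Long, Double] =
  edges.flatMap(e => List(e._1, e._2)).distinct
    .map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap

var changed = true
while (changed) {
  // "sendMsg": candidate distances along each (unchanged) edge.
  val msgs = edges.flatMap { case (src, dst, w) =>
    if (dist(src).isInfinity) None else Some(dst -> (dist(src) + w))
  }
  // "mergeMsg" + "vprog": keep the minimum candidate per vertex.
  val best = msgs.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).min }
  val next = dist.map { case (v, d) => v -> math.min(d, best.getOrElse(v, d)) }
  changed = next != dist
  dist = next
}

println(dist) // shortest distances from the source vertex
```

In the real GraphX version, the loop body would join the updated vertex RDD back onto the saved initial graph (the role of updateVertices / joinVertices in the message above) so that only the small vertex-attribute RDD ever needs to be checkpointed.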