Dear developers,

I am running some tests using the Pregel API in GraphX.

It seems to me that more than 90% of the volume of a graph object consists of
index structures that do not change during the execution of Pregel. When a
graph is too large to fit in memory, Pregel persists the intermediate graph to
disk in every iteration, which seems to involve a lot of repeated writes of
the same unchanged data.
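For reference, the baseline I am comparing against is the standard Pregel-based
shortest paths, roughly as in the GraphX programming guide (graph and sourceId
below are placeholders for my actual test inputs); every superstep materializes
a new Graph, so spilling it to disk rewrites the unchanged edge and index data
as well:

  import org.apache.spark.graphx._

  // graph: Graph[_, Double] with edge weights; sourceId: the source vertex.
  val initialGraph = graph.mapVertices((id, _) =>
    if (id == sourceId) 0.0 else Double.PositiveInfinity)

  val sssp = initialGraph.pregel(Double.PositiveInfinity)(
    (id, dist, newDist) => math.min(dist, newDist),        // vertex program
    triplet => {                                            // send message
      if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
        Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
      } else {
        Iterator.empty
      }
    },
    (a, b) => math.min(a, b)                                // merge messages
  )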

In my test (Shortest Path), I save only one copy of the initial graph and
maintain only a var of type RDD[(VertexId, VD)]. During each iteration, to
create the new messages, I build a new graph from the updated
RDD[(VertexId, VD)] and the fixed data in the initial graph. Using a slow NTFS
hard drive, I observed around a 40% overall improvement. Note that my
updateVertices (corresponding to joinVertices) and edges.upgrade are not
optimized yet (they could be optimized following the approach GraphX takes),
so the improvement should come from reduced I/O.
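A rough sketch of the idea (simplified; graph and sourceId are again
placeholders, and I use outerJoinVertices/aggregateMessages here in place of
my own updateVertices and edges.upgrade):

  import org.apache.spark.graphx._
  import org.apache.spark.rdd.RDD
  import org.apache.spark.storage.StorageLevel

  // graph: Graph[_, Double] with edge weights, persisted exactly once.
  graph.persist(StorageLevel.MEMORY_AND_DISK)

  // Only the per-vertex state (the distances) lives in a mutable var.
  var dists: RDD[(VertexId, Double)] = graph.vertices.map {
    case (id, _) => (id, if (id == sourceId) 0.0 else Double.PositiveInfinity)
  }

  var activeMessages = 1L
  while (activeMessages > 0) {
    // Rebuild a graph view from the fixed structure plus the current
    // distances; the edge partitions and their indexes come from the
    // already-persisted initial graph.
    val g = graph.outerJoinVertices(dists) { (_, _, d) =>
      d.getOrElse(Double.PositiveInfinity)
    }

    // Standard shortest-path relaxation messages.
    val msgs = g.aggregateMessages[Double](
      ctx => if (ctx.srcAttr + ctx.attr < ctx.dstAttr)
        ctx.sendToDst(ctx.srcAttr + ctx.attr),
      math.min)

    activeMessages = msgs.count()

    // Fold the messages into the small vertex-state RDD only; the bulky
    // edge and index structures are never written out again.
    dists = dists.leftOuterJoin(msgs).mapValues {
      case (d, m) => math.min(d, m.getOrElse(Double.PositiveInfinity))
    }
  }

In practice the dists RDD would also need to be persisted (and periodically
checkpointed) each iteration so its lineage does not keep growing, but that
state is much smaller than the full Graph.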

So my question is: do you think the current Pregel flow could be improved by
persisting only the small, changing portion of a large Graph object instead of
the whole graph? If there are other concerns that rule this out, could you
explain them?

Best regards,
Fang


