Hey all, I’ve often found that my Spark programs run much more stably with a higher number of partitions, and a lot of the graphs I deal with have a few hundred large part files. Would it make sense to add a parameter to GraphLoader, defaulting to false, that is passed through as the shuffle argument to coalesce? Is this something that might be added to GraphX, or is there a good reason it was left out? I’ve been using this patch myself for a couple of weeks.
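To make the difference concrete, here is a minimal sketch (not part of the patch) of how RDD.coalesce behaves with and without the shuffle flag. The object name CoalesceShuffleSketch, the partitionCounts helper, the local[2] master, and the synthetic edge lines are all assumptions for illustration only; the one real API call being shown is coalesce(numPartitions, shuffle).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; nothing here exists in GraphX itself.
object CoalesceShuffleSketch {

  // Returns the partition counts produced by four coalesce calls
  // on an 8-partition RDD of synthetic "src dst" edge lines.
  def partitionCounts(sc: SparkContext): (Int, Int, Int, Int) = {
    val lines = sc.parallelize((0 until 1000).map(i => s"$i ${i + 1}"), 8)

    // Without shuffle: existing partitions are merged, no data is redistributed.
    val narrow = lines.coalesce(4)
    // With shuffle: records are redistributed across the 4 target partitions.
    val shuffled = lines.coalesce(4, shuffle = true)
    // Without shuffle, asking for more partitions leaves the count unchanged.
    val notGrown = lines.coalesce(16)
    // With shuffle, the partition count can also be increased.
    val grown = lines.coalesce(16, shuffle = true)

    (narrow.partitions.length, shuffled.partitions.length,
      notGrown.partitions.length, grown.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-shuffle-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc)) // prints (4,4,8,16)
    } finally {
      sc.stop()
    }
  }
}
```

In GraphLoader the flag would only affect how the input lines are laid out across partitions before the edge partitions are built, which is why I left the default at false.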
—Jeff
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala b/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
index f4c7936..b2f9e9c 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
@@ -58,13 +58,14 @@ object GraphLoader extends Logging {
canonicalOrientation: Boolean = false,
minEdgePartitions: Int = 1,
edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
- vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
+ vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
+ shuffle: Boolean = false)
: Graph[Int, Int] =
{
val startTime = System.currentTimeMillis
// Parse the edge data table directly into edge partitions
- val lines = sc.textFile(path, minEdgePartitions).coalesce(minEdgePartitions)
+ val lines = sc.textFile(path, minEdgePartitions).coalesce(minEdgePartitions, shuffle)
val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
val builder = new EdgePartitionBuilder[Int, Int]
iter.foreach { line =>
