I have a large list of edges as a 5000-partition RDD. Now I'm running a simple but shuffle-heavy operation:
val g = Graph.fromEdges(edges, ...).partitionBy(...)   // build the graph, repartition its edges
val subs = Graph(g.collectEdges(...), g.edges).collectNeighbors()   // shuffle-heavy neighborhood collection
subs.saveAsObjectFile("hdfs://...")
The job gets divided into 9
Could it be that your data is skewed? Do you have variable-length column
types?
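One quick way to test the skew hypothesis is to count records per partition and compare the sizes; in Spark that would be something along the lines of edges.mapPartitionsWithIndex((i, it) => Iterator(i -> it.size)).collect(). Below is a plain-Scala sketch of the same idea (the SkewCheck object, its bucket count, and the sample key distribution are illustrative assumptions, not from this thread), modeling how Spark's HashPartitioner assigns keys to partitions:

```scala
// Sketch: detect key skew by modeling how Spark's HashPartitioner
// buckets keys. Names and sample data here are illustrative only.
object SkewCheck {
  // Same bucket assignment as Spark's HashPartitioner:
  // a non-negative (hashCode mod numPartitions).
  def partition(key: Long, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Count how many keys land in each bucket.
  def partitionSizes(keys: Seq[Long], numPartitions: Int): Map[Int, Int] =
    keys.groupBy(partition(_, numPartitions)).map { case (p, ks) => p -> ks.size }

  def main(args: Array[String]): Unit = {
    // A skewed edge list: 90% of the edges hang off one hot vertex id.
    val keys = Seq.fill(900)(42L) ++ (1L to 100L)
    val sizes = partitionSizes(keys, 10)
    // The bucket holding the hot key dwarfs every other bucket.
    println(s"max=${sizes.values.max} min=${sizes.values.min}")
  }
}
```

If the largest bucket is orders of magnitude bigger than the rest, a handful of tasks will carry most of the shuffle and their executors will bear the GC pressure.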
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24762.html
Sent from the Apache Spark User List mailing list