Uneven distribution of tasks among workers in Spark/GraphX 1.5.0

2015-09-22 Thread dmytro
I have a large list of edges as a 5000-partition RDD. Now, I'm doing a simple but shuffle-heavy operation:

    val g = Graph.fromEdges(edges, ...).partitionBy(...)
    val subs = Graph(g.collectEdges(...), g.edges).collectNeighbors()
    subs.saveAsObjectFile("hdfs://...")

The job gets divided into 9

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread dmytro
Could it be that your data is skewed? Do you have variable-length column types?