I wonder if anyone has any tips for using repartition? It seems that when you call the repartition method, the entire RDD gets split up, shuffled, and redistributed. That is an extremely heavy operation if you have a large HDFS dataset and all you want to do is make sure your RDD is balanced and data skew is minimal.
I have tried coalesce(shuffle = false), but it seems fairly ineffective at balancing the blocks. Care to share your experiences?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Repartition-inefficient-tp13587.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.