I wonder if anyone has any tips for using repartition?

It seems that when you call the repartition method, the entire RDD gets
split up, shuffled, and redistributed. This is an extremely heavy operation
if you have a large HDFS dataset and all you want to do is make sure your
RDD is balanced / that data skew is minimal...

I have tried coalesce(numPartitions, shuffle = false), but this seems to be
somewhat ineffective at balancing the blocks.
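For intuition, here is a toy Python model of the two behaviors (this is not Spark code, and the round-robin grouping is a simplification I'm assuming; Spark's real shuffle-free coalesce groups parent partitions by locality). The key point it illustrates: a shuffle-free coalesce only merges whole parent partitions, so skew in the parents survives, while repartition redistributes individual records:

```python
# Toy model (NOT Spark's implementation): why coalesce without a shuffle
# can stay skewed while repartition() balances partitions.

def coalesce_no_shuffle(partitions, n):
    # A shuffle-free coalesce merges whole parent partitions into n groups;
    # it never splits or moves individual records, so parent skew carries
    # over. (Round-robin grouping here is a simplification for illustration.)
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i % n].extend(part)
    return groups

def repartition(partitions, n):
    # repartition() hash-partitions every individual record across the n
    # target partitions (a full shuffle), so the result comes out balanced.
    groups = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            groups[hash(record) % n].append(record)
    return groups

# One huge parent partition plus three tiny ones:
skewed = [list(range(1000)), [1000], [1001], [1002]]
print([len(p) for p in coalesce_no_shuffle(skewed, 2)])  # [1001, 2] - still skewed
print([len(p) for p in repartition(skewed, 2)])          # [502, 501] - balanced
```

This is also why coalesce is cheap: no data crosses the network, which is exactly the property that prevents it from fixing skew.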

Care to share your experiences?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Repartition-inefficient-tp13587.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
