I am having a similar problem: I have a large dataset in HDFS and, for a few possible reasons (including a filter operation, and some of my compute nodes simply not being HDFS datanodes), my RDD blocks are heavily skewed: the master node always has the most, the worker nodes have few, and the non-HDFS nodes have none.
What is the preferred way to rebalance this RDD across the cluster? Some of my nodes are very underutilized :( I have tried .coalesce(15000, shuffle = false), which helps a little, but the blocks are still not evenly distributed.
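For concreteness, here is a minimal sketch of the two variants in question. It is not a confirmed fix for the skew described above. The object name, the helper partitionSizes, the local[4] master, and the toy data sizes are all assumptions made for illustration. coalesce(n, shuffle = false) only merges existing partitions, so records tend to stay where they already are; repartition(n) is the same as coalesce(n, shuffle = true) and redistributes records across the new partitions. Where the resulting cached blocks end up still depends on where the scheduler runs the tasks, so even partition sizes do not by themselves guarantee even block placement.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RebalanceSketch {
  // Number of records in each partition, in partition order.
  def partitionSizes[T](rdd: RDD[T]): Array[Int] =
    rdd.mapPartitions(it => Iterator(it.size)).collect()

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 4 threads, just to make the sketch runnable.
    val conf = new SparkConf().setAppName("rebalance-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Toy stand-in for the skewed RDD: 8 partitions, but the filter
      // keeps only records that all live in the first partition.
      val skewed = sc.parallelize(1 to 80000, 8).filter(_ <= 10000)

      // What was tried: merge partitions without a shuffle.
      val merged = skewed.coalesce(4, shuffle = false)

      // Alternative: a full shuffle, equivalent to coalesce(4, shuffle = true).
      val shuffled = skewed.repartition(4)

      println("before:      " + partitionSizes(skewed).mkString(", "))
      println("no shuffle:  " + partitionSizes(merged).mkString(", "))
      println("repartition: " + partitionSizes(shuffled).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Printing the partition sizes before and after makes it easy to see whether the skew is in the partitioning itself or only in where the blocks are cached. Note that the shuffle moves the whole dataset over the network, which can be expensive for a large RDD.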