I am having a similar problem:

I have a large dataset in HDFS and, for a few possible reasons (including a
filter operation, and some of my computation nodes simply not being HDFS
datanodes), my RDD blocks are heavily skewed: the master node always has
the most, the worker nodes have few, and the non-HDFS nodes have
none.

What is the preferred way to rebalance this RDD across the cluster? Some of
my nodes are very underutilized :( I have tried:

.coalesce(15000, shuffle = false)

which helps a little, but things are still not evenly distributed...
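
For comparison, here is a minimal sketch of the shuffle-enabled variant of the
same call. As far as I understand the RDD API, repartition(n) is shorthand for
coalesce(n, shuffle = true), and the shuffle redistributes records across the
new partitions instead of only merging existing ones. The local master, the
synthetic data, the partition counts, and the names RebalanceSketch,
partitionSizes and compare are assumptions for illustration, not the job
described above. Whether this also evens out cached block placement on a real
cluster would still need to be checked in the Storage tab of the web UI.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RebalanceSketch {
  // Hypothetical helper: number of records held by each partition.
  def partitionSizes[T](rdd: RDD[T]): Array[Int] =
    rdd.mapPartitions(it => Iterator(it.size)).collect()

  // Builds a deliberately skewed RDD and returns the per-partition sizes
  // after coalescing without and with a shuffle.
  def compare(sc: SparkContext): (Array[Int], Array[Int]) = {
    // Synthetic stand-in for the HDFS data: after the filter, only the
    // first few of the 100 input partitions still contain records.
    val skewed = sc.parallelize(1 to 100000, 100).filter(_ <= 10000)

    // What was tried above: merges existing partitions, no data movement.
    val narrowed = skewed.coalesce(10, shuffle = false)

    // Shuffle-enabled variant; equivalent to skewed.repartition(10).
    val shuffled = skewed.coalesce(10, shuffle = true)

    (partitionSizes(narrowed), partitionSizes(shuffled))
  }

  def main(args: Array[String]): Unit = {
    // local[4] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("rebalance-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val (narrowed, shuffled) = compare(sc)
      println("shuffle = false: " + narrowed.mkString(", "))
      println("shuffle = true:  " + shuffled.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

The trade-off is that the shuffle moves the whole dataset over the network
once, so it probably only pays off if the rebalanced RDD is cached and reused.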



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cached-RDD-Block-Size-Uneven-Distribution-tp11286p12055.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
