I think for executor distribution: in YARN mode, the RM normally tries its
best to distribute containers evenly across nodes if you don't explicitly
specify preferred hosts. In standalone mode, one node normally runs only
one executor, so executor distribution is usually not a big problem.

Data skew, on the other hand, will lead to unbalanced task execution times
and intermediate data spill. If some of your nodes process a
disproportionately large share of the data, those nodes will spill more
data and will run out of disk space more easily. I'm not sure whether this
is the problem you are actually hitting; it is hard to solve, and IMO it
needs to be fixed at the level of the data and the application's
implementation.
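If the imbalance does come from a few hot keys, one common
application-level workaround is key salting. Here is a minimal sketch (the
input data, salt count, and app setup are made up for illustration):

import scala.util.Random

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on pre-1.3 Spark

object SaltedCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("salted-count"))

    // Hypothetical skewed input: one hot key dominates the data set.
    val pairs = sc.parallelize(
      Seq.fill(100000)(("hot", 1L)) ++ Seq.fill(100)(("cold", 1L)))

    val numSalts = 16 // hypothetical; tune to the skew you observe

    val counts = pairs
      .map { case (k, v) => ((k, Random.nextInt(numSalts)), v) } // spread hot key
      .reduceByKey(_ + _)                 // partial sums per (key, salt) bucket
      .map { case ((k, _), v) => (k, v) } // drop the salt
      .reduceByKey(_ + _)                 // combine partials into final counts

    counts.collect().foreach(println)
    sc.stop()
  }
}

This spreads the records of a hot key over numSalts partitions during the
expensive first aggregation, at the cost of a second, much smaller shuffle.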
2015-05-06 21:21 GMT+08:00 Yifan LI <iamyifa...@gmail.com>:

> Yes, you are right. For now I have to say the workload/executors are
> distributed evenly… so, like you said, it is difficult to improve the
> situation.
>
> However, do you have any idea of how to create a *skewed* data/executor
> distribution?
>
> Best,
> Yifan LI
>
> On 06 May 2015, at 15:13, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
> I think it depends on your workload and executor distribution: if your
> workload is evenly distributed without any big data skew, and the
> executors are evenly distributed across the nodes, then the storage
> usage of each node is nearly the same. Spark itself cannot rebalance the
> storage overhead you mentioned.
>
> 2015-05-06 21:09 GMT+08:00 Yifan LI <iamyifa...@gmail.com>:
>
>> Thanks, Shao. :-)
>>
>> I am wondering whether Spark will rebalance the storage overhead at
>> runtime… since there is still some space available on the other nodes.
>>
>> Best,
>> Yifan LI
>>
>> On 06 May 2015, at 14:57, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>
>> I think you could configure multiple disks through spark.local.dir; the
>> default is /tmp. Even so, if your intermediate data is larger than the
>> available disk space, you will still hit this issue. From the
>> configuration docs:
>>
>>   spark.local.dir (default: /tmp)
>>       Directory to use for "scratch" space in Spark, including map
>>       output files and RDDs that get stored on disk. This should be on
>>       a fast, local disk in your system. It can also be a
>>       comma-separated list of multiple directories on different disks.
>>       NOTE: in Spark 1.0 and later this will be overridden by
>>       SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN)
>>       environment variables set by the cluster manager.
>>
>> 2015-05-06 20:35 GMT+08:00 Yifan LI <iamyifa...@gmail.com>:
>>
>>> Hi,
>>>
>>> I am running my GraphX application on Spark, but it failed with a
>>> "no space left on device" error on one executor node (on which the
>>> available HDFS space is small).
>>>
>>> I can understand why it happened: my vertex(-attribute) RDD was
>>> growing bigger and bigger during the computation…, so at some point
>>> the space requested on that node may have been larger than what was
>>> available.
>>>
>>> But is there any way to avoid this kind of error? I am sure that the
>>> overall disk space across all nodes is enough for my application.
>>>
>>> Thanks in advance!
>>>
>>> Best,
>>> Yifan LI
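Following up on the spark.local.dir point above, a minimal sketch of
spreading scratch space over multiple disks (the mount points and app name
are hypothetical; as noted, SPARK_LOCAL_DIRS / LOCAL_DIRS set by the
cluster manager take precedence in Spark 1.0+):

import org.apache.spark.{SparkConf, SparkContext}

// Point Spark's "scratch" space at several physical disks so shuffle
// output and spilled RDD blocks are spread out rather than filling one
// volume. The mount points below are hypothetical.
val conf = new SparkConf()
  .setAppName("graphx-job") // hypothetical
  .set("spark.local.dir",
    "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")
val sc = new SparkContext(conf)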