Spark shuffle uses the Java File API to create local dirs and read/write
data, so it only works with filesystems supported by the OS. It doesn't
leverage the Hadoop FileSystem API, so writing to a Hadoop-compatible FS
is not supported.
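
For example, here is a minimal sketch just to illustrate the two APIs
being contrasted (the paths are hypothetical):

    import java.io.File                             // what the shuffle code uses
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}  // what HDFS would require

    // Shuffle directories are created with local-filesystem calls like this:
    val localDir = new File("/tmp/spark-shuffle")
    localDir.mkdirs()

    // Writing to HDFS would instead need the Hadoop FileSystem API:
    val fs = FileSystem.get(new Configuration())
    val out = fs.create(new Path("hdfs:///tmp/example"))
    out.close()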

Also, it is not suitable to write temporary shuffle data to a distributed
FS; this would bring unnecessary overhead. In your case, if you have large
memory on each node, you could use ramfs instead to store shuffle data.
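
For example, a minimal sketch (assuming a tmpfs/ramfs is already mounted
at the hypothetical path /mnt/ramdisk, e.g. via
"mount -t tmpfs -o size=64g tmpfs /mnt/ramdisk"):

    import org.apache.spark.{SparkConf, SparkContext}

    // Point Spark's shuffle/spill directory at the RAM-backed mount.
    // /mnt/ramdisk is a hypothetical path; mount a tmpfs there first.
    val conf = new SparkConf()
      .setAppName("shuffle-on-ramfs")
      .set("spark.local.dir", "/mnt/ramdisk")
    val sc = new SparkContext(conf)

Note that on a cluster this setting is overridden by SPARK_LOCAL_DIRS
(Standalone) or LOCAL_DIRS (YARN), so there you would set the
corresponding environment variable instead.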

Thanks
Saisai

On Wed, Aug 24, 2016 at 8:11 PM, tony....@tendcloud.com <tony....@tendcloud.com> wrote:

> Hi, All,
> When we run Spark on very large data, Spark does a shuffle and the
> shuffle data is written to local disk. Because we have limited capacity
> on the local disks, the shuffle data occupies all of the local disk and
> the job then fails. So is there a way we can write the shuffle spill data
> to HDFS? Or if we introduce Alluxio into our system, can the shuffle data
> be written to Alluxio?
>
> Thanks and Regards,
>
> ------------------------------
> 阎志涛 (Tony)
>
> Beijing TalkingData Technology Co., Ltd. (北京腾云天下科技有限公司)
> --------------------------------------------------------------------------
> Email: tony....@tendcloud.com
> Phone: 13911815695
> WeChat: zhitao_yan
> QQ: 4707059
> Address: Room 602, Aviation Service Building, Building 2, No. 39 Dongzhimenwai Street, Dongcheng District, Beijing
> Postal code: 100027
> --------------------------------------------------------------------------
> TalkingData.com <http://talkingdata.com/> - Let data speak
>
