RE: Control default partition when load a RDD from HDFS

Shuai Zheng Wed, 17 Dec 2014 07:02:43 -0800

Nice, that is the answer I want. 

Thanks!

From: Sun, Rui [mailto:rui....@intel.com] 
Sent: Wednesday, December 17, 2014 1:30 AM
To: Shuai Zheng; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS

Hi, Shuai,

How did you turn off the file split in Hadoop? I guess you might have
implemented a customized FileInputFormat which overrides isSplitable() to
return FALSE. If you do have such FileInputFormat, you can simply pass it as
a constructor parameter to HadoopRDD or NewHadoopRDD in Spark.

From: Shuai Zheng [mailto:szheng.c...@gmail.com] 
Sent: Wednesday, December 17, 2014 4:16 AM
To: user@spark.apache.org
Subject: Control default partition when load a RDD from HDFS

Hi All,

My application load 1000 files, each file from 200M -  a few GB, and combine
with other data to do calculation. 

Some pre-calculation must be done on each file level, then after that, the
result need to combine to do further calculation. 

In Hadoop, it is simple because I can turn-off the file split for input
format (to enforce each file will go to same mapper), then I will do the
file level calculation in mapper and pass result to reducer. But in spark,
how can I do it?

Basically I want to make sure after I load these files into RDD, it is
partitioned by file (not split file and also no merge there), so I can call
mapPartitions. Is it any way I can control the default partition when I load
the RDD? 

This might be the default behavior that spark do the partition (partitioned
by file when first time load the RDD), but I can't find any document to
support my guess, if not, can I enforce this kind of partition? Because the
total file size is bigger, I don't want to re-partition in the code. 

Regards,

Shuai

RE: Control default partition when load a RDD from HDFS

Reply via email to