Nice, that is the answer I want. Thanks!
From: Sun, Rui [mailto:rui....@intel.com] Sent: Wednesday, December 17, 2014 1:30 AM To: Shuai Zheng; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Hi, Shuai, How did you turn off the file split in Hadoop? I guess you might have implemented a customized FileInputFormat which overrides isSplitable() to return FALSE. If you do have such FileInputFormat, you can simply pass it as a constructor parameter to HadoopRDD or NewHadoopRDD in Spark. From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, December 17, 2014 4:16 AM To: user@spark.apache.org Subject: Control default partition when load a RDD from HDFS Hi All, My application load 1000 files, each file from 200M - a few GB, and combine with other data to do calculation. Some pre-calculation must be done on each file level, then after that, the result need to combine to do further calculation. In Hadoop, it is simple because I can turn-off the file split for input format (to enforce each file will go to same mapper), then I will do the file level calculation in mapper and pass result to reducer. But in spark, how can I do it? Basically I want to make sure after I load these files into RDD, it is partitioned by file (not split file and also no merge there), so I can call mapPartitions. Is it any way I can control the default partition when I load the RDD? This might be the default behavior that spark do the partition (partitioned by file when first time load the RDD), but I can't find any document to support my guess, if not, can I enforce this kind of partition? Because the total file size is bigger, I don't want to re-partition in the code. Regards, Shuai