RE: Control default partition when load a RDD from HDFS

2014-12-18 Thread Shuai Zheng
Sent: Wednesday, December 17, 2014 11:04 AM To: Shuai Zheng; 'Sun, Rui'; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Why isn't it a good option to create an RDD for each 200 MB file and then apply the pre-calculations before merging them? I think

RE: Control default partition when load a RDD from HDFS

2014-12-17 Thread Shuai Zheng
Nice, that is the answer I want. Thanks! From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, December 17, 2014 1:30 AM To: Shuai Zheng; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Hi, Shuai, How did you turn off the file split

RE: Control default partition when load a RDD from HDFS

2014-12-17 Thread Diego García Valverde
[mailto:szheng.c...@gmail.com] Sent: Wednesday, December 17, 2014 16:01 To: 'Sun, Rui'; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Nice, that is the answer I want. Thanks! From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, December 17, 2014 1:30

RE: Control default partition when load a RDD from HDFS

2014-12-16 Thread Sun, Rui
Hi Shuai, how did you turn off file splitting in Hadoop? I guess you might have implemented a custom FileInputFormat that overrides isSplitable() to return false. If you do have such a FileInputFormat, you can simply pass it as a constructor parameter to HadoopRDD or NewHadoopRDD in Spark.
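
To illustrate why a non-splittable input format gives one partition per file, here is a rough Python model of the split count. It is only a sketch, not Hadoop's actual code: the function name count_partitions, the 128 MB block size, and the three 200 MB files are assumptions made for the example, and it ignores Hadoop's split slop factor and Spark's minPartitions hint.

```python
import math


def count_partitions(file_sizes_mb, block_size_mb=128, splittable=True):
    """Estimate how many input partitions a set of files would produce.

    Simplified model (an assumption, not Hadoop's real algorithm):
    - splittable files yield one partition per block-sized chunk;
    - non-splittable files (isSplitable() returning false) yield
      exactly one partition each, whatever their size.
    """
    if not splittable:
        return len(file_sizes_mb)
    # An empty file is still counted as one (empty) partition.
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)


# Hypothetical input: three files of 200 MB each.
files = [200, 200, 200]
print(count_partitions(files, splittable=True))
print(count_partitions(files, splittable=False))
```

Under these assumptions the splittable case cuts each 200 MB file into two chunks, while the non-splittable case keeps each file whole, so the RDD ends up with as many partitions as there are files.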