Hmm, how would I do that? You mean create an RDD for each file? Then I will have
tons of RDDs.

And my calculation needs to rely on other input, not just the file itself.

 

Can you show some pseudo code for that logic?

 

Regards,

 

Shuai

 

From: Diego García Valverde [mailto:dgarci...@agbar.es] 
Sent: Wednesday, December 17, 2014 11:04 AM
To: Shuai Zheng; 'Sun, Rui'; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS

 

Why wouldn't it be a good option to create one RDD per 200 MB file and then apply
the pre-calculations before merging them? I think the partitioning within each RDD
should be transparent to the pre-calculations, rather than being fixed just to
optimize the Spark map/reduce stages.
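
A minimal sketch of this idea (the paths and the preCalculate step are placeholders for the real input and pre-calculation):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PerFileRddSketch {

  // Placeholder for the real per-file pre-calculation.
  def preCalculate(lines: RDD[String]): RDD[String] = lines.map(_.trim)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-file-rdds"))

    // Hypothetical list of input paths; in practice this could come from
    // listing the HDFS input directory.
    val files = Seq("hdfs:///data/part-0001", "hdfs:///data/part-0002")

    // One RDD per file, with the pre-calculation applied before merging.
    val perFile: Seq[RDD[String]] = files.map(path => preCalculate(sc.textFile(path)))

    // Merge all per-file results into a single RDD for the downstream calculation.
    val merged: RDD[String] = sc.union(perFile)
    merged.saveAsTextFile("hdfs:///output/merged")

    sc.stop()
  }
}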

 

From: Shuai Zheng [mailto:szheng.c...@gmail.com] 
Sent: Wednesday, December 17, 2014 4:01 PM
To: 'Sun, Rui'; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS

 

Nice, that is the answer I wanted.

Thanks!

 

From: Sun, Rui [mailto:rui....@intel.com] 
Sent: Wednesday, December 17, 2014 1:30 AM
To: Shuai Zheng; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS

 

Hi, Shuai,

 

How did you turn off the file split in Hadoop? I guess you might have
implemented a customized FileInputFormat which overrides isSplitable() to
return false. If you do have such a FileInputFormat, you can simply pass it as
a constructor parameter to HadoopRDD or NewHadoopRDD in Spark.
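
For example, a rough sketch using the new Hadoop API (the path is a placeholder; sc.newAPIHadoopFile builds a NewHadoopRDD internally from the input format you pass):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// A TextInputFormat that never splits a file, so each file becomes exactly
// one input split and therefore one partition of the resulting RDD.
class NonSplittableTextInputFormat extends TextInputFormat {
  override def isSplitable(context: JobContext, file: Path): Boolean = false
}

object NonSplitLoad {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("non-split-load"))

    val rdd = sc.newAPIHadoopFile(
      "hdfs:///data/input/*",
      classOf[NonSplittableTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // Each partition now corresponds to one whole file.
    println(s"Partitions: ${rdd.partitions.length} (should equal the number of files)")
    sc.stop()
  }
}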

 

From: Shuai Zheng [mailto:szheng.c...@gmail.com] 
Sent: Wednesday, December 17, 2014 4:16 AM
To: user@spark.apache.org
Subject: Control default partition when load a RDD from HDFS

 

Hi All,

 

My application loads 1000 files, each from 200 MB to a few GB, and combines
them with other data to do calculations. 

Some pre-calculation must be done at the individual file level; after that, the
results need to be combined for further calculation. 

In Hadoop, this is simple because I can turn off file splitting in the input
format (to enforce that each file goes to the same mapper), then do the
file-level calculation in the mapper and pass the result to the reducer. But
how can I do this in Spark?

Basically, I want to make sure that after I load these files into an RDD, it is
partitioned by file (files are neither split nor merged), so I can call
mapPartitions. Is there any way I can control the default partitioning when I
load the RDD? 
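
Something along these lines is what I have in mind (loadPartitionedByFile is only a stand-in for whatever loading gives one file per partition, and the counting is a placeholder calculation):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FileLevelPreCalc {

  // Stand-in only: the real version would load via a non-splittable input
  // format so that each partition holds exactly one whole file.
  def loadPartitionedByFile(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("file-level-precalc"))

    // Hypothetical "other input" the per-file calculation depends on,
    // shipped to every partition as a broadcast variable.
    val otherInput = sc.broadcast(Map("threshold" -> 100))

    val rdd = loadPartitionedByFile(sc, "hdfs:///data/input/*")

    // mapPartitions runs once per partition, i.e. once per file under the
    // assumption above, so the pre-calculation can see the whole file.
    val preCalculated = rdd.mapPartitions { lines =>
      val threshold = otherInput.value("threshold")
      Iterator(lines.count(_.length > threshold))
    }

    // Combine the per-file results for the further calculation.
    println(s"Combined result: ${preCalculated.reduce(_ + _)}")
    sc.stop()
  }
}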

This might be Spark's default behavior (partitioning by file when the RDD is
first loaded), but I can't find any documentation to support my guess. If it is
not, can I enforce this kind of partitioning? Because the total file size is
large, I don't want to repartition in the code. 

 

Regards,

 

Shuai

 
