Hi Imran,

Maybe you can try to implement your own InputFormat and InputSplit to control 
your own partition and read strategy, Spark supports custom InputFormat in 
HadoopRDD.

Thanks
Jerry
From: Imran Rashid [mailto:im...@quantifind.com]
Sent: Tuesday, January 28, 2014 3:58 PM
To: user@spark.incubator.apache.org
Subject: setting partitioners with hadoop rdds

Hi,

I'm trying to figure out how to get partitioners to work correctly with hadoop 
rdds, so that I can get narrow dependencies & avoid shuffling.  I feel like I 
must be missing something obvious.
I can create an RDD with a parititioner of my choosing, shuffle it and then 
save it out to hdfs.  But I can't figure out how to get it to still have that 
partitioner after I read it back in from hdfs.  HadoopRDD always has the 
partitioner set to None, and there isn't any way for me to change it.
the reason I care is b/c if I can set the partitioner, then there would be a 
narrow dependency, so I can avoid a shuffle.  I have a big data set I'm saving 
on hdfs.  Then some time later, in a totally independent spark context, I read 
a little more data in, shuffle it w/ the same partitioner, and then want to 
join it to the previous data that was sitting on hdfs.
I guess this can't be done in general, since you don't have any guarantees on 
the how the file was saved in hdfs.  But, it still seems like there ought to be 
a way to do this, even if I need to enforce safety at the application level.
sorry if I'm missing something obvious ...

thanks,
Imran

Reply via email to