Hey Imran,

You'd probably have to create a subclass of HadoopRDD to do this, or an RDD 
that wraps around the HadoopRDD. It would be a cool feature, but HDFS itself 
has no information about partitioning, so your application needs to track it.
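
For example, something along these lines (a rough sketch against the current 
RDD API; the class name and the sanity check are just illustrative, and Spark 
will trust whatever partitioner you assert, so the data on disk really has to 
be laid out that way):

    import org.apache.spark.{Partition, Partitioner, TaskContext}
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Wraps an RDD read back from HDFS and re-asserts a partitioner that
    // Spark cannot verify. Downstream joins/cogroups keyed by the same
    // partitioner then get narrow dependencies instead of a shuffle.
    class AssumePartitionedRDD[K: ClassTag, V: ClassTag](
        prev: RDD[(K, V)],
        part: Partitioner)
      extends RDD[(K, V)](prev) {

      // Catches the most obvious mismatch; it cannot check where the keys
      // actually live, so correctness is on the application.
      require(part.numPartitions == prev.partitions.length,
        "partitioner must match the number of partitions read from HDFS")

      override val partitioner = Some(part)

      override protected def getPartitions: Array[Partition] =
        firstParent[(K, V)].partitions

      override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
        firstParent[(K, V)].iterator(split, context)
    }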

Matei

On Jan 27, 2014, at 11:57 PM, Imran Rashid <im...@quantifind.com> wrote:

> Hi,
> 
> 
> I'm trying to figure out how to get partitioners to work correctly with 
> Hadoop RDDs, so that I can get narrow dependencies and avoid shuffling. I 
> feel like I must be missing something obvious.
> 
> I can create an RDD with a partitioner of my choosing, shuffle it, and then 
> save it out to HDFS. But I can't figure out how to get it to still have that 
> partitioner after I read it back in from HDFS. HadoopRDD always has the 
> partitioner set to None, and there isn't any way for me to change it.
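> 
> For example (a sketch; the path, the value type, and the partitioner choice 
> are made up):
> 
>     import org.apache.spark.HashPartitioner
>     import org.apache.spark.SparkContext._   // pair-RDD functions like partitionBy
> 
>     val part = new HashPartitioner(16)
>     val big = sc.parallelize(1 to 1000000).map(i => (i, i)).partitionBy(part)
>     big.saveAsObjectFile("hdfs:///tmp/big")   // laid out by `part`, but that fact isn't recorded
> 
>     val reloaded = sc.objectFile[(Int, Int)]("hdfs:///tmp/big")
>     assert(reloaded.partitioner == None)      // and there's no way to set it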
> 
> The reason I care is because if I can set the partitioner, then there would 
> be a narrow dependency, so I can avoid a shuffle. I have a big data set I'm 
> saving on HDFS. Then some time later, in a totally independent Spark 
> context, I read a little more data in, shuffle it with the same partitioner, 
> and then want to join it to the previous data that was sitting on HDFS.
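> 
> Continuing the sketch above (moreRecords is hypothetical), this is the step 
> I want to make narrow:
> 
>     val recent = sc.parallelize(moreRecords).partitionBy(part)  // one shuffle, fine
>     val joined = reloaded.join(recent)   // today this reshuffles `reloaded` too,
>                                          // because its partitioner is None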
> 
> I guess this can't be done in general, since you don't have any guarantees 
> about how the file was saved in HDFS. But it still seems like there ought to 
> be a way to do this, even if I need to enforce safety at the application 
> level.
> 
> Sorry if I'm missing something obvious ...
> 
> thanks,
> Imran
