Hey Imran,

You probably have to create a subclass of HadoopRDD to do this, or some RDD that wraps around the HadoopRDD. It would be a cool feature, but HDFS itself has no information about partitioning, so your application needs to track it.
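Something along these lines might work (an untested sketch; the class name is mine). It delegates everything to the wrapped RDD and only overrides partitioner, which is safe only if your application guarantees that each saved file comes back as exactly one partition, in the original order:

import scala.reflect.ClassTag

import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: passes every record through unchanged but claims
// the given partitioner. Spark cannot verify the claim, so it is only
// correct if the data really was written out with `part` and each saved
// file is read back as exactly one partition, in the same order.
class AssumePartitionedRDD[K: ClassTag, V: ClassTag](
    parent: RDD[(K, V)],
    part: Partitioner)
  extends RDD[(K, V)](parent) {

  require(parent.partitions.length == part.numPartitions,
    "partition count must match the claimed partitioner")

  // This is the whole point: report a partitioner instead of None.
  override val partitioner: Option[Partitioner] = Some(part)

  override protected def getPartitions: Array[Partition] = parent.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    parent.iterator(split, context)
}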
Matei

On Jan 27, 2014, at 11:57 PM, Imran Rashid <im...@quantifind.com> wrote:

> Hi,
>
> I'm trying to figure out how to get partitioners to work correctly with
> Hadoop RDDs, so that I can get narrow dependencies and avoid shuffling. I
> feel like I must be missing something obvious.
>
> I can create an RDD with a partitioner of my choosing, shuffle it, and then
> save it out to HDFS. But I can't figure out how to get it to still have
> that partitioner after I read it back in from HDFS. HadoopRDD always has
> the partitioner set to None, and there isn't any way for me to change it.
>
> The reason I care is because if I can set the partitioner, then there would
> be a narrow dependency, so I can avoid a shuffle. I have a big data set I'm
> saving on HDFS. Then some time later, in a totally independent Spark
> context, I read a little more data in, shuffle it with the same
> partitioner, and then want to join it to the previous data that was sitting
> on HDFS.
>
> I guess this can't be done in general, since you don't have any guarantees
> on how the file was saved in HDFS. But it still seems like there ought to
> be a way to do this, even if I need to enforce safety at the application
> level.
>
> Sorry if I'm missing something obvious...
>
> Thanks,
> Imran
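For concreteness, the workflow described above might look roughly like this in application code (a sketch with made-up paths and partition counts, using the hypothetical AssumePartitionedRDD wrapper from the reply):

import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._

object PartitionedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "partitioned-join-sketch")
    val part = new HashPartitioner(4)

    // Job 1 (conceptually a separate application): shuffle the big data set
    // with a known partitioner and save it, so that file part-0000N holds
    // exactly the keys of partition N.
    val big = sc.parallelize(1 to 100000).map(i => (i.toString, i))
    big.partitionBy(part).saveAsSequenceFile("/tmp/big-partitioned")

    // Job 2: read it back. The HadoopRDD reports partitioner = None, and in
    // general the files could be re-split or listed in a different order, so
    // the application must rule that out (e.g. one unsplittable file per
    // partition) before re-asserting the layout with the wrapper.
    val loaded = new AssumePartitionedRDD(
      sc.sequenceFile[String, Int]("/tmp/big-partitioned"), part)

    // A little new data, shuffled with the same partitioner.
    val small = sc.parallelize(Seq(("42", 1), ("99", 2))).partitionBy(part)

    // Both sides now report the same partitioner, so the join is a narrow
    // dependency and the big side is not shuffled again.
    val joined = loaded.join(small)
    println(joined.toDebugString)
    println(joined.count())

    sc.stop()
  }
}

Note that if the input format splits a file across HDFS blocks, or lists the files in a different order than they were written, the wrapper's claim is silently wrong and the join produces incorrect results, which is why the safety check has to live at the application level.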