https://issues.apache.org/jira/browse/SPARK-1061
Note that the proposed fix isn't to have Spark automatically know about the partitioner when it reloads the data, but at least to make it *possible* for that to be done at the application level.

On Fri, Apr 17, 2015 at 11:35 AM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:

> I have a huge RDD[Document] with millions of items. I partitioned it
> using HashPartitioner and saved it as an object file. But when I load the
> object file back into an RDD, I lose the HashPartitioner. How do I preserve
> the partitions when loading the object file?
>
> Here is the code:
>
> val docVectors: RDD[DocVector] = computeRdd() // expensive calculation
>
> val partitionedDocVectors: RDD[(String, DocVector)] =
>   docVectors.keyBy(d => d.id).partitionBy(new HashPartitioner(16))
> partitionedDocVectors.saveAsObjectFile("c:/temp/partitionedDocVectors.obj")
>
> // At this point, I check the folder c:/temp/partitionedDocVectors.obj;
> // it contains 16 parts: “part-00000, part-00001, … part-00015”
>
> // Now load the object file back
> val partitionedDocVectors2: RDD[(String, DocVector)] =
>   sc.objectFile("c:/temp/partitionedDocVectors.obj")
>
> // Now partitionedDocVectors2 contains 956 partitions and it has no partitioner
>
> println(s"partitions: ${partitionedDocVectors2.partitions.size}")
> // prints 956
> if (partitionedDocVectors2.partitioner.isEmpty) println("No partitioner")
> // it does print out this line
>
> So how can I preserve the partitions of partitionedDocVectors on disk so
> I can load it back?
>
> Ningjun
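
To make "done at the application level" concrete, here is a rough sketch of one way an application might re-attach a partitioner to a reloaded RDD. This is not the change proposed in the JIRA and not an existing Spark API: `AssumedPartitionedRDD` and `isConsistent` are hypothetical names invented for illustration. The wrapper only *asserts* a partitioner; it does not move any data. It therefore assumes the reloaded RDD has exactly one partition per saved part file, in the original order, which is not what `sc.objectFile` produced in the message above (956 partitions instead of 16). Keeping the part files from being split on load is a separate problem that this sketch does not solve.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper (not part of Spark): passes the parent's partitions
// through unchanged but reports `part` as the partitioner.
// The caller is responsible for the claim being true: every record in
// partition i must satisfy part.getPartition(key) == i.
class AssumedPartitionedRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    part: Partitioner)
  extends RDD[(K, V)](prev) {

  // Cheap sanity check: fails fast if the input was split or merged on load.
  require(
    prev.partitions.length == part.numPartitions,
    s"expected ${part.numPartitions} partitions, found ${prev.partitions.length}")

  override val partitioner: Option[Partitioner] = Some(part)

  // Reuse the parent's partitions one-to-one.
  override protected def getPartitions: Array[Partition] =
    firstParent[(K, V)].partitions

  // Read each partition straight from the parent, without a shuffle.
  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    firstParent[(K, V)].iterator(split, context)
}

// Hypothetical helper: optional full scan that checks every key really
// lives in the partition the partitioner would assign it to.
def isConsistent[K, V](rdd: RDD[(K, V)], part: Partitioner): Boolean =
  rdd.mapPartitionsWithIndex { (index, records) =>
    Iterator(records.forall { case (key, _) => part.getPartition(key) == index })
  }.collect().forall(identity)
```

With the names from the quoted message, usage would look like `new AssumedPartitionedRDD(partitionedDocVectors2, new HashPartitioner(16))`, ideally after confirming `isConsistent` on the loaded RDD once. If the assertion is wrong, operations that rely on the partitioner (joins, `lookup`, `reduceByKey`) can silently return incorrect results, which is why the sketch checks the partition count up front.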