https://issues.apache.org/jira/browse/SPARK-1061

Note the proposed fix isn't to have Spark automatically know about the
partitioner when it reloads the data, but at least to make it *possible*
to do at the application level.
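In the meantime the straightforward workaround is to re-apply the
partitioner after loading, at the cost of one full shuffle (sc, DocVector,
and the path here are taken from your code below):

  val reloaded: RDD[(String, DocVector)] =
    sc.objectFile[(String, DocVector)]("c:/temp/partitionedDocVectors.obj")
  val repartitioned = reloaded.partitionBy(new HashPartitioner(16))

And as a rough sketch of what the application-level fix could look like
(the class and its name are just my illustration, not a Spark API): a thin
RDD wrapper that asserts, without any verification, that the loaded data is
already laid out by the given partitioner. That is only safe when each
saved part file maps to exactly one loaded partition, which sc.objectFile
does not guarantee -- closing that gap is what SPARK-1061 is about.

  import scala.reflect.ClassTag
  import org.apache.spark.{Partition, Partitioner, TaskContext}
  import org.apache.spark.rdd.RDD

  // Illustration only: declare that `prev` is already partitioned by
  // `part`. Spark never checks the claim; the caller must know the
  // on-disk layout really matches.
  class AssumedPartitionedRDD[K, V](
      prev: RDD[(K, V)],
      part: Partitioner)(implicit ct: ClassTag[(K, V)])
    extends RDD[(K, V)](prev) {

    require(part.numPartitions == prev.partitions.length,
      s"partitioner has ${part.numPartitions} partitions, " +
        s"but the RDD has ${prev.partitions.length}")

    // exposing the partitioner is the whole point: downstream joins and
    // lookups on this RDD can now skip the shuffle
    override val partitioner: Option[Partitioner] = Some(part)

    // one-to-one over the parent: same splits, same data, no movement
    override protected def getPartitions: Array[Partition] = prev.partitions

    override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
      prev.iterator(split, context)
  }

With that, new AssumedPartitionedRDD(reloaded, new HashPartitioner(16))
hands back the same data with its partitioner set, without moving anything.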

On Fri, Apr 17, 2015 at 11:35 AM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

>  I have a huge RDD[Document] with millions of items. I partitioned it
> using HashPartitioner and saved it as an object file. But when I load the
> object file back into an RDD, I lose the HashPartitioner. How do I
> preserve the partitioning when loading the object file?
>
>
>
> Here is the code
>
>
>
> val docVectors: RDD[DocVector] = computeRdd() // expensive calculation
>
>
>
> val partitionedDocVectors: RDD[(String, DocVector)] = docVectors
>   .keyBy(d => d.id)
>   .partitionBy(new HashPartitioner(16))
> partitionedDocVectors.saveAsObjectFile("c:/temp/partitionedDocVectors.obj")
>
> // At this point I check the folder c:/temp/partitionedDocVectors.obj;
> // it contains 16 parts: part-00000, part-00001, ... part-00015
>
>
>
> // Now load the object file back
> val partitionedDocVectors2: RDD[(String, DocVector)] =
>   sc.objectFile("c:/temp/partitionedDocVectors.obj")
>
> // Now partitionedDocVectors2 contains 956 partitions and has no partitioner
>
>
> *println*(*s"partitions: **$*{partitionedDocVectors.partitions.size}*"*)
> // return 956
> *if *(idAndDocVectors.partitioner.isEmpty) *println*(*"No partitioner"*)
> // it does print out this line
>
>
>
> So how can I preserve the partitioning of partitionedDocVectors on disk so
> that I can load it back?
>
>
>
> Ningjun
>
>
>
