
How to persist RDD return from partitionBy() to disk?

2015-04-17 Thread Wang, Ningjun (LNG-NPV)

I have a huge RDD[Document] with millions of items. I partitioned it using HashPartitioner and saved it as an object file. But when I load the object file back into an RDD, the HashPartitioner is lost. How do I preserve the partitioning when loading the object file? Here is the code: val docVectors : RDD[D

Re: How to persist RDD return from partitionBy() to disk?

2015-04-17 Thread Imran Rashid

https://issues.apache.org/jira/browse/SPARK-1061 Note that the proposed fix isn't to have Spark automatically know about the partitioner when it reloads the data, but at least to make it *possible* to do that at the application level. On Fri, Apr 17, 2015 at 11:35 AM, Wang, Ningjun (LNG-NPV) <
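
For readers landing on this thread, here is a minimal sketch of the behaviour being discussed and of the simplest application-level workaround: re-applying the same partitioner after loading. This is not the fix proposed in SPARK-1061 and not the original poster's code (which is cut off above). Re-applying partitionBy() triggers a full shuffle, so it restores the partitioner but does not avoid the cost. The object name, path, key/value types, and partition count are all hypothetical, and the sketch assumes Spark 1.3 or later running in local mode.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; names, path, and sizes are illustrative only.
object RepartitionAfterLoad {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-after-load").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val path = "/tmp/docs-by-id"  // hypothetical output path; must not already exist
    val numPartitions = 4         // hypothetical; use the same value when saving and loading

    // Stand-in for the real RDD: partitionBy() needs a key/value RDD.
    val docs: RDD[(Long, String)] =
      sc.parallelize(Seq(1L -> "doc a", 2L -> "doc b", 3L -> "doc c"))

    // Partition by key and write to disk as an object file.
    docs.partitionBy(new HashPartitioner(numPartitions)).saveAsObjectFile(path)

    // The partitioner is not stored in the file, so the reloaded RDD has none.
    val loaded: RDD[(Long, String)] = sc.objectFile[(Long, String)](path)
    println(s"partitioner after load: ${loaded.partitioner}")

    // Workaround: re-apply the same partitioner. This shuffles the data again.
    val restored = loaded.partitionBy(new HashPartitioner(numPartitions))
    println(s"partitioner after partitionBy: ${restored.partitioner}")

    sc.stop()
  }
}
```

If the goal is to avoid the second shuffle entirely, this sketch does not help; that case is what the JIRA issue linked above is about.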