Is there any other way to solve the problem? Let me state the use case: I have an RDD[Document] containing over 7 million items. The RDD needs to be saved to persistent storage (currently I save it as an object file on disk). Then I need to get a small random sample of Document objects (e.g. 10,000 documents). How can I do this quickly? The rdd.sample() method does not help because it needs to read the entire RDD of 7 million Documents from disk, which takes a very long time.
Ningjun

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Tuesday, May 19, 2015 4:51 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow

The way these files are accessed is inherently sequential-access. There isn't a way, in general, to know where record N is in a file like this and jump to it. So they must be read to be sampled.

On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:

Hi

I have an RDD[Document] that contains 7 million objects, and it is saved in the file system as an object file. I want to get a random sample of about 70 objects from it using the rdd.sample() method. It is very slow:

    val rdd: RDD[Document] =
      sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.00001D, 0L).cache()
    val count = rdd.count()

From the Spark UI, I see Spark is trying to read the entire object file at the folder "C:/temp/docs.obj", which is about 29.7 GB. Of course this is very slow. Why does Spark try to read all 7 million objects when I only need a random sample of 70? Is there any efficient way to get a random sample of 70 objects without reading through the entire object file?

Ningjun
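Following Sean's point that a sequential scan is unavoidable, one common pattern (a minimal sketch, not proposed in the thread itself) is to pay for the full pass once, persist the sampled RDD, and have every later run read only the small sampled file. The output path is made up for illustration.

    import org.apache.spark.rdd.RDD

    // One-time full pass: sample roughly 10,000 of the 7 million documents
    // and save the result so later runs read only the small file.
    val full: RDD[Document] = sc.objectFile[Document]("C:/temp/docs.obj")
    val fraction = 10000.0 / 7000000  // ~0.0014, targets roughly 10,000 documents
    full.sample(withReplacement = false, fraction = fraction, seed = 0L)
        .saveAsObjectFile("C:/temp/docs-sample.obj")  // hypothetical output path

    // Cheap on every subsequent run:
    val sample: RDD[Document] = sc.objectFile[Document]("C:/temp/docs-sample.obj")

The trade-off is that the sample is fixed at write time; drawing a fresh sample still requires another full pass.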