Is there any other way to solve this problem? Let me restate the use case.

I have an RDD[Document] containing over 7 million items. The RDD needs to be saved 
to persistent storage (currently I save it as an object file on disk). Then I 
need to get a small random sample of Document objects (e.g. 10,000 documents). 
How can I do this quickly? The rdd.sample() method does not help because it 
needs to read the entire RDD of 7 million Documents from disk, which takes a 
very long time.
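The only workaround I can think of is to pay the full scan once, persist just the 
sample, and then reload only that small file in later jobs. A rough sketch (the 
sample path "C:/temp/docs-sample.obj" is just a placeholder name I made up):

import org.apache.spark.rdd.RDD

val full: RDD[Document] = sc.objectFile[Document]("C:/temp/docs.obj")

// One full pass over the 7 million documents: takeSample collects
// roughly 10,000 of them to the driver.
val sampled: Array[Document] =
  full.takeSample(withReplacement = false, num = 10000, seed = 0L)

// Persist the sample so later jobs never touch the 29.7 GB file again.
sc.parallelize(sampled).saveAsObjectFile("C:/temp/docs-sample.obj")

// Reloading the sample afterwards is fast because the file is tiny.
val sample: RDD[Document] = sc.objectFile[Document]("C:/temp/docs-sample.obj")

But that still costs one full read up front, so I am hoping there is a better way.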

Ningjun

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Tuesday, May 19, 2015 4:51 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow

The way these files are accessed is inherently sequential. In general there is no 
way to know where record N is in a file like this and jump to it, so the records 
must be read in order to be sampled.
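Conceptually, sampling a fraction is just a per-record coin flip over the whole 
dataset, something like this rough sketch (not Spark's actual implementation):

import scala.util.Random

// Every record in the iterator is consumed; the coin flip only decides
// whether it is kept, not whether it is read.
def bernoulliSample[T](records: Iterator[T], fraction: Double, seed: Long): Iterator[T] = {
  val rng = new Random(seed)
  records.filter(_ => rng.nextDouble() < fraction)
}

So even a tiny fraction still means one pass over all the input.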


On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) 
<ningjun.w...@lexisnexis.com> wrote:
Hi

I have an RDD[Document] that contains 7 million objects, saved on the file system 
as an object file. I want to get a random sample of about 70 objects from it 
using the rdd.sample() method, but it is very slow:


val rdd: RDD[Document] =
  sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.00001D, 0L).cache()
val count = rdd.count()

From the Spark UI, I see that Spark is trying to read all of the object files in 
the folder "C:/temp/docs.obj", which is about 29.7 GB. Of course this is very 
slow. Why does Spark read all 7 million objects when I only need a random sample 
of 70?

Is there an efficient way to get a random sample of 70 objects without reading 
through the entire set of object files?
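Would something like the following work? saveAsObjectFile splits the output into 
many part files, so reading only a few of them with a glob pattern should scan 
much less data. This is just a sketch: the part-file pattern is a guess at how 
the output is named, and the result is only approximately random because it draws 
from a subset of partitions.

// Read only a subset of the part files instead of the whole folder.
val partial: RDD[Document] = sc.objectFile[Document]("C:/temp/docs.obj/part-0000*")

// Draw the 70 documents from that smaller subset.
val sample70: Array[Document] =
  partial.takeSample(withReplacement = false, num = 70, seed = 0L)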

Ningjun

