Hi

I have an RDD[Document] with 7 million objects, saved to the file system as an
object file. I want to get a random sample of about 70 objects from it using the
rdd.sample() method, but it is very slow:


import org.apache.spark.rdd.RDD

// Sample at fraction 0.00001: roughly 70 of the 7 million objects
val rdd: RDD[Document] =
  sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.00001D, 0L).cache()
val count = rdd.count()   // forces the sample to actually be computed

From the Spark UI, I see that Spark reads the entire object file in the folder
"C:/temp/docs.obj", which is about 29.7 GB. Of course this is very slow. Why does
Spark read all 7 million objects when I only need a random sample of 70 of them?
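
My understanding (please correct me if this is wrong) is that sample(false, fraction)
does a per-element coin flip, roughly like the filter sketched below, which would
explain why every object has to be read and deserialized:

    // Rough equivalent of sample(false, 0.00001) as I understand it:
    // a Bernoulli trial per element, so the whole RDD must be scanned
    val approx = sc.objectFile[Document]("C:/temp/docs.obj")
      .filter(_ => scala.util.Random.nextDouble() < 0.00001)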

Is there an efficient way to get a random sample of 70 objects without reading
through the entire object file?
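
One workaround I am considering is to read only a few of the saved part-files with a
glob path and sample from those, though I am not sure this gives a uniform sample over
the whole data set (the part-file names below are a guess at how saveAsObjectFile lays
out the output directory):

    // Read just two part-files instead of the whole 29.7 GB directory
    val partial = sc.objectFile[Document]("C:/temp/docs.obj/part-0000[0-1]")
    // takeSample materializes exactly 70 objects as an Array[Document]
    val docs = partial.takeSample(withReplacement = false, num = 70, seed = 0L)

Would something like this be reasonable, or is there a better way?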

Ningjun
