Hi
I have an RDD[Document] containing 7 million objects, saved to the file
system as an object file. I want to draw a random sample of about 70 objects
from it using the rdd.sample() method, but it is very slow:
val rdd : RDD[Document] =
sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.00001D, 0L).cache()
val count = rdd.count()
From the Spark UI, I see that Spark is trying to read the entire set of object
files in the folder "C:/temp/docs.obj", which is about 29.7 GB. Of course this
is very slow. Why does Spark read all 7 million objects when I only need a
random sample of 70 objects?
Is there an efficient way to get a random sample of 70 objects without reading
through the entire set of object files?
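For reference, the closest built-in alternative I know of is takeSample, which
returns an exact number of elements rather than a fraction, though as far as I
understand it still scans the whole dataset (it needs a count first). A sketch
of what I mean, assuming the same path and Document class:

```scala
// Sketch only: takeSample returns exactly 70 elements as a local Array,
// but internally it still counts and scans the full RDD.
val docs = sc.objectFile[Document]("C:/temp/docs.obj")
val sampled: Array[Document] =
  docs.takeSample(withReplacement = false, num = 70, seed = 0L)
```

So this avoids the fraction math of sample(false, 0.00001D, ...), but it does
not avoid reading the 29.7 GB, which is really what I am trying to do.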
Ningjun