Hi, I have an RDD[Document] containing 7 million objects, saved to the file system as an object file. I want to get a random sample of about 70 objects from it using the rdd.sample() method, but it is very slow:
val rdd: RDD[Document] =
  sc.objectFile[Document]("C:/temp/docs.obj")
    .sample(false, 0.00001D, 0L)
    .cache()
val count = rdd.count()

From the Spark UI, I see that Spark tries to read the entire object file in the folder "C:/temp/docs.obj", which is about 29.7 GB. Of course this is very slow. Why does Spark read through all 7 million objects when I only need a random sample of about 70? Is there any efficient way to get a random sample of 70 objects without reading through the entire object file?

Ningjun
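P.S. My rough understanding of why the full scan happens (please correct me if this is wrong): sample() without replacement appears to do an independent Bernoulli draw for each element, so every record still has to be read and deserialized even though almost all of them are discarded. A plain-Python sketch of that model (this is just an illustration of the idea, not Spark's actual code):

```python
import random

def bernoulli_sample(items, fraction, seed=0):
    # Model of RDD.sample(withReplacement=False, fraction, seed):
    # every element is visited and kept with probability `fraction`,
    # so the whole dataset is scanned even when few elements survive.
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]

# 7 million "documents", fraction 0.00001 -> about 70 survivors in expectation
docs = range(7_000_000)
sampled = bernoulli_sample(docs, 0.00001)
print(len(sampled))
```

Note that the loop touches all 7,000,000 elements regardless of how small the fraction is, which would explain the 29.7 GB read I am seeing.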