Hi, I have an RDD[Document] containing 7 million objects, saved to the file system as an object file. I want to get a random sample of about 70 objects from it using the rdd.sample() method, but it is very slow:
val rdd: RDD[Document] =
  sc.objectFile[Document]("C:/temp/docs.obj")
    .sample(false, 0.00001D, 0L)
    .cache()
val count = rdd.count()

From the Spark UI, I see that Spark tries to read the entire object file in the folder "C:/temp/docs.obj", which is about 29.7 GB. Of course this is very slow. Why does Spark read through all 7 million objects when I only need a random sample of about 70? Is there any efficient way to get a random sample of 70 objects without reading through the entire object file?

Ningjun
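P.S. My rough understanding of why the full scan happens (please correct me if this is wrong): sample() without replacement appears to do an independent Bernoulli draw for each element, so every record still has to be read and deserialized even though almost all of them are discarded. A plain-Python sketch of that model (this is just an illustration of the idea, not Spark's actual code):

```python
import random

def bernoulli_sample(items, fraction, seed=0):
    # Model of RDD.sample(withReplacement=False, fraction, seed):
    # every element is visited and kept with probability `fraction`,
    # so the whole dataset is scanned even when few elements survive.
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]

# 7 million "documents", fraction 0.00001 -> about 70 survivors in expectation
docs = range(7_000_000)
sampled = bernoulli_sample(docs, 0.00001)
print(len(sampled))
```

Note that the loop touches all 7,000,000 elements regardless of how small the fraction is, which would explain the 29.7 GB read I am seeing.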