Access to these files is inherently sequential. In general there is no way to know where record N sits in a file like this and jump to it, so the data must be read in full to be sampled.
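One way to avoid scanning everything, if the data was written with saveAsObjectFile, is to read only a few of the part files it produced and sample from those. A minimal sketch (the path and part-file glob are assumptions based on your question, not tested against your data):

    // Hypothetical sketch: saveAsObjectFile writes part-00000, part-00001, ...
    // under the output directory, and sc.objectFile is backed by Hadoop input
    // formats, so it accepts glob paths. Reading a handful of part files
    // bounds the I/O instead of scanning the whole 29.7 GB.
    val fewParts = sc.objectFile[Document]("C:/temp/docs.obj/part-0000[0-2]")
    // takeSample still reads whatever it is given, but here that is only
    // three part files rather than the entire dataset.
    val sampled = fewParts.takeSample(withReplacement = false, num = 70, seed = 0L)

The caveat is that this samples only from those part files, so it is only an unbiased sample of the whole dataset if records were distributed randomly across partitions when the file was written.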
On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:

> Hi
>
> I have an RDD[Document] that contains 7 million objects, and it is saved in
> the file system as an object file. I want to get a random sample of about 70
> objects from it using the rdd.sample() method. It is very slow:
>
>     val rdd: RDD[Document] =
>       sc.objectFile[Document]("C:/temp/docs.obj")
>         .sample(false, 0.00001D, 0L)
>         .cache()
>     val count = rdd.count()
>
> From the Spark UI, I see Spark is trying to read the entire object file in the
> folder "C:/temp/docs.obj", which is about 29.7 GB. Of course this is very
> slow. Why does Spark try to read all 7 million objects when I only need
> a random sample of 70 objects?
>
> Is there any efficient way to get a random sample of 70 objects without
> reading through the entire object file?
>
> Ningjun