Access to these files is inherently sequential. In general there is no way to
know where record N sits in a file like this and jump straight to it, so the
data has to be read in order to be sampled.
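
If a statistically perfect sample isn't required, one workaround is to read
only a random handful of the part files that make up the object file and
sample from those, instead of scanning all 29.7 GB. A rough sketch, assuming
the directory has the usual part-NNNNN layout, that Document is your own
class, and that the path from your mail is reachable from the driver:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.util.Random

    // List the part files inside the object-file directory.
    val dir = new Path("C:/temp/docs.obj")
    val fs = dir.getFileSystem(sc.hadoopConfiguration)
    val partPaths = fs.listStatus(dir)
      .map(_.getPath)
      .filter(_.getName.startsWith("part-"))
      .map(_.toString)

    // Load only a few randomly chosen part files (comma-separated paths
    // are accepted by the underlying Hadoop input format).
    val chosen = Random.shuffle(partPaths.toSeq).take(3)
    val subset = sc.objectFile[Document](chosen.mkString(","))

    // Draw exactly 70 documents from that much smaller subset.
    val sample = subset.takeSample(withReplacement = false, num = 70, seed = 0L)

Note the trade-off: the result is only uniform within the chosen part files,
not across all 7 million records. If the records were written in roughly
random order that may be good enough; otherwise the full scan that
rdd.sample() performs is the price of a truly uniform sample.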


On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

>  Hi
>
>
>
> I have an RDD[Document] that contains 7 million objects and it is saved in
> file system as object file. I want to get a random sample of about 70
> objects from it using the rdd.sample() method. It is very slow.
>
>
>
>
>
> val rdd : RDD[Document] =
> sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.00001D,
> 0L).cache()
>
> val count = rdd.count()
>
>
>
> From the Spark UI, I see that Spark tries to read the entire object file in
> the folder “C:/temp/docs.obj”, which is about 29.7 GB. Of course this is very
> slow. Why does Spark read all 7 million objects when I only need to
> return a random sample of 70 objects?
>
>
>
> Is there any efficient way to get a random sample of 70 objects without
> reading through the entire object file?
>
>
>
> Ningjun
>
>
>