I guess the fundamental issue is that these aren't stored in a way
that allows random access to a Document.

Underneath, Hadoop has a concept of a MapFile, which is like a
SequenceFile plus an index of offsets to where records begin in the
file. Although Spark doesn't use it, you could maybe create a custom
RDD that takes advantage of this format to grab random elements
efficiently.
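
As a very rough sketch of what that could look like -- assuming the
documents had been written out as a MapFile keyed by a LongWritable
record index with a Text value, which saveAsObjectFile does not do, so
you'd have to produce the MapFile yourself (e.g. via saveAsHadoopFile
with MapFileOutputFormat); the helper below is hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, MapFile, Text}
import scala.util.Random

// Hypothetical: random lookups against a MapFile of (record index -> doc).
// MapFile.Reader uses the side index to seek near the requested key, so each
// get() reads only a small slice of the data instead of scanning the file.
def sampleFromMapFile(dir: String, totalRecords: Long, n: Int): Seq[String] = {
  val reader = new MapFile.Reader(new Path(dir), new Configuration())
  try {
    val value = new Text()
    Seq.fill(n) {
      val i = (Random.nextDouble() * totalRecords).toLong
      reader.get(new LongWritable(i), value)  // index-assisted seek, no full scan
      value.toString
    }
  } finally {
    reader.close()
  }
}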

Other approaches come to mind, but I think they're all slower -- like
hashing all the docs and taking the n smallest hashes in each of k
partitions, which gives you a pretty uniform random sample of about
k*n docs in a single pass.
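
Something like the sketch below (the element type and hash function are
just placeholders; this still reads everything once, but avoids a
shuffle and a second pass):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
import scala.util.hashing.MurmurHash3

// Sketch of the hashing idea: each partition keeps only its n records with
// the smallest hash values. Since the hashes are roughly uniform, the union
// over the k partitions is an approximately uniform sample of ~k*n docs.
def approxSample[T: ClassTag](docs: RDD[T], nPerPartition: Int): Array[T] = {
  docs.mapPartitions { iter =>
    iter.map(d => (MurmurHash3.stringHash(d.toString), d))
      .toSeq                    // a bounded heap would avoid materializing this
      .sortBy(_._1)
      .take(nPerPartition)
      .map(_._2)
      .iterator
  }.collect()
}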


On Thu, May 21, 2015 at 4:04 PM, Wang, Ningjun (LNG-NPV)
<ningjun.w...@lexisnexis.com> wrote:
> Is there any other way to solve the problem? Let me state the use case:
>
>
>
> I have an RDD[Document] that contains over 7 million items. The RDD needs to
> be saved to persistent storage (currently I save it as an object file on
> disk). Then I need to get a small random sample of Document objects (e.g.
> 10,000 documents). How can I do this quickly? The rdd.sample() method does
> not help because it needs to read the entire RDD of 7 million Documents from
> disk, which takes a very long time.
>
>
>
> Ningjun
>
>
>
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Tuesday, May 19, 2015 4:51 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: Re: rdd.sample() methods very slow
>
>
>
> Access to these files is inherently sequential. There isn't, in general, a
> way to know where record N is in a file like this and jump to it. So they
> must be read to be sampled.
>
>
>
>
>
> On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV)
> <ningjun.w...@lexisnexis.com> wrote:
>
> Hi
>
>
>
> I have an RDD[Document] that contains 7 million objects and is saved in the
> file system as an object file. I want to get a random sample of about 70
> objects from it using the rdd.sample() method. It is very slow.
>
>
>
>
>
> val rdd: RDD[Document] = sc.objectFile[Document]("C:/temp/docs.obj")
>   .sample(false, 0.00001D, 0L)
>   .cache()
>
> val count = rdd.count()
>
>
>
> From the Spark UI, I see Spark is trying to read the entire object file in
> the folder “C:/temp/docs.obj”, which is about 29.7 GB. Of course this is very
> slow. Why does Spark try to read all 7 million objects when I only need a
> random sample of 70 objects?
>
>
>
> Is there an efficient way to get a random sample of 70 objects without
> reading through the entire object file?
>
>
>
> Ningjun
>
>
>
>
