Ah, got it. Then takeSample() is going to do what you want, because it needs a uniform sample and therefore has to compute the whole RDD. If you don’t want any result at all, you can also use RDD.foreach() with an empty function.
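For example, a minimal sketch of the foreach approach (the SparkContext setup and the map/filter transformations here are placeholders, not your actual pipeline):

```scala
import org.apache.spark.SparkContext

// Assumed local setup for illustration only
val sc = new SparkContext("local[4]", "benchmark")

// Stand-in for your real chain of transformations
val transformed = sc.parallelize(1 to 1000000).map(_ * 2).filter(_ % 3 == 0)

val start = System.nanoTime()
// Empty function: forces every partition to be computed,
// but nothing is collected to the driver or written to disk
transformed.foreach(_ => ())
val elapsedMs = (System.nanoTime() - start) / 1e6
println(s"Full computation took $elapsedMs ms")
```

Since foreach is an action with no return value, the timing above reflects the cost of the transformations themselves rather than serialization or I/O.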
Matei

On Dec 5, 2013, at 12:54 PM, Matt Cheah <mch...@palantir.com> wrote:

> Actually, we want the opposite – we want as much data to be computed as
> possible.
>
> It's only for benchmarking purposes, of course.
>
> -Matt Cheah
>
> From: Matei Zaharia <matei.zaha...@gmail.com>
> Reply-To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
> Date: Thursday, December 5, 2013 10:31 AM
> To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
> Cc: Mingyu Kim <m...@palantir.com>
> Subject: Re: takeSample() computation
>
> Hi Matt,
>
> Try using take() instead, which will only begin computing from the start of
> the RDD (first partition) if the number of elements you ask for is small.
>
> Note that if you’re doing any shuffle operations, like groupBy or sort, then
> the stages before that do have to be computed fully.
>
> Matei
>
> On Dec 5, 2013, at 10:13 AM, Matt Cheah <mch...@palantir.com> wrote:
>
>> Hi everyone,
>>
>> I have a question about RDD.takeSample(). This is an action, not a
>> transformation – but is any optimization made to reduce the amount of
>> computation that's done, for example only running the transformations over a
>> smaller subset of the data, since only a sample will be returned as a result?
>>
>> The context is, I'm trying to measure the amount of time a set of
>> transformations takes on our dataset without persisting to disk. So I want
>> to stack the operations on the RDD and then invoke an action that doesn't
>> save the result to disk but can still give me a good idea of how long
>> transforming the whole dataset takes.
>>
>> Thanks,
>>
>> -Matt Cheah