Ah, got it. Then takeSample is going to do what you want, because it has to 
compute the whole RDD in order to draw a uniform sample. If you don’t want any 
result returned at all, you can also use RDD.foreach() with an empty function.
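
To make the difference concrete, here is a toy model in plain Python. It is not
Spark code: ToyRDD, its partitions_computed counter, and make_rdd are
hypothetical names invented for this sketch, and real Spark schedules
partitions differently. It only illustrates the idea that take() with a small
count can stop after the first partition, while foreach() with a no-op function
forces every partition through the transformations.

```python
import time


class ToyRDD:
    """Toy stand-in for a lazily evaluated, partitioned dataset (not Spark's RDD)."""

    def __init__(self, partitions, transform=None):
        self.partitions = partitions                  # list of lists of records
        self.transform = transform or (lambda x: x)   # composed per-record function
        self.partitions_computed = 0                  # how many partitions were evaluated

    def map(self, f):
        # Lazy: only records the function; nothing is computed here.
        prev = self.transform
        return ToyRDD(self.partitions, lambda x: f(prev(x)))

    def _compute(self, index):
        # Apply the whole chain of transformations to one partition.
        self.partitions_computed += 1
        return [self.transform(x) for x in self.partitions[index]]

    def take(self, n):
        # Evaluate partitions in order and stop once n elements are available.
        out = []
        for i in range(len(self.partitions)):
            if len(out) >= n:
                break
            out.extend(self._compute(i))
        return out[:n]

    def foreach(self, f):
        # Evaluate every partition and hand each record to f.
        for i in range(len(self.partitions)):
            for x in self._compute(i):
                f(x)


def make_rdd():
    # 8 partitions of 1000 integers each, with one stacked transformation.
    data = [list(range(p * 1000, (p + 1) * 1000)) for p in range(8)]
    return ToyRDD(data).map(lambda x: x * x)


small = make_rdd()
first_five = small.take(5)          # only the first partition is evaluated

full = make_rdd()
start = time.perf_counter()
full.foreach(lambda _: None)        # no-op action: all partitions are evaluated
elapsed = time.perf_counter() - start

print("take(5) evaluated partitions:", small.partitions_computed)
print("foreach evaluated partitions:", full.partitions_computed)
print("full pass took %.6f s" % elapsed)
```

For benchmarking, the foreach-style measurement is the relevant one, because the
timer brackets an action that touches the whole dataset without returning or
saving anything.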

Matei

On Dec 5, 2013, at 12:54 PM, Matt Cheah <mch...@palantir.com> wrote:

> Actually, we want the opposite – we want as much data to be computed as 
> possible.
> 
> It's only for benchmarking purposes, of course.
> 
> -Matt Cheah
> 
> From: Matei Zaharia <matei.zaha...@gmail.com>
> Reply-To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
> Date: Thursday, December 5, 2013 10:31 AM
> To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
> Cc: Mingyu Kim <m...@palantir.com>
> Subject: Re: takeSample() computation
> 
> Hi Matt,
> 
> Try using take() instead, which will only begin computing from the start of 
> the RDD (first partition) if the number of elements you ask for is small.
> 
> Note that if you’re doing any shuffle operations, like groupBy or sort, then 
> the stages before that do have to be computed fully.
> 
> Matei
> 
> On Dec 5, 2013, at 10:13 AM, Matt Cheah <mch...@palantir.com> wrote:
> 
>> Hi everyone,
>> 
>> I have a question about RDD.takeSample(). This is an action, not a 
>> transformation – but is any optimization made to reduce the amount of 
>> computation that's done, for example only running the transformations over a 
>> smaller subset of the data since only a sample will be returned as a result?
>> 
>> The context is, I'm trying to measure the amount of time a set of 
>> transformations takes on our dataset without persisting to disk. So I want 
>> to stack the operations on the RDD and then invoke an action that doesn't 
>> save the result to disk but can still give me a good idea of how long 
>> transforming the whole dataset takes.
>> 
>> Thanks,
>> 
>> -Matt Cheah
> 
