Spark - launching job for each action

2015-09-06 Thread Priya Ch
Hi All,

 In Spark, each action results in launching a job. Let's say my Spark app
looks like this:

val baseRDD = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
val rdd1 = baseRDD.map(x => x + 2)
val rdd2 = rdd1.filter(x => x % 2 == 0)
val count = rdd2.count
val firstElement = rdd2.first

println("Count is " + count)
println("First is " + firstElement)

Now, rdd2.count launches job0 with 1 task and rdd2.first launches job1
with 1 task. In job1, when calculating rdd2.first, is the entire lineage
computed again, or is rdd2 reused since job0 already computed it?

Thanks,
Padma Ch


Re: Spark - launching job for each action

2015-09-06 Thread ayan guha
Hi

"... Here in job2, when calculating rdd.first..."

If you mean if rdd2.first, then it uses rdd2 already computed by
rdd2.count, because it is already available. If some partitions are not
available due to GC, then only those partitions are recomputed.
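
To make that reuse explicit rather than relying on what happens to still be
available, you can cache rdd2 before the first action (as Jeff suggests
below). A rough sketch with the same toy data from your example, assuming sc
is the usual SparkContext:

val baseRDD = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
val rdd2 = baseRDD.map(x => x + 2).filter(x => x % 2 == 0)

rdd2.cache()                  // mark rdd2 for caching; nothing runs yet

val count = rdd2.count        // job0: computes rdd2 and caches its partitions
val firstElement = rdd2.first // job1: reads the cached partition instead of
                              // recomputing the whole lineage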

On Sun, Sep 6, 2015 at 5:11 PM, Jeff Zhang wrote:

> If you want to reuse the data, you need to call rdd2.cache
>



-- 
Best Regards,
Ayan Guha


Re: Spark - launching job for each action

2015-09-06 Thread Priya Ch
Hi All,

 Thanks for the info. I have one more doubt:
When writing a streaming application, I specify a batch interval. Let's say
the interval is 1 sec: for every 1-sec batch an RDD is formed and a job is
launched. If more than one action is specified on that RDD, how many jobs
would it launch?

I mean, each 1-sec batch launches a job; if there are two actions, are two
jobs launched internally for that batch?
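
To make the question concrete, here is a rough sketch of the kind of code I
mean (the socket source, port, and local master are only for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("two-actions-per-batch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))     // 1-sec batch interval

val lines = ssc.socketTextStream("localhost", 9999)  // illustrative source
val nums = lines.map(_.trim.toInt)

nums.foreachRDD { rdd =>
  val count = rdd.count()      // action 1 on this batch's RDD -> one job
  if (count > 0) {
    val first = rdd.first()    // action 2 on the same RDD -> a second job
    println("Count is " + count + ", first is " + first)
  }
}

ssc.start()
ssc.awaitTermination()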
