Re: why one of Stage is into Skipped section instead of Completed

2015-12-27 Thread Prem Spark
Thank you Silvio for the update.



Re: why one of Stage is into Skipped section instead of Completed

2015-12-26 Thread Silvio Fiorito
Skipped stages result from existing shuffle output of a stage when re-running a 
transformation. The executors will have the output of the stage in their local 
dirs and Spark recognizes that, so rather than re-computing, it will start from 
the following stage. So, this is a good thing in that you’re not re-computing a 
stage. In your case, it looks like there’s already the output of the userreqs 
RDD (reduceByKey) so it doesn’t re-compute it.
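
A minimal spark-shell sketch of the same effect (the path is borrowed from your job purely for illustration; any RDD with a shuffle behaves the same way):

// Any shuffle (reduceByKey, join, ...) writes map output to the executors' local dirs.
val counts = sc.textFile("/loudacre/weblogs/*6").
  map(line => (line.split(' ')(2), 1)).
  reduceByKey(_ + _)

counts.count()  // job 1: the shuffle-map stage and the result stage both run
counts.count()  // job 2: the shuffle output from job 1 is still available, so the
                // shuffle-map stage appears under "Skipped Stages" and only the
                // result stage is executed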


why one of Stage is into Skipped section instead of Completed

2015-12-25 Thread Prem Spark
What does the Skipped Stage below mean? Can anyone help clarify?
I was expecting 3 stages to succeed, but only 2 of them completed while one was skipped.
Status: SUCCEEDED
Completed Stages: 2
Skipped Stages: 1

Scala REPL Code Used:

accounts is a basic RDD that contains weblog text data.
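
(accounts itself is not defined in this snippet; from the /loudacre/accounts/* entries in the lineage at the bottom it was presumably loaded along the following line, though the exact call is an assumption:)

var accounts = sc.textFile("/loudacre/accounts/*")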

var accountsByID = accounts.
  map(line => line.split(',')).
  map(values => (values(0), values(4) + ',' + values(3)));

var userreqs = sc.
  textFile("/loudacre/weblogs/*6").
  map(line => line.split(' ')).
  map(words => (words(2), 1)).
  reduceByKey((v1, v2) => v1 + v2);

var accounthits =
  accountsByID.join(userreqs).map(pair => pair._2)

accounthits.
  saveAsTextFile("/loudacre/userreqs")

scala> accounthits.toDebugString
res15: String =
(32) MapPartitionsRDD[24] at map at <console>:28 []
 |   MapPartitionsRDD[23] at join at <console>:28 []
 |   MapPartitionsRDD[22] at join at <console>:28 []
 |   CoGroupedRDD[21] at join at <console>:28 []
 +-(15) MapPartitionsRDD[15] at map at <console>:25 []
 |  |   MapPartitionsRDD[14] at map at <console>:24 []
 |  |   /loudacre/accounts/* MapPartitionsRDD[13] at textFile at <console>:21 []
 |  |   /loudacre/accounts/* HadoopRDD[12] at textFile at <console>:21 []
 |   ShuffledRDD[20] at reduceByKey at <console>:25 []
 +-(32) MapPartitionsRDD[19] at map at <console>:24 []
    |   MapPartitionsRDD[18] at map at <console>:23 []
    |   /loudacre/weblogs/*6 MapPartitionsRDD[17] at textFile at <console>:22 []
    |   /loudacre/weblogs/*6 HadoopRDD[16] at textFile at <console>:22 []
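
Reading the lineage: the number in parentheses is the partition count of each branch, and each +-( marks a shuffle boundary, i.e. the start of a separate stage. Roughly (the stage numbering below is only for discussion, not Spark's own IDs):

  Stage 1 (15 tasks): HadoopRDD[12] / textFile / map / map -> MapPartitionsRDD[15], shuffled for the join
  Stage 2 (32 tasks): HadoopRDD[16] / textFile / map / map -> MapPartitionsRDD[19], shuffled for the reduceByKey
  Stage 3 (32 tasks): ShuffledRDD[20] (reduceByKey) + CoGroupedRDD[21] (join) + the final maps and saveAsTextFile

That accounts for the 3 expected stages; per the reply above, the shuffle output feeding ShuffledRDD[20] already existed, so Stage 2 is the one reported as skipped.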