From your pseudocode, it would be sequential, and steps 1 and 2 would be done twice:

1+2+3,
then 1+2+4.

If you do a .cache() in step 2, then you would have 1+2+3, then 4.
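A minimal sketch of that pattern, following the numbered steps in your message (the filter and processing functions here are placeholders for your actual logic, not real code from your job):

```scala
val lines  = sc.textFile("hdfs:///path/to/data")        // (1) import full dataset
val ABonly = lines.filter(isTypeAOrB).cache()           // (2) cache so it is computed only once
val processA = ABonly.filter(isTypeA).map(processRowA)  // (3) reuses cached ABonly
val processB = ABonly.filter(isTypeB).map(processRowB)  // (4) reuses cached ABonly
```

Without the .cache(), triggering an action on processB recomputes steps 1 and 2 from scratch.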

I have run several steps in parallel from the same program, but never from the
same source RDD, so I do not know the limitations there. I simply started two
threads and forced execution by calling first() to work around Spark's lazy
evaluation.
In my case it did not improve performance, because the two jobs were competing
for the same total amount of cluster resources.
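A rough sketch of that threading workaround, assuming the two RDDs from steps (3) and (4) already exist (error handling omitted):

```scala
// Submit the two jobs from separate threads so the Spark scheduler
// can run them concurrently. Calling an action such as first()
// forces evaluation of the otherwise-lazy RDD lineage.
val tA = new Thread(new Runnable {
  def run(): Unit = processA.first()
})
val tB = new Thread(new Runnable {
  def run(): Unit = processB.first()
})
tA.start(); tB.start()
tA.join(); tB.join()
```

Whether this actually helps depends on whether your cluster has spare capacity beyond what a single job already saturates.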

Stephane

On Sat, Jan 10, 2015 at 11:24 AM, YaoPau <jonrgr...@gmail.com> wrote:

> I'm looking for ways to reduce the runtime of my Spark job.  My code is a
> single file of scala code and is written in this order:
>
> (1) val lines = Import full dataset using sc.textFile
> (2) val ABonly = Parse out all rows that are not of type A or B
> (3) val processA = Process only the A rows from ABonly
> (4) val processB = Process only the B rows from ABonly
>
> Is Spark doing (1) then (2) then (3) then (4) ... or is it by default doing
> (1) then (2) then branching to both (3) and (4) simultaneously and running
> both in parallel?  If not, how can I make that happen?
>
> Jon
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
