From your pseudo code, it would run sequentially, and the shared work would be done twice: 1+2+3, then 1+2+4. If you call .cache() on the result of step (2), you would instead have 1+2+3 and then just 4, because the second action reuses the cached RDD.
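A minimal sketch of that, using the step names from the question below; the input path, filter predicates, and per-row processing are placeholders, since the real logic wasn't shown:

    import org.apache.spark.{SparkConf, SparkContext}

    object ABJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ABJob"))

        val lines = sc.textFile("hdfs:///path/to/input")                  // (1)
        // (2) keep only A/B rows; startsWith is a placeholder predicate
        val ABonly = lines.filter(l => l.startsWith("A") || l.startsWith("B"))
                          .cache() // materialized on first action, reused afterwards

        val processA = ABonly.filter(_.startsWith("A")).map(_.length)    // (3) placeholder
        val processB = ABonly.filter(_.startsWith("B")).map(_.length)    // (4) placeholder

        // Two separate jobs: the first computes and caches ABonly,
        // the second reads it from cache instead of re-running (1)+(2).
        println(processA.count())
        println(processB.count())
        sc.stop()
      }
    }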
I ran several steps in parallel from the same program, but never using the same source RDD, so I do not know the limitations there. I simply started two threads and forced execution by calling first() to work around lazy evaluation (a sketch of this approach follows below the quoted message). In my case it did not improve performance, because the same total amount of resources was used by Spark.

Stephane

On Sat, Jan 10, 2015 at 11:24 AM, YaoPau <jonrgr...@gmail.com> wrote:

> I'm looking for ways to reduce the runtime of my Spark job. My code is a
> single file of Scala code and is written in this order:
>
> (1) val lines = Import full dataset using sc.textFile
> (2) val ABonly = Parse out all rows that are not of type A or B
> (3) val processA = Process only the A rows from ABonly
> (4) val processB = Process only the B rows from ABonly
>
> Is Spark doing (1) then (2) then (3) then (4) ... or is it by default doing
> (1) then (2) then branching to both (3) and (4) simultaneously and running
> both in parallel? If not, how can I make that happen?
>
> Jon
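Here is a minimal sketch of the two-threads approach I described, using Scala Futures rather than raw threads; processA and processB are the RDDs from the earlier sketch, and count() is just an action that forces execution. Note the two jobs only truly interleave on the cluster if executors have spare capacity (or, I believe, if you enable the fair scheduler via spark.scheduler.mode=FAIR):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // SparkContext is thread-safe: each action submitted from its own thread
    // becomes a separate job, so (3) and (4) can be scheduled concurrently.
    val jobA = Future { processA.count() } // forces (1)+(2)+(3), populates the cache
    val jobB = Future { processB.count() } // forces (4); reuses ABonly if cached

    val countA = Await.result(jobA, Duration.Inf)
    val countB = Await.result(jobB, Duration.Inf)
    println(s"A rows: $countA, B rows: $countB")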