Hello. The driver is running the individual operations in series, but each operation is parallelized internally. If you want them to run in parallel, you need to give the driver a mechanism to submit the jobs from multiple threads:
import org.apache.spark.rdd.RDD
import scala.collection.parallel.mutable.ParArray

val rdd1 = sc.parallelize(1 to 100000)
val rdd2 = sc.parallelize(1 to 200000)

// zipWithIndex pairs each RDD with its position; .par turns the array into a
// parallel collection, so foreach runs the two bodies on separate driver
// threads and jobs for both RDDs are submitted concurrently.
// (`logger` is assumed to be a logger defined elsewhere in your application.)
val thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par
thingsToDo.foreach { case (rdd, index) =>
  for (i <- 1 to 10000)
    logger.info(s"Index ${index} - ${rdd.sum()}")
}

This will run both operations in parallel. A sketch of the same idea using Scala Futures follows the quoted message below.

On Mon, Jun 26, 2017 at 8:10 PM, satishl <satish.la...@gmail.com> wrote:
> For the below code, since rdd1 and rdd2 don't depend on each other, I was
> expecting the "first" and "second" printlns to be interleaved. However,
> the Spark job runs all "first" statements first and then all "second"
> statements, in serial fashion. I have set spark.scheduler.mode = FAIR.
> Obviously my understanding of parallel stages is wrong. What am I missing?
>
> val rdd1 = sc.parallelize(1 to 1000000)
> val rdd2 = sc.parallelize(1 to 1000000)
>
> for (i <- (1 to 100))
>   println("first: " + rdd1.sum())
> for (i <- (1 to 100))
>   println("second: " + rdd2.sum())
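For completeness, here is a minimal sketch of the same idea using explicit Scala Futures instead of a parallel collection. It assumes `sc` is your existing SparkContext and that spark.scheduler.mode = FAIR is set, as in your configuration; the pool names passed to setLocalProperty are illustrative and, without a fair-scheduler allocation file, simply get default pool settings.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = sc.parallelize(1 to 1000000)

// Each Future submits its jobs from its own driver-side thread, so the
// FAIR scheduler can interleave the resulting stages.
val first = Future {
  sc.setLocalProperty("spark.scheduler.pool", "pool1") // illustrative pool name
  for (i <- 1 to 100) println("first: " + rdd1.sum())
}
val second = Future {
  sc.setLocalProperty("spark.scheduler.pool", "pool2") // illustrative pool name
  for (i <- 1 to 100) println("second: " + rdd2.sum())
}

// Block until both sequences of jobs have finished.
Await.result(Future.sequence(Seq(first, second)), Duration.Inf)

The underlying rule is the same in both versions: Spark only runs independent jobs concurrently when they are submitted from separate driver threads; the FAIR scheduler then decides how to share executor resources among them.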