Hi Rares,

If you dig into the descriptions for the two jobs, you will probably see something like:
Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...

Job ID: 0
org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...

The Spark code from the git copy of master is at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Line 428 refers to:

    val initialCount = this.count()

and line 447 refers to:

    var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()

In other words, the first job computes the count, which is needed before the second job can generate the samples. There is a rough sketch of that logic after the quoted message below.

HTH!
Denny

On Fri, Mar 6, 2015 at 10:44 AM Rares Vernica <rvern...@gmail.com> wrote:
> Hello,
>
> I am using takeSample from the Scala Spark 1.2.1 shell:
>
> scala> sc.textFile("README.md").takeSample(false, 3)
>
> and I notice that two jobs are generated on the Spark Jobs page:
>
> Job Id    Description
> 1         takeSample at <console>:13
> 0         takeSample at <console>:13
>
> Any ideas why the two jobs are needed?
>
> Thanks!
> Rares
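
P.S. Here is the rough sketch mentioned above of why two jobs are triggered. It is a paraphrase, not the exact Spark source; the helper name takeSampleSketch is made up for illustration.

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag
    import scala.util.Random

    // Rough paraphrase of the logic around RDD.scala lines 428 and 447.
    def takeSampleSketch[T: ClassTag](rdd: RDD[T],
                                      withReplacement: Boolean,
                                      num: Int,
                                      seed: Long = Random.nextLong()): Array[T] = {
      val rand = new Random(seed)
      // Job 0: a full pass over the RDD just to find out how many elements it has.
      val initialCount = rdd.count()
      if (initialCount == 0) return Array.empty[T]
      // The count is needed to turn "give me num elements" into a sampling fraction.
      val fraction = math.min(1.0, num.toDouble / initialCount)
      // Job 1: sample with that fraction and collect the result to the driver.
      val samples = rdd.sample(withReplacement, fraction, rand.nextInt()).collect()
      samples.take(num)
    }

Calling something like takeSampleSketch(sc.textFile("README.md"), false, 3) would similarly show two jobs in the UI: one from count() and one from sample(...).collect().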