Hi Rares,

If you dig into the descriptions for the two jobs, you will probably see something like:
Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...

Job ID: 0
org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...

The Spark code from the git copy of master is at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Line 428 refers to:

    val initialCount = this.count()

and line 447 refers to:

    var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()

In other words, the first job computes the count, which is needed before the second job can generate the samples. There is a rough sketch of that logic after the quoted message below.

HTH!
Denny

On Fri, Mar 6, 2015 at 10:44 AM Rares Vernica <rvern...@gmail.com> wrote:
> Hello,
>
> I am using takeSample from the Scala Spark 1.2.1 shell:
>
> scala> sc.textFile("README.md").takeSample(false, 3)
>
> and I notice that two jobs are generated on the Spark Jobs page:
>
> Job Id    Description
> 1         takeSample at <console>:13
> 0         takeSample at <console>:13
>
> Any ideas why the two jobs are needed?
>
> Thanks!
> Rares
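
P.S. Here is the rough sketch mentioned above of why two jobs are triggered. It is a paraphrase, not the exact Spark source; the helper name takeSampleSketch is made up for illustration.

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag
    import scala.util.Random

    // Rough paraphrase of the logic around RDD.scala lines 428 and 447.
    def takeSampleSketch[T: ClassTag](rdd: RDD[T],
                                      withReplacement: Boolean,
                                      num: Int,
                                      seed: Long = Random.nextLong()): Array[T] = {
      val rand = new Random(seed)
      // Job 0: a full pass over the RDD just to find out how many elements it has.
      val initialCount = rdd.count()
      if (initialCount == 0) return Array.empty[T]
      // The count is needed to turn "give me num elements" into a sampling fraction.
      val fraction = math.min(1.0, num.toDouble / initialCount)
      // Job 1: sample with that fraction and collect the result to the driver.
      val samples = rdd.sample(withReplacement, fraction, rand.nextInt()).collect()
      samples.take(num)
    }

Calling something like takeSampleSketch(sc.textFile("README.md"), false, 3) would similarly show two jobs in the UI: one from count() and one from sample(...).collect().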