sortByKey() eagerly runs one job that samples the data, in order to
determine what range of keys to put in each partition. That sampling job
is the extra stage you are seeing.
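
To illustrate why the range boundaries cannot be known without reading
some of the data first, here is a rough sketch in plain Python. It is not
Spark's actual code; the function names, the sample size, and the
even-spacing rule for picking split points are all assumptions made for
the example.

```python
import bisect
import random


def sample_range_bounds(keys, num_partitions, sample_size=100, seed=0):
    """Pick num_partitions - 1 split points from a random sample of keys.

    Hypothetical helper: this is the step that has to look at the data,
    which is why it cannot be purely lazy.
    """
    rng = random.Random(seed)
    keys = list(keys)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # Evenly spaced positions in the sorted sample become the upper
    # bounds of the first num_partitions - 1 partitions.
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, bounds):
    """Return the index of the partition whose key range contains key."""
    return bisect.bisect_left(bounds, key)


# Toy data standing in for the keys of an RDD.
keys = list(range(1000))
random.Random(1).shuffle(keys)

# The "extra job": sample the keys to decide the ranges.
bounds = sample_range_bounds(keys, num_partitions=4)

# The "shuffle": route each key to its range partition.
partitions = [[] for _ in range(4)]
for k in keys:
    partitions[partition_for(k, bounds)].append(k)
```

Once the bounds exist, every key in partition i sorts before every key in
partition i+1, so sorting each partition locally gives a total order.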

There is a JIRA to change sortByKey() so that it defers launching the
sampling job until the subsequent action, but even then the sampling will
still execute as an additional stage:

https://issues.apache.org/jira/browse/SPARK-1021

On Wed, Apr 29, 2015 at 5:57 PM, Tom Hubregtsen <thubregt...@gmail.com>
wrote:

> "I'm not sure, but I wonder if because you are using the Spark REPL that it
> may not be representing what a normal runtime execution would look like and
> is possibly eagerly running a partial DAG once you define an operation that
> would cause a shuffle.
>
> What happens if you set up your same set of commands [a-e] in a file and use
> the Spark REPL's `load` or `paste` command to load them all at once?" From
> Richard
>
> I have also packaged it in a jar file (without [e], the debug string), and
> still see the extra stage before the other two that I would expect. Even
> when I remove [d], the action, I still see stage 0 being executed (and do
> not see stage 1 and 2).
>
> Again a shortened log of the Stage 0:
> INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at
> sortByKey, which has no missing parents
> INFO DAGScheduler: ResultStage 0 (sortByKey) finished in 0.192 s
>