Thank you! I had been looking for a config variable to that end, but in the Spark 1.0.2 documentation, since that was the version I had the problem with. Is this behavior documented in the 1.0.2 documentation?

Evan


On 10/09/2014 04:12 PM, Davies Liu wrote:
When you call rdd.take() or rdd.first(), Spark may[1] execute the job
locally (in the driver);
otherwise, all jobs are executed in the cluster.

There is a config option called `spark.localExecution.enabled` (since 1.1) that
controls this.
It is not enabled by default, so all functions are executed in the cluster.
If you set it to `true`, you get the same behavior as in 1.0.
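
As a minimal sketch of how that option could be set from PySpark: the key name
comes from the paragraph above, while the helper name `set_local_execution` is
hypothetical and the `conf` argument is assumed to be a `pyspark.SparkConf`
(only its `set(key, value)` method is used).

```python
# Key name as given in the reply above; it applies to Spark 1.1+.
LOCAL_EXECUTION_KEY = "spark.localExecution.enabled"


def set_local_execution(conf, enabled):
    """Set the local-execution flag on a SparkConf-like object.

    `conf` is assumed to be a pyspark.SparkConf; any object with a
    set(key, value) method that returns the object itself works too.
    This helper is hypothetical, not part of Spark.
    """
    return conf.set(LOCAL_EXECUTION_KEY, "true" if enabled else "false")


# Typical use with PySpark (not executed here):
#   from pyspark import SparkConf, SparkContext
#   conf = set_local_execution(SparkConf().setAppName("demo"), True)
#   sc = SparkContext(conf=conf)
```

The option has to be set before the SparkContext is created for it to take
effect in that application.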

[1] If it does not get enough items from the first partition, it will
try multiple partitions
at a time, and those will be executed in the cluster.
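
One way to check where a function actually ran is to have it report the OS and
hostname of the process that evaluated it, then compare that with the driver.
This is only a sketch: the helper names `where_am_i` and `ran_on_driver` are
hypothetical, and the comparison is ambiguous if an executor happens to run on
the same host as the driver.

```python
import platform
import socket


def where_am_i(_):
    """Return (OS name, hostname) of the process evaluating this function."""
    return (platform.system(), socket.gethostname())


def ran_on_driver(task_info, driver_info):
    """True if the task reported the same (OS, hostname) pair as the driver."""
    return tuple(task_info) == tuple(driver_info)


# Typical use in a PySpark shell where `sc` already exists (not executed here):
#   driver_info = where_am_i(None)
#   via_take = sc.parallelize([1]).map(where_am_i).take(1)[0]
#   via_collect = sc.parallelize([1]).map(where_am_i).collect()[0]
#   print(ran_on_driver(via_take, driver_info))
#   print(ran_on_driver(via_collect, driver_info))
```

Comparing the take() result with the collect() result shows whether the two
actions are treated differently on a given Spark version.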

On Thu, Oct 9, 2014 at 12:14 PM, esamanas <evan.sama...@gmail.com> wrote:
Hi,

I am using pyspark and I'm trying to support both Spark 1.0.2 and 1.1.0 with
my app, which will run in yarn-client mode.  However, it appears that when I use
'map' to run a Python lambda function over an RDD, the function runs on
different machines depending on the Spark version, and this is causing problems.

In both cases, I am using a Hadoop cluster that runs Linux on all of its
nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As a
reproducer, here is my script:

import platform
print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

The answer in Spark 1.1.0:
'Linux'

The answer in Spark 1.0.2:
'Darwin'

In other experiments I changed the size of the list that gets parallelized,
thinking maybe 1.0.2 just runs jobs on the driver node if they're small
enough.  I got the same answer (with only 1 million numbers).

This is a troubling difference.  I would expect all functions run on an RDD
to be executed on my worker nodes in the Hadoop cluster, but this is clearly
not the case for 1.0.2.  Why does this difference exist?  How can I
accurately detect which jobs will run where?

Thank you,

Evan




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/where-are-my-python-lambda-functions-run-in-yarn-client-mode-tp16059.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


