This is an implementation detail, so it is not documented :-( If you think this is a blocker for you, you could create a JIRA; maybe it could be fixed in 1.0.3+.
Davies

On Fri, Oct 10, 2014 at 5:11 PM, Evan <evan.sama...@gmail.com> wrote:
> Thank you! I was looking for a config variable to that end, but I was
> looking in the Spark 1.0.2 documentation, since that was the version I had
> the problem with. Is this behavior documented in 1.0.2's documentation?
>
> Evan
>
> On 10/09/2014 04:12 PM, Davies Liu wrote:
>>
>> When you call rdd.take() or rdd.first(), it may[1] execute the job
>> locally (in the driver); otherwise, all jobs are executed in the cluster.
>>
>> There is a config called `spark.localExecution.enabled` (since 1.1+) to
>> change this. It's not enabled by default, so all the functions will be
>> executed in the cluster. If you set this to `true`, then you get the
>> same behavior as 1.0.
>>
>> [1] If it does not get enough items from the first partition, it will
>> try multiple partitions at a time, so they will be executed in the
>> cluster.
>>
>> On Thu, Oct 9, 2014 at 12:14 PM, esamanas <evan.sama...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I am using pyspark and I'm trying to support both Spark 1.0.2 and 1.1.0
>>> with my app, which will run in yarn-client mode. However, it appears
>>> that when I use 'map' to run a Python lambda function over an RDD, the
>>> function is run on different machines in the two versions, and this is
>>> causing problems.
>>>
>>> In both cases, I am using a Hadoop cluster that runs Linux on all of its
>>> nodes. I am submitting my jobs from a machine running Mac OS X 10.9. As
>>> a reproducer, here is my script:
>>>
>>> import platform
>>> print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
>>>
>>> The answer in Spark 1.1.0:
>>> 'Linux'
>>>
>>> The answer in Spark 1.0.2:
>>> 'Darwin'
>>>
>>> In other experiments I changed the size of the list that gets
>>> parallelized, thinking maybe 1.0.2 just runs jobs on the driver node if
>>> they're small enough. I got the same answer (with only 1 million
>>> numbers).
>>>
>>> This is a troubling difference. I would expect all functions run on an
>>> RDD to be executed on my worker nodes in the Hadoop cluster, but this is
>>> clearly not the case for 1.0.2. Why does this difference exist? How can
>>> I accurately detect which jobs will run where?
>>>
>>> Thank you,
>>>
>>> Evan
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/where-are-my-python-lambda-functions-run-in-yarn-client-mode-tp16059.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
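
For an app that has to support both versions, the behavior described in this thread could be captured in a small helper. This is only a rough sketch, not anything that ships with Spark: `parse_version` and `local_execution_conf` are made-up names, and the helper assumes the version string starts with a plain "major.minor" number. The only Spark-specific thing it relies on is the `spark.localExecution.enabled` key mentioned above.

```python
import re


def parse_version(version):
    """Turn a string such as '1.1.0' into a comparable tuple like (1, 1)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized Spark version: %r" % (version,))
    return (int(match.group(1)), int(match.group(2)))


def local_execution_conf(spark_version, want_driver_side_take):
    """Return the conf entries that request the given take()/first() behavior.

    Hypothetical helper, based only on what is stated in this thread.
    """
    if parse_version(spark_version) < (1, 1):
        # Per the thread: 1.0.x has no switch, and take()/first() may run
        # in the driver, so there is nothing to set here.
        return {}
    # Per the thread: 1.1+ runs everything in the cluster by default and
    # exposes this flag to get the 1.0 behavior back.
    value = "true" if want_driver_side_take else "false"
    return {"spark.localExecution.enabled": value}


if __name__ == "__main__":
    for version in ("1.0.2", "1.1.0"):
        print(version, local_execution_conf(version, True))
```

Each returned entry could then be passed to `SparkConf().set(key, value)` before the SparkContext is created. Note that the helper cannot make 1.0.2 behave like 1.1.0; per the reply above, only actions other than take() and first() are always executed in the cluster there.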