Created JIRA for this: https://issues.apache.org/jira/browse/SPARK-3915
On Sat, Oct 11, 2014 at 12:40 PM, Evan Samanas <evan.sama...@gmail.com> wrote:
> It's true that it is an implementation detail, but it's a very important one
> to document, because it has the possibility of changing results depending on
> whether I use take or collect. The issue I was running into was that the
> executor had a different operating system than the driver, and I was using
> 'pipe' with a binary I compiled myself. I needed to make sure I used the
> binary compiled for the operating system I expected it to run on. So in cases
> where I was only interested in the first value, my code was breaking
> horribly on 1.0.2 but working fine on 1.1.
>
> My only suggestion would be to backport 'spark.localExecution.enabled' to
> the 1.0 line. Thanks for all your help!
>
> Evan
>
> On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>> This is something of an implementation detail, so it is not documented :-(
>>
>> If you think this is a blocker for you, you could create a JIRA; maybe
>> it could be fixed in 1.0.3+.
>>
>> Davies
>>
>> On Fri, Oct 10, 2014 at 5:11 PM, Evan <evan.sama...@gmail.com> wrote:
>> > Thank you! I was looking for a config variable to that end, but I was
>> > looking in the Spark 1.0.2 documentation, since that was the version I
>> > had the problem with. Is this behavior documented in 1.0.2's
>> > documentation?
>> >
>> > Evan
>> >
>> > On 10/09/2014 04:12 PM, Davies Liu wrote:
>> >>
>> >> When you call rdd.take() or rdd.first(), it may[1] execute the job
>> >> locally (in the driver); otherwise, all jobs are executed in the
>> >> cluster.
>> >>
>> >> There is a config called `spark.localExecution.enabled` (since 1.1) to
>> >> change this. It is not enabled by default, so all functions are
>> >> executed in the cluster. If you set it to `true`, then you get the
>> >> same behavior as 1.0.
>> >>
>> >> [1] If it does not get enough items from the first partition, it will
>> >> try multiple partitions at a time, and those jobs will be executed in
>> >> the cluster.
>> >>
>> >> On Thu, Oct 9, 2014 at 12:14 PM, esamanas <evan.sama...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I am using pyspark and I'm trying to support both Spark 1.0.2 and
>> >>> 1.1.0 with my app, which will run in yarn-client mode. However, when
>> >>> I use 'map' to run a Python lambda function over an RDD, the lambda
>> >>> appears to run on different machines depending on the version, and
>> >>> this is causing problems.
>> >>>
>> >>> In both cases, I am using a Hadoop cluster that runs Linux on all of
>> >>> its nodes. I am submitting my jobs from a machine running Mac OS X
>> >>> 10.9. As a reproducer, here is my script:
>> >>>
>> >>> import platform
>> >>> print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
>> >>>
>> >>> The answer in Spark 1.1.0:
>> >>> 'Linux'
>> >>>
>> >>> The answer in Spark 1.0.2:
>> >>> 'Darwin'
>> >>>
>> >>> In other experiments I changed the size of the list that gets
>> >>> parallelized, thinking maybe 1.0.2 just runs jobs on the driver node
>> >>> if they're small enough. I got the same answer (even with 1 million
>> >>> numbers).
>> >>>
>> >>> This is a troubling difference. I would expect all functions run on
>> >>> an RDD to be executed on my worker nodes in the Hadoop cluster, but
>> >>> this is clearly not the case for 1.0.2. Why does this difference
>> >>> exist? How can I accurately detect which jobs will run where?
>> >>>
>> >>> Thank you,
>> >>>
>> >>> Evan
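For later readers of this thread, here is a minimal PySpark sketch of the workaround discussed above: explicitly setting `spark.localExecution.enabled` when building the SparkContext (Spark 1.1+ only), then running the reproducer from the original message to see where the lambda actually executed. The app name and surrounding setup are illustrative, not from the original emails.

    from pyspark import SparkConf, SparkContext
    import platform

    # spark.localExecution.enabled (Spark 1.1+) controls whether take()/first()
    # may run their first job locally in the driver instead of on an executor.
    # It defaults to "false" in 1.1; "true" restores the 1.0 behavior.
    conf = (SparkConf()
            .setAppName("local-execution-check")   # hypothetical app name
            .set("spark.localExecution.enabled", "false"))
    sc = SparkContext(conf=conf)

    # Reproducer from the thread: report which OS the lambda ran on.
    # With local execution enabled (or on 1.0.x), a Mac driver prints 'Darwin';
    # with it disabled, the task runs on a (Linux) executor and prints 'Linux'.
    print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

The same key can also be passed on the command line, e.g. `--conf spark.localExecution.enabled=false` with spark-submit, if you would rather not change application code.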