It's true that it is an implementation detail, but it's a very important
one to document, because it can change results depending on whether I use
take or collect.  The issue I was running into was that the executors had a
different operating system than the driver, and I was using 'pipe' with a
binary I compiled myself.  I needed to make sure I used the binary compiled
for the operating system it would actually run on.  So in cases where I was
only interested in the first value, my code was breaking horribly on 1.0.2
but working fine on 1.1.
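
To make the failure mode concrete, here is a rough sketch of what I was
doing in the pyspark shell ('./mytool-linux' and the records are just
placeholders for my real binary and input):

# './mytool-linux' stands in for a binary compiled for the executors' OS
# (Linux), not for the OS X driver.
piped = sc.parallelize(["record-1", "record-2"]).pipe("./mytool-linux")

# On 1.0.2, take(1) could execute this pipe locally on the OS X driver,
# where the Linux binary cannot run, so it blew up:
first = piped.take(1)[0]

# collect() always ships the job to the executors, so this was fine:
everything = piped.collect()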

Thanks for all your help!  My only suggestion would be to backport
'spark.localExecution.enabled' to the 1.0 line; on 1.1 I can control the
behavior explicitly, as in the sketch below.
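
This is roughly how I'd expect to set it (shown here through SparkConf in
pyspark; the app name is arbitrary, and the same flag can also be passed on
the command line with --conf):

from pyspark import SparkConf, SparkContext

# spark.localExecution.enabled (1.1+): "true" lets take()/first() run the
# job locally in the driver (the old 1.0 behavior); "false", the default,
# keeps every job on the cluster executors, which is what I'd want from a
# 1.0.x backport.
conf = (SparkConf()
        .setAppName("local-execution-demo")
        .set("spark.localExecution.enabled", "false"))
sc = SparkContext(conf=conf)

import platform
# With local execution disabled, this reports an executor's OS ('Linux'),
# not the driver's ('Darwin').
print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]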

Evan

On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu <dav...@databricks.com> wrote:

> This is an implementation detail, so it's not documented :-(
>
> If you think this is a blocker for you, you could create a JIRA; maybe
> it could be fixed in 1.0.3+.
>
> Davies
>
> On Fri, Oct 10, 2014 at 5:11 PM, Evan <evan.sama...@gmail.com> wrote:
> > Thank you!  I was looking for a config variable to that end, but I was
> > looking in the Spark 1.0.2 documentation, since that was the version I
> > had the problem with.  Is this behavior covered in the 1.0.2
> > documentation?
> >
> > Evan
> >
> > On 10/09/2014 04:12 PM, Davies Liu wrote:
> >>
> >> When you call rdd.take() or rdd.first(), it may[1] execute the job
> >> locally (in the driver); otherwise, all jobs are executed in the
> >> cluster.
> >>
> >> There is a config called `spark.localExecution.enabled` (since 1.1) to
> >> change this.  It's not enabled by default, so all functions will be
> >> executed in the cluster.  If you set it to `true`, then you get the
> >> same behavior as 1.0.
> >>
> >> [1] If it does not get enough items from the first partition, it will
> >> try multiple partitions at a time, and those will be executed in the
> >> cluster.
> >>
> >> On Thu, Oct 9, 2014 at 12:14 PM, esamanas <evan.sama...@gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I am using pyspark and I'm trying to support both Spark 1.0.2 and
> >>> 1.1.0 with my app, which will run in yarn-client mode.  However, when
> >>> I use 'map' to run a Python lambda function over an RDD, it appears to
> >>> run on different machines depending on the version, and this is
> >>> causing problems.
> >>>
> >>> In both cases, I am using a Hadoop cluster that runs Linux on all of
> >>> its nodes.  I am submitting my jobs from a machine running Mac OS X
> >>> 10.9.  As a reproducer, here is my script:
> >>>
> >>> import platform
> >>> print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
> >>>
> >>> The answer in Spark 1.1.0:
> >>> 'Linux'
> >>>
> >>> The answer in Spark 1.0.2:
> >>> 'Darwin'
> >>>
> >>> In other experiments I changed the size of the list that gets
> >>> parallelized, thinking maybe 1.0.2 just runs jobs on the driver node
> >>> if they're small enough.  I got the same answer, even with 1 million
> >>> numbers.
> >>>
> >>> This is a troubling difference.  I would expect all functions run on
> >>> an RDD to be executed on my worker nodes in the Hadoop cluster, but
> >>> this is clearly not the case for 1.0.2.  Why does this difference
> >>> exist?  How can I accurately detect which jobs will run where?
> >>>
> >>> Thank you,
> >>>
> >>> Evan
> >>>
