Created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-3915

On Sat, Oct 11, 2014 at 12:40 PM, Evan Samanas <evan.sama...@gmail.com> wrote:
> It's true that it is an implementation detail, but it's a very important one
> to document, because it can change results depending on whether I use take or
> collect.  The issue I was running into was that the executors had a different
> operating system than the driver, and I was using 'pipe' with a binary I
> compiled myself.  I needed to make sure I used the binary compiled for the
> operating system it would actually run on.  So in cases where I was only
> interested in the first value, my code was breaking horribly on 1.0.2 but
> working fine on 1.1.
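> As a rough sketch of what that pipeline looks like (the binary name here is
> made up; it stands in for the tool I cross-compiled for the cluster):
>
> # The process started by pipe() runs on whatever machine executes the task,
> # so the binary has to match that machine's OS.
> rdd = sc.parallelize(["some", "input", "records"])
> piped = rdd.pipe("./mytool-linux-x86_64")   # built for the Linux executors
>
> # On 1.0.2, take(1) could evaluate this in the driver (Mac OS X), where the
> # Linux binary cannot run; on 1.1 it runs on an executor as expected.
> print piped.take(1)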
>
> My only suggestion would be to backport 'spark.localExecution.enabled' to
> the 1.0 line.  Thanks for all your help!
>
> Evan
>
> On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>> This is an implementation detail, so it is not documented :-(
>>
>> If you think this is a blocker for you, you could create a JIRA; maybe it
>> could be fixed in 1.0.3+.
>>
>> Davies
>>
>> On Fri, Oct 10, 2014 at 5:11 PM, Evan <evan.sama...@gmail.com> wrote:
>> > Thank you!  I was looking for a config variable to that end, but I was
>> > looking in the Spark 1.0.2 documentation, since that was the version I
>> > had the problem with.  Is this behavior documented in 1.0.2's
>> > documentation?
>> >
>> > Evan
>> >
>> > On 10/09/2014 04:12 PM, Davies Liu wrote:
>> >>
>> >> When you call rdd.take() or rdd.first(), Spark may[1] execute the job
>> >> locally (in the driver); otherwise, all jobs are executed on the cluster.
>> >>
>> >> Since 1.1 there is a config called `spark.localExecution.enabled` to
>> >> control this.  It is disabled by default, so all functions are executed
>> >> on the cluster.  If you set it to `true`, you get the same behavior as
>> >> 1.0.
>> >>
>> >> [1] If take() does not get enough items from the first partition, it
>> >> retries with multiple partitions at a time, and those retries are
>> >> executed on the cluster.
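>> >>
>> >> A minimal sketch of turning it on from pyspark (assuming the usual
>> >> SparkConf/SparkContext setup; the app name is arbitrary, and the property
>> >> can also be passed to spark-submit with --conf):
>> >>
>> >> from pyspark import SparkConf, SparkContext
>> >>
>> >> # Enable driver-side local execution of take()/first() on 1.1+
>> >> # (the flag defaults to false, i.e. everything runs on the cluster).
>> >> conf = (SparkConf()
>> >>         .setAppName("local-execution-demo")
>> >>         .set("spark.localExecution.enabled", "true"))
>> >> sc = SparkContext(conf=conf)
>> >>
>> >> # With the flag on, a small take(1) may be evaluated in the driver,
>> >> # matching the 1.0 behavior described above.
>> >> print sc.parallelize([1]).map(lambda x: x + 1).take(1)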
>> >>
>> >> On Thu, Oct 9, 2014 at 12:14 PM, esamanas <evan.sama...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I am using pyspark and I'm trying to support both Spark 1.0.2 and 1.1.0
>> >>> with my app, which will run in yarn-client mode.  However, when I use
>> >>> 'map' to run a Python lambda function over an RDD, the function appears
>> >>> to run on different machines depending on the version, and this is
>> >>> causing problems.
>> >>>
>> >>> In both cases, I am using a Hadoop cluster that runs Linux on all of its
>> >>> nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.
>> >>> As a reproducer, here is my script:
>> >>>
>> >>> import platform
>> >>> print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
>> >>>
>> >>> The answer in Spark 1.1.0:
>> >>> 'Linux'
>> >>>
>> >>> The answer in Spark 1.0.2:
>> >>> 'Darwin'
>> >>>
>> >>> In other experiments I changed the size of the list that gets
>> >>> parallelized, thinking maybe 1.0.2 just runs jobs on the driver node if
>> >>> they're small enough.  I got the same answer (with only 1 million
>> >>> numbers).
>> >>>
>> >>> This is a troubling difference.  I would expect all functions run on an
>> >>> RDD to be executed on my worker nodes in the Hadoop cluster, but this is
>> >>> clearly not the case for 1.0.2.  Why does this difference exist?  How
>> >>> can I accurately detect which jobs will run where?
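>> >>>
>> >>> Something along these lines (a rough sketch that swaps platform.system()
>> >>> for socket.gethostname()) is the kind of check I'd like to be able to
>> >>> rely on:
>> >>>
>> >>> import socket
>> >>>
>> >>> # Report the hostname of the machine that actually evaluates the lambda.
>> >>> # take(1) may run in the driver on 1.0.x; collect() should always run on
>> >>> # the cluster.
>> >>> print sc.parallelize([1]).map(lambda x: socket.gethostname()).take(1)[0]
>> >>> print sc.parallelize(range(4), 4).map(lambda x: socket.gethostname()).collect()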
>> >>>
>> >>> Thank you,
>> >>>
>> >>> Evan
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >
>
>

