This is something of an implementation detail, so it is not documented :-(

If you think this is a blocker for you, you could create a JIRA; maybe it
could be fixed in 1.0.3+.

Davies

On Fri, Oct 10, 2014 at 5:11 PM, Evan <evan.sama...@gmail.com> wrote:
> Thank you!  I was looking for a config variable to that end, but I was
> looking in the Spark 1.0.2 documentation, since that was the version I had
> the problem with.  Is this behavior documented for 1.0.2?
>
> Evan
>
> On 10/09/2014 04:12 PM, Davies Liu wrote:
>>
>> When you call rdd.take() or rdd.first(), it may [1] execute the job
>> locally (in the driver); otherwise, all jobs are executed on the cluster.
>>
>> There is a config called `spark.localExecution.enabled` (since 1.1) to
>> change this. It is not enabled by default, so all functions are executed
>> on the cluster. If you set it to `true`, you get the same behavior as 1.0.
>>
>> [1] If it does not get enough items from the first partition, it will try
>> multiple partitions at a time, and those will be executed on the cluster.
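>>
>> For example, here is a minimal sketch of how you could turn local
>> execution back on in a standalone PySpark application (the pyspark shell
>> already creates `sc` for you; the app name below is just a placeholder):
>>
>> from pyspark import SparkConf, SparkContext
>>
>> # Opt back in to local execution of take()/first() (Spark 1.1+ only);
>> # it is off by default, so these jobs normally run on the cluster.
>> conf = (SparkConf()
>>         .setAppName("local-execution-sketch")
>>         .set("spark.localExecution.enabled", "true"))
>> sc = SparkContext(conf=conf)
>>
>> With this set, a small take() or first() may run in the driver process,
>> so a lambda like platform.system() can report the driver's OS instead of
>> the executors'. You can also pass the property on the command line with
>> `spark-submit --conf spark.localExecution.enabled=true`.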
>>
>> On Thu, Oct 9, 2014 at 12:14 PM, esamanas <evan.sama...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I am using pyspark and I'm trying to support both Spark 1.0.2 and 1.1.0
>>> with my app, which will run in yarn-client mode.  However, when I use
>>> 'map' to run a Python lambda function over an RDD, the function appears
>>> to be run on different machines depending on the version, and this is
>>> causing problems.
>>>
>>> In both cases, I am using a Hadoop cluster that runs Linux on all of its
>>> nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As
>>> a reproducer, here is my script:
>>>
>>> # Run in the pyspark shell, where 'sc' is the pre-created SparkContext.
>>> import platform
>>> print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
>>>
>>> The answer in Spark 1.1.0:
>>> 'Linux'
>>>
>>> The answer in Spark 1.0.2:
>>> 'Darwin'
>>>
>>> In other experiments I changed the size of the list that gets
>>> parallelized, thinking maybe 1.0.2 just runs jobs on the driver node if
>>> they're small enough.  I got the same answer even with 1 million numbers.
>>>
>>> This is a troubling difference.  I would expect all functions run on an
>>> RDD to be executed on my worker nodes in the Hadoop cluster, but this is
>>> clearly not the case for 1.0.2.  Why does this difference exist?  How can
>>> I accurately detect which jobs will run where?
>>>
>>> Thank you,
>>>
>>> Evan
>>>
>>>
>>>
>>>
>
