So if the process you're communicating with from Spark isn't launched inside
its YARN container, then it shouldn't be an issue - although it sounds like
you may effectively have multiple resource managers on the same machine,
which can sometimes lead to interesting/difficult states.

On Thu, Nov 24, 2016 at 1:27 PM, Sameer Choudhary <sameer2...@gmail.com>
wrote:

> Ok, that makes sense for processes directly launched via fork or exec from
> the task.
>
> However, in my case the new process is started by the docker daemon and
> runs in a docker container. Will the container's memory count against the
> YARN executor memory overhead as well? How will YARN know that the
> container launched by the docker daemon is linked to an executor?
>
> Best,
> Sameer
>
> On Thu, Nov 24, 2016 at 1:59 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> YARN will kill your processes if the child processes you start via pipe
>> consume too much memory. You can configure the amount of memory Spark
>> leaves aside for processes other than the JVM in the YARN containers
>> with spark.yarn.executor.memoryOverhead.
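>>
>> For concreteness, here is a minimal sketch (not part of the original
>> setup) of setting that overhead from PySpark; the 2048 MB value is only
>> an illustration, not a recommendation:
>>
>>   from pyspark import SparkConf, SparkContext
>>
>>   # Reserve extra non-JVM memory in each YARN container for piped
>>   # child processes; spark.yarn.executor.memoryOverhead is in MB.
>>   conf = (SparkConf()
>>           .setAppName("pipe-example")
>>           .set("spark.yarn.executor.memoryOverhead", "2048"))
>>   sc = SparkContext(conf=conf)
>>
>> The same setting can also be passed to spark-submit with
>> --conf spark.yarn.executor.memoryOverhead=2048.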
>>
>> On Wed, Nov 23, 2016 at 10:38 PM, Sameer Choudhary <sameer2...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> I am working on a Spark 1.6.2 application on a YARN-managed EMR cluster
>> that uses the RDD pipe method to process my data. I start a lightweight
>> daemon process that starts processes for each task via pipes. This is
>> to ensure that I don't run into
>> https://issues.apache.org/jira/browse/SPARK-671.
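>>
>> For context, a minimal sketch of RDD.pipe in PySpark (the daemon-based
>> launch described above is not shown; "wc -l" is just a placeholder
>> command):
>>
>>   from pyspark import SparkContext
>>
>>   sc = SparkContext(appName="pipe-sketch")
>>   rdd = sc.parallelize(["line one", "line two", "line three"], 2)
>>   # Each partition is written to the external command's stdin, one
>>   # element per line; the command's stdout becomes the new RDD.
>>   counts = rdd.pipe("wc -l").collect()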
>>
>> I'm running into Spark job failures due to task failures across the
>> cluster. Here are the questions that I think would help in
>> understanding the issue:
>>
>> - How does resource allocation in PySpark work? How do YARN and
>> Spark track the memory consumed by Python processes launched on the
>> worker nodes?
>>
>> - As an example, let's say Spark started n tasks on a worker node.
>> These n tasks start n processes via pipe. Memory for executors is
>> already reserved during application launch. As the processes run,
>> their memory footprint grows and eventually there is not enough memory
>> on the box. In this case how will YARN and Spark behave? Will the
>> executors be killed, or will my processes be killed, eventually killing
>> the task? I think this could lead to cascading failures of tasks across
>> the cluster as retry attempts also fail, eventually leading to
>> termination of the Spark job. Is there a way to avoid this?
>>
>> - When we define the number of executors in my SparkConf, are they
>> distributed evenly across my nodes? One approach to get around this
>> problem would be to limit the number of executors that YARN can launch
>> on each host (see the configuration sketch below). That way we would
>> manage the memory for the piped processes outside of YARN. Is there a
>> way to avoid this?
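>>
>> As a rough illustration of the kind of settings I mean (values are
>> examples only), the executor count and sizing are set like this; as I
>> understand it, the YARN scheduler, not Spark, decides on which hosts
>> the executor containers actually land:
>>
>>   from pyspark import SparkConf
>>
>>   # Example values only; these control totals and per-executor sizing,
>>   # not strict per-host placement.
>>   conf = (SparkConf()
>>           .set("spark.executor.instances", "10")
>>           .set("spark.executor.memory", "4g")
>>           .set("spark.executor.cores", "2"))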
>>
>> Thanks,
>> Sameer
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
