Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Sameer Choudhary
In the above setup my executors start one Docker container per task. Some of
these containers grow in memory as data is piped. Eventually there is not
enough memory on the machine for the Docker containers to run (since YARN
has already started its own containers), and everything starts failing.

The way I'm planning to solve this is by reducing the memory available for
YARN to manage, overriding EMR's default configuration. For example, if my
machine has 264 GB of memory, I'll give 150 GB to YARN to run Spark, and the
rest will be left for the Docker containers. By default, YARN manages about
220 GB of memory for my instance type.
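
Concretely, the override is just an EMR configuration classification; here is
a minimal sketch (the 153600 MB value and file name are illustrative, not my
exact settings):

# Sketch: cap the memory YARN manages per node via an EMR configuration
# classification, leaving the rest of the host for Docker. The value is
# illustrative (150 GB = 153600 MB); tune it per instance type.
yarn_memory_override = [
    {
        "Classification": "yarn-site",
        "Properties": {
            # Total MB the NodeManager may allocate to YARN containers on
            # each node (EMR normally sets this per instance type).
            "yarn.nodemanager.resource.memory-mb": "153600"
        }
    }
]
# Passed at cluster creation, e.g. as the Configurations argument to
# boto3's run_job_flow, or written out as JSON for
# `aws emr create-cluster --configurations file://yarn-memory.json`.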

The only problem is that this approach is quite wasteful, especially if I
want a long-running cluster where many users can run Spark jobs
simultaneously. I am eagerly waiting for YARN-3611 to be resolved.

Best,
Sameer



Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Holden Karau
So if the process you're communicating with from Spark isn't launched inside
of its YARN container, then it shouldn't be an issue - although it sounds
like you may have multiple resource managers on the same machine, which
can sometimes lead to interesting/difficult states.



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Sameer Choudhary
Ok, that makes sense for processes directly launched via fork or exec from
the task.

However, in my case my lightweight daemon has the Docker daemon start the
new process, and that process runs in a Docker container. Will the
container's memory be counted against the YARN executor memory overhead as
well? How will YARN know that a container launched by the Docker daemon is
linked to an executor?

Best,
Sameer



Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Holden Karau
YARN will kill your processes if the child processes you start via pipe()
consume too much memory. You can configure the amount of memory Spark
leaves aside for processes other than the JVM in the YARN containers
with spark.yarn.executor.memoryOverhead.
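
For example, something along these lines (the executor memory and overhead
values are purely illustrative; size the overhead to cover whatever your
piped children actually need):

from pyspark import SparkConf, SparkContext

# Each executor's YARN container is requested as executor memory plus the
# overhead; the overhead is the room left for non-JVM children (Python
# workers, piped processes, etc.). Values below are illustrative.
conf = (
    SparkConf()
    .setAppName("pipe-job")
    .set("spark.executor.memory", "4g")
    # Spark 1.x property, value in MB; default is max(384, 0.10 * executor memory).
    .set("spark.yarn.executor.memoryOverhead", "4096")
)
sc = SparkContext(conf=conf)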



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Yarn resource utilization with Spark pipe()

2016-11-23 Thread Sameer Choudhary
Hi,

I am working on a Spark 1.6.2 application on a YARN-managed EMR cluster
that uses RDD's pipe method to process my data. I start a lightweight
daemon process that starts processes for each task via pipes. This is
to ensure that I don't run into
https://issues.apache.org/jira/browse/SPARK-671.
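
Roughly, the per-task piping looks like the sketch below (the paths and the
daemon client command are placeholders, not my real ones):

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("pipe-pipeline"))

# Each task streams its partition, one line per record, through an external
# command. The command here is a placeholder for the small client that hands
# records to my long-running daemon instead of forking a process per record
# (the SPARK-671 concern).
records = sc.textFile("s3://my-bucket/input/")             # placeholder path
processed = records.pipe("/usr/local/bin/daemon_client")   # one child per task
processed.saveAsTextFile("s3://my-bucket/output/")         # placeholder path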

I'm running into Spark job failures due to task failures across the
cluster. The following questions would, I think, help in understanding
the issue:

- How does resource allocation in PySpark work? How do YARN and
Spark track the memory consumed by the Python processes launched on the
worker nodes?

- As an example, let's say Spark started n tasks on a worker node.
These n tasks start n processes via pipe. Memory for executors is
already reserved during application launch. As the processes run, their
memory footprint grows, and eventually there is not enough memory on
the box. In this case, how will YARN and Spark behave? Will the
executors be killed, or will my processes be killed, eventually killing the
task? I think this could lead to cascading task failures across the
cluster as retry attempts also fail, eventually leading to termination
of the Spark job. Is there a way to avoid this?

- When we define the number of executors in my SparkConf, are they
distributed evenly across my nodes? One approach to get around this
problem would be to limit the number of executors that YARN can launch
on each host, so that we manage the memory for the piped processes
outside of YARN (see the rough packing arithmetic sketched below). Is
there a way to avoid this?
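
For reference, the rough memory-only packing arithmetic I have in mind (all
numbers illustrative; vcore limits and YARN's minimum-allocation rounding
also constrain the real placement):

# Memory-only estimate of how many executors YARN can place on one node.
# Illustrative numbers; cores and yarn.scheduler.minimum-allocation-mb
# rounding also matter in practice.
yarn_node_memory_mb = 153600     # yarn.nodemanager.resource.memory-mb
executor_memory_mb = 4 * 1024    # spark.executor.memory = 4g
memory_overhead_mb = 4096        # spark.yarn.executor.memoryOverhead

container_mb = executor_memory_mb + memory_overhead_mb
executors_per_node = yarn_node_memory_mb // container_mb
print(executors_per_node)        # 18 with these numbers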

Thanks,
Sameer

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Fwd: Yarn resource utilization with Spark pipe()

2016-11-20 Thread Sameer Choudhary
Hi,

I am working on a Spark 1.6.2 application on a YARN-managed EMR cluster that
uses RDD's pipe method to process my data. I start a lightweight daemon
process that starts processes for each task via pipes. This is to ensure
that I don't run into https://issues.apache.org/jira/browse/SPARK-671.

I'm running into Spark job failures due to task failures across the cluster.
The following questions would, I think, help in understanding the issue:

- How does resource allocation in PySpark work? How do YARN and Spark track
the memory consumed by the Python processes launched on the worker nodes?

- As an example, let's say Spark started n tasks on a worker node. These n
tasks start n processes via pipe. Memory for executors is already reserved
during application launch. As the processes run, their memory footprint
grows, and eventually there is not enough memory on the box. In this case,
how will YARN and Spark behave? Will the executors be killed, or will my
processes be killed, eventually killing the task? I think this could lead
to cascading task failures across the cluster as retry attempts also fail,
eventually leading to termination of the Spark job. Is there a way to avoid
this?

- When we define the number of executors in my SparkConf, are they
distributed evenly across my nodes? One approach to get around this problem
would be to limit the number of executors that YARN can launch on each host,
so that we manage the memory for the piped processes outside of YARN. Is
there a way to avoid this?

Thanks,
Sameer