Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Sameer Choudhary
In the above setup my executors start one docker container per task. Some of these containers grow in memory as data is piped. Eventually there is not enough memory on the machine for the docker containers to run (since YARN has already started its own containers), and everything starts failing. The way I'm
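For illustration, a per-task container setup like the one described might look roughly like the sketch below; the image name, script path, and HDFS paths are hypothetical placeholders. The key point is that each task's container is started by the docker daemon, so its memory lives outside the YARN container that holds the executor.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch of the setup described above (image and script names are
    // placeholders). Each task pipes its partition's records into a fresh
    // "docker run" invocation, so every task gets its own container whose
    // memory is managed by the docker daemon rather than by YARN.
    val sc = new SparkContext(new SparkConf().setAppName("pipe-docker-sketch"))
    val input = sc.textFile("hdfs:///data/input")
    val processed = input.pipe("docker run --rm -i my-image /opt/process.sh")
    processed.saveAsTextFile("hdfs:///data/output")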

Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Holden Karau
So if the process you're communicating with from Spark isn't launched inside of its YARN container then it shouldn't be an issue - although it sounds like you may have multiple resource managers on the same machine, which can sometimes lead to interesting/difficult states. On Thu, Nov 24, 2016 at

Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Sameer Choudhary
Ok, that makes sense for processes directly launched via fork or exec from the task. However, in my case the piped command hands off to the docker daemon, which starts the new process. This process runs in a docker container. Will the container use memory from the YARN executor memory overhead as well? How will YARN know

Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Holden Karau
YARN will kill your processes if the child processes you start via pipe() consume too much memory. You can configure the amount of memory Spark leaves aside for processes other than the JVM in the YARN containers with spark.yarn.executor.memoryOverhead. On Wed, Nov 23, 2016 at 10:38 PM, Sameer
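As a hedged sketch, the overhead setting Holden mentions can be raised when building the SparkConf (or via --conf on spark-submit); the 2048 figure below is only an illustrative value, not a recommendation.

    import org.apache.spark.SparkConf

    // Reserve extra room in each YARN container for non-JVM children of the
    // executor (e.g. processes launched through pipe()). In Spark 1.x the
    // value is in megabytes; 2048 is an arbitrary illustrative number.
    val conf = new SparkConf()
      .setAppName("pipe-with-overhead")
      .set("spark.yarn.executor.memoryOverhead", "2048")

YARN enforces the container's memory limit against the container's process tree, which is why children forked directly by the task count toward it, while processes started by a separate daemon (such as dockerd) generally do not.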

Yarn resource utilization with Spark pipe()

2016-11-23 Thread Sameer Choudhary
Hi, I am working on a Spark 1.6.2 application on a YARN-managed EMR cluster that uses RDD's pipe method to process my data. I start a lightweight daemon process that starts processes for each task via pipes. This is to ensure that I don't run into https://issues.apache.org/jira/browse/SPARK-671.
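A rough, hypothetical sketch of the pipe-through-a-daemon pattern described here; the client script name and socket path are placeholders invented for illustration, not the poster's actual setup.

    import org.apache.spark.rdd.RDD

    // Hypothetical sketch: each task pipes its records to a thin client
    // ("/opt/pipe-client.sh" is a placeholder) that asks a long-running local
    // daemon to do the real work, so the task never forks the heavy worker
    // process directly.
    def runThroughDaemon(input: RDD[String]): RDD[String] =
      input.pipe(
        Seq("/opt/pipe-client.sh"),
        env = Map("DAEMON_SOCKET" -> "/var/run/worker.sock"))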

Fwd: Yarn resource utilization with Spark pipe()

2016-11-20 Thread Sameer Choudhary
Hi, I am working on a Spark 1.6.2 application on a YARN-managed EMR cluster that uses RDD's pipe method to process my data. I start a lightweight daemon process that starts processes for each task via pipes. This is to ensure that I don't run into https://issues.apache.org/jira/browse/SPARK-671.