Hi,

I am working on a Spark 1.6.2 application on a YARN-managed EMR cluster
that uses RDD's pipe method to process my data. I start a lightweight
daemon process that launches a process for each task via pipes. This is
to ensure that I don't run into
https://issues.apache.org/jira/browse/SPARK-671.
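
For context, the setup looks roughly like this (a simplified sketch;
paths and script names are placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pipe-processing")
sc = SparkContext(conf=conf)

records = sc.textFile("hdfs:///data/input")

# Each task writes its partition's records to the client script's stdin;
# the script hands the work to the long-running daemon and echoes the
# results back on stdout.
processed = records.pipe("/opt/myapp/pipe_client.sh")

processed.saveAsTextFile("hdfs:///data/output")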

I'm running into Spark job failures due to task failures across the
cluster. Here are the questions that I think would help in
understanding the issue:

- How does resource allocation in PySpark work? How do YARN and
Spark track the memory consumed by the Python processes launched on the
worker nodes? (My current memory settings are sketched at the end of
this mail.)

- As an example, let's say Spark starts n tasks on a worker node.
These n tasks start n processes via pipe. Memory for the executors was
already reserved at application launch. As the processes run, their
memory footprint grows and eventually there is not enough memory on
the box. How will YARN and Spark behave in this case? Will the
executors be killed, or will my processes be killed, eventually failing
the task? I think this could lead to cascading task failures across the
cluster as retry attempts also fail, eventually terminating the Spark
job. Is there a way to avoid this? (One mitigation I'm considering is
sketched at the end of this mail.)

- When we define the number of executors in SparkConf, are they
distributed evenly across the nodes? One approach to get around this
problem would be to limit the number of executors YARN can launch on
each host, so that the memory for the piped processes is managed
outside of YARN. Is there a way to do this? (An example sizing is
sketched at the end of this mail.)
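
For reference, my current memory-related settings look roughly like
this (the values are illustrative):

from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("pipe-processing")
        # JVM heap per executor.
        .set("spark.executor.memory", "4g")
        # Off-heap headroom YARN adds to each container on top of the heap
        # (defaults to max(384 MB, 10% of executor memory) in Spark 1.6).
        .set("spark.yarn.executor.memoryOverhead", "1024")
        # Memory used by PySpark worker processes during aggregation;
        # not a hard cap on arbitrary child processes.
        .set("spark.python.worker.memory", "512m"))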
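
One mitigation I'm considering for the second question is to have the
daemon cap the address space of each process it launches, so a runaway
process fails on its own allocation instead of exhausting the node. A
rough sketch (the limit value is illustrative):

import resource
import subprocess

# Illustrative per-process cap: 2 GB of address space.
CHILD_ADDRESS_SPACE_LIMIT = 2 * 1024 * 1024 * 1024

def _limit_child_memory():
    # Runs in the forked child just before exec; allocations beyond the
    # cap fail with ENOMEM inside that child instead of exhausting the node.
    resource.setrlimit(resource.RLIMIT_AS,
                       (CHILD_ADDRESS_SPACE_LIMIT, CHILD_ADDRESS_SPACE_LIMIT))

def launch_worker(cmd):
    # The worker reads work on stdin and writes results on stdout.
    return subprocess.Popen(cmd,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            preexec_fn=_limit_child_memory)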
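
For the third question, the kind of sizing I have in mind would look
something like this (illustrative numbers, assuming roughly 32 GB of
each node is given to YARN and the remaining RAM is left for the daemon
and its piped processes):

from pyspark import SparkConf

# Each container requests roughly 12g + 4g = 16 GB, so at most two
# executors fit on a node whose NodeManager offers ~32 GB, which bounds
# the number of executors YARN can place on each host.
conf = (SparkConf()
        .setAppName("pipe-processing")
        .set("spark.executor.instances", "8")
        .set("spark.executor.cores", "4")
        .set("spark.executor.memory", "12g")
        .set("spark.yarn.executor.memoryOverhead", "4096"))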

Thanks,
Sameer
