So I'm running PySpark 1.3.1 on Amazon EMR on a fairly beefy cluster (20 nodes, each with 32 cores and 64 GB of memory), and my spark.default.parallelism, spark.executor.instances, spark.executor.cores and spark.executor.memory settings also seem fairly reasonable (600, 20, 30 and 48g respectively).
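For reference, here is a sketch of how the settings above would be passed on the command line. The property names are Spark's standard configuration keys; the master setting and the script name are placeholders, not from my actual job:

```
# Sketch: the configuration described above as spark-submit flags.
# "yarn-client" and "my_job.py" are placeholders.
spark-submit \
  --master yarn-client \
  --conf spark.default.parallelism=600 \
  --conf spark.executor.instances=20 \
  --conf spark.executor.cores=30 \
  --conf spark.executor.memory=48g \
  my_job.py
```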
However, my job invariably fails when it tries to use a 200 MB broadcast variable in a closure: YARN starts killing containers for running beyond physical memory limits. Looking at the NodeManager logs on my slaves, it seems that PySpark is spawning too many pyspark.daemon worker processes, which together use more off-heap memory than the container allows, and playing with the spark.yarn.executor.memoryOverhead property doesn't seem to make much of a difference. Has anyone else come across this before?

- Amey

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-failing-on-a-mid-sized-broadcast-tp25520.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
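Some back-of-the-envelope arithmetic on the numbers above suggests why the default overhead budget may be nowhere near enough. This is a sketch under two assumptions that are not stated in the post itself: that Spark 1.3's default spark.yarn.executor.memoryOverhead is max(384, 7% of executor memory) MB, as the 1.3-era running-on-YARN docs describe, and that in the worst case one Python worker can be running per executor core, each holding its own deserialized copy of the broadcast:

```python
# Rough container-memory arithmetic for the setup described above.
# Assumptions (not from the original post): default overhead is
# max(384, 0.07 * executor memory) MB, and up to one pyspark.daemon
# worker per executor core, each with its own copy of the broadcast.

EXECUTOR_MEMORY_MB = 48 * 1024   # spark.executor.memory = 48g
EXECUTOR_CORES = 30              # spark.executor.cores = 30
BROADCAST_MB = 200               # the mid-sized broadcast from the post

def default_overhead_mb(executor_memory_mb):
    """Assumed Spark 1.3 default: max(384 MB, 7% of executor memory)."""
    return max(384, int(executor_memory_mb * 0.07))

overhead = default_overhead_mb(EXECUTOR_MEMORY_MB)
container_limit = EXECUTOR_MEMORY_MB + overhead

# Worst case: every core has a Python worker with its own broadcast copy.
python_worker_mem = EXECUTOR_CORES * BROADCAST_MB

print(f"overhead budget:      {overhead} MB")        # 3440 MB
print(f"container limit:      {container_limit} MB") # 52592 MB
print(f"python workers alone: {python_worker_mem} MB")  # 6000 MB
print(f"deficit:              {python_worker_mem - overhead} MB")  # 2560 MB
```

Under these assumptions the Python workers alone could need roughly 6 GB of off-heap memory against a ~3.4 GB overhead budget, so the levers would seem to be raising spark.yarn.executor.memoryOverhead well past that ~2.5 GB deficit, or capping the number of concurrent workers by lowering spark.executor.cores.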