Hi,

I finally solved the problem by setting spark.yarn.executor.memoryOverhead, passing the option --conf "spark.yarn.executor.memoryOverhead=xxxx" to spark-submit, as pointed out in http://stackoverflow.com/questions/28404714/yarn-why-doesnt-task-go-out-of-heap-space-but-container-gets-killed and https://issues.apache.org/jira/browse/SPARK-2444, and now it works fine.
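For reference, the submit command ends up looking roughly like this (the class name, jar and the concrete overhead value below are placeholders, not my exact ones):

spark-submit \
  --master yarn-cluster \
  --class com.example.MyJob \
  --executor-memory 7g \
  --conf "spark.yarn.executor.memoryOverhead=1024" \
  my-job.jar

The overhead (a plain number of MB in Spark 1.x) is added on top of --executor-memory when YARN sizes the container, and the subprocess launched by pipe() lives outside the JVM heap, so it is this extra headroom that has to be large enough for the python script's peak memory.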
Greetings,

Juan

2015-02-23 10:40 GMT+01:00 Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com>:

> Hi,
>
> I'm having problems using pipe() from a Spark program written in Java,
> where I call a python script, running on a YARN cluster. The problem is
> that the job fails when YARN kills the container because the python
> script goes beyond the memory limits. I get something like this in the
> log:
>
> 01_000004. Exit status: 143. Diagnostics: Container
> [pid=6976,containerID=container_1424279690678_0078_01_000004] is running
> beyond physical memory limits. Current usage: 7.5 GB of 7.5 GB physical
> memory used; 8.6 GB of 23.3 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1424279690678_0078_01_000004 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 6976 1457 6976 6976 (bash) 0 0 108613632 338 /bin/bash -c /usr/java/jdk1.7.0_71/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms7048m -Xmx7048m -Djava.io.tmpdir=/mnt/data1/hadoop/yarn/local/usercache/root/appcache/application_1424279690678_0078/container_1424279690678_0078_01_000004/tmp '-Dspark.driver.port=33589' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@slave3.lambdoop.com:33589/user/CoarseGrainedScheduler 5 slave1.lambdoop.com 1 application_1424279690678_0078 1> /mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004/stdout 2> /mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004/stderr
> |- 10513 6982 6976 6976 (python2.7) 9308 1224 448360448 13857 /usr/local/bin/python2.7 /mnt/my_script.py my_args
> |- 6982 6976 6976 6976 (java) 115176 12032 8632229888 1951974 /usr/java/jdk1.7.0_71/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms7048m -Xmx7048m -Djava.io.tmpdir=/mnt/data1/hadoop/yarn/local/usercache/root/appcache/application_1424279690678_0078/container_1424279690678_0078_01_000004/tmp -Dspark.driver.port=33589 -Dspark.ui.port=0 -Dspark.yarn.app.container.log.dir=/mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@slave3.lambdoop.com:33589/user/CoarseGrainedScheduler 5 slave1.lambdoop.com 1 application_1424279690678_0078
>
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
>
> I find this strange because the python script processes each input line
> separately and makes a simple, independent calculation per line: it
> basically parses the line, calculates the Haversine distance, and
> returns a double value. Input lines are traversed in python with a loop
> "for line in sys.stdin". Also, to avoid memory leaks in Python:
>
> - I call sys.stdout.flush() after each output line generated by python.
> - I call the following function after writing each output line, to
> force garbage collection regularly in Python:
>
> import gc
>
> _iterations_until_gc = 1000
> iterations_since_gc = 0
>
> def update_garbage_collector():
>     global iterations_since_gc
>     if iterations_since_gc >= _iterations_until_gc:
>         gc.collect()
>         iterations_since_gc = 0
>     else:
>         iterations_since_gc += 1
>
> So the memory consumption of the script should be constant, but in
> practice it looks like there is some memory leak. Maybe Spark introduces
> a memory leak when redirecting the IO in pipe()? Have any of you
> experienced similar situations when using pipe() in Spark? Also, do you
> know how I could control the amount of memory reserved for the
> subprocess created by pipe()? I understand that with --executor-memory I
> set the memory for the Spark executor process, but not for the
> subprocess created by pipe().
>
> Thanks in advance for your help.
>
> Greetings,
>
> Juan
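For completeness, the piped script is essentially of this shape (a simplified sketch: the comma-separated input format and the field layout here are made up, not the real ones):

import gc
import sys
from math import radians, sin, cos, asin, sqrt

_iterations_until_gc = 1000
iterations_since_gc = 0

def update_garbage_collector():
    # Force a gc.collect() every _iterations_until_gc output lines.
    global iterations_since_gc
    if iterations_since_gc >= _iterations_until_gc:
        gc.collect()
        iterations_since_gc = 0
    else:
        iterations_since_gc += 1

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in km, using a 6371 km Earth radius.
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

for line in sys.stdin:
    # Hypothetical input format: "lat1,lon1,lat2,lon2" per line.
    lat1, lon1, lat2, lon2 = [float(x) for x in line.strip().split(',')]
    sys.stdout.write('%f\n' % haversine(lat1, lon1, lat2, lon2))
    sys.stdout.flush()  # flush per line so output is not buffered in the pipe
    update_garbage_collector()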