Hi. I've recently been running into the following issue with one of my Python streaming jobs. The job talks to a MySQL database in the reducer step, via gevent and a trusted MySQL client library. The reducers fail with the error message below, and given enough failures the job as a whole of course fails.
2011-09-21 07:14:21,750 WARN org.apache.hadoop.mapred.TaskTracker (main): Error running child
java.io.IOException: subprocess exited with error code 139
R/W/S=7468/0/0 in:414=7468/18 [rec/s] out:0=0/18 [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=hadoop
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Wed Sep 21 07:14:21 UTC 2011
Broken pipe
	at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:131)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:415)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-09-21 07:14:21,753 INFO org.apache.hadoop.mapred.TaskRunner (main): Runnning cleanup for the task

Does this error message indicate an actual segfault in the reducer tasks? I've been working under the assumption that it does, since exit code 139 is 128 + 11, i.e. the child was killed by SIGSEGV, but I haven't been able to retrieve any coredumps from the failing reducers.

I've set the proper core limit on all nodes, as well as a fixed coredump location so that dumps won't be cleaned up automatically by the tasktracker along with the task directories. I've also run simple streaming jobs whose Python reducers just sleep, and manually issued "kill -11" against those Python processes. In all of those tests I see the same error message as above and find coredumps in the expected location, so the setup does seem to work for the test jobs; the real job, however, never leaves a coredump behind. Is there anything else that must be configured in order to retrieve coredumps from streaming jobs?

Thanks!
Kai Ju
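
P.S. In case it helps anyone reproduce this, the sleep reducer I use for the kill tests is roughly the minimal sketch below, not my exact script. The RLIMIT_CORE print is just for illustration, to show what core limit the task actually inherits from the tasktracker.

    #!/usr/bin/env python
    # Minimal streaming reducer for testing coredump retrieval: it reports
    # the core limit it actually inherited, echoes its input through, and
    # then sleeps long enough to be hit with "kill -11" by hand.
    import resource
    import sys
    import time

    # The tasktracker launches the child process itself, so the limit seen
    # here may differ from the one set in a login shell on the node.
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    sys.stderr.write("RLIMIT_CORE: soft=%s hard=%s\n" % (soft, hard))

    # Pass the reducer input straight through so the job still has output.
    for line in sys.stdin:
        sys.stdout.write(line)
    sys.stdout.flush()

    # Stay alive so the segfault can be triggered manually.
    time.sleep(600)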