Hi, 
This is a 4-node Hadoop cluster running on CentOS 6.3 with Oracle JDK 1.6.0_43 (64-bit). Each node has 32G of memory, with a maximum of 8 mapper tasks and 4 reducer tasks configured. The Hadoop version is 1.0.4.
It is set up on DataStax DSE 3.0.2, which uses Cassandra CFS as the underlying DFS instead of HDFS with a NameNode. I understand this kind of setup is not widely tested with Hadoop MR, but my guess is that these MR errors are not related to it.
I am running a simple MR job that partitions data by DATE across 700G of data in 600 files. The MR logic is very straightforward (roughly the shape sketched below), but in this staging environment I saw a lot of reducers fail with the exit code -1 error shown in item 1. I want to understand the reason and fix it.
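
For context, here is a rough sketch of the mapper side, using the old mapred API from Hadoop 1.0.4. The record layout (tab-separated, date in the first column) is made up purely for illustration and is not our real format; the idea is simply to key every record by its date so the reducers can write out one partition per day.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative only: key each input record by its date field so that the
// reducers receive all records for a given day together.
public class DatePartitionMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text record,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Hypothetical record layout: tab-separated, date (yyyy-MM-dd) in column 0
        String[] fields = record.toString().split("\t", -1);
        if (fields.length > 0) {
            output.collect(new Text(fields[0]), record);
        }
    }
}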
1) There is no log related to this error in the reducer task attempt log under the userlogs directory. The only related entry is in system.log, which is generated by the Cassandra process:

INFO [JVM Runner jvm_201308141528_0003_r_625176200 spawned.] 2013-08-15 07:28:59,326 JvmManager.java (line 510) JVM : jvm_201308141528_0003_r_625176200 exited with exit code -1. Number of tasks it ran: 0
2) I believe this error is related to system resources, but I cannot find anything through Google that points to the root cause. From the log, it appears the JVM for the reducer task terminated or crashed, but I don't know why.
3) I checked the limits of the user the process runs under; here is the info, and I didn't spot any obvious problems:

-bash-4.1$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256589
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 400000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 32768
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

(The sketch right after this item is how I could double-check the limits the child JVMs actually inherit from the TaskTracker, since the output above is from a login shell.)
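
One idea to rule out a limits problem at the child JVM level: a small helper I could call from a task's configure() method to dump the JVM max heap and /proc/self/limits to stderr, so the actual limits of the child process (rather than of my login shell) show up in the task attempt log. This is just a sketch; it is Linux-specific and the class name is made up.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical diagnostic helper: dump the limits this JVM actually runs under.
public class ChildLimitsDump {
    public static void dump() {
        System.err.println("JVM max heap (bytes): " + Runtime.getRuntime().maxMemory());
        BufferedReader in = null;
        try {
            // Linux-only: per-process limits of this very JVM
            in = new BufferedReader(new FileReader("/proc/self/limits"));
            String line;
            while ((line = in.readLine()) != null) {
                System.err.println(line);
            }
        } catch (IOException e) {
            System.err.println("Could not read /proc/self/limits: " + e);
        } finally {
            if (in != null) {
                try { in.close(); } catch (IOException ignored) {}
            }
        }
    }
}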
4) Since this is a new cluster, not many Hadoop settings have been changed from their default values. I did run the reducers with '-mx2048m' to set the JVM heap size to 2G, because the first time around the reducers failed with an OOM error. From googling around, it looks like people recommend setting "mapred.child.ulimit" to 3x the heap size, which would be around 6G in this case. I can give that a try (a sketch of how I would set these two properties follows), but on the nodes the virtual memory is already unlimited for the user the tasks run under, so I am not sure this will really fix it.
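
For reference, this is roughly how I would set those two properties on the JobConf (old mapred API, Hadoop 1.x property names). The values are just the 2G heap I already use plus the 3x rule of thumb, and as far as I know mapred.child.ulimit is specified in kilobytes.

import org.apache.hadoop.mapred.JobConf;

// Sketch only: standard Hadoop 1.x property names, values from the 3x rule of thumb.
public class TaskMemorySettings {
    public static void apply(JobConf conf) {
        // 2 GB heap for each child task JVM
        conf.set("mapred.child.java.opts", "-Xmx2048m");
        // mapred.child.ulimit is a virtual memory limit in KB; ~6 GB here
        conf.set("mapred.child.ulimit", String.valueOf(6L * 1024 * 1024));
    }
}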
5) Another possibility I found on Google is that the child process returns -1 when it fails to write to the user logs, since Linux EXT3 has a limit on how many files/subdirectories can be created under one directory (32k?). But my system is using EXT4, and not that many MR jobs have run so far.
6) I am really not sure what the root cause is, as exit code -1 could mean many things. Can anyone here give me more hints, or any help with debugging this issue in my environment? Is there any Hadoop or JVM setting I can use to dump more info/logs about why the JVM terminated at runtime with exit code -1? (A sketch of the diagnostic settings I am thinking of adding follows.)
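
For example, I am considering something along these lines: add GC logging, an OOM heap dump, and an hs_err file location to the child JVM options, and keep failed task files around for inspection. The paths and class name here are placeholders, and I have not verified whether these flags will capture anything if the JVM is killed externally rather than crashing on its own.

import org.apache.hadoop.mapred.JobConf;

// Sketch only: extra diagnostics for the child JVMs. Paths are placeholders.
public class ReducerDiagnostics {
    public static void apply(JobConf conf) {
        conf.set("mapred.child.java.opts",
                "-Xmx2048m"
                + " -verbose:gc"                           // GC activity in the task logs
                + " -XX:+HeapDumpOnOutOfMemoryError"       // heap dump if the heap blows up
                + " -XX:HeapDumpPath=/tmp"                 // placeholder path
                + " -XX:ErrorFile=/tmp/hs_err_pid%p.log"); // hs_err log if the JVM crashes
        // Keep the attempt directories of failed tasks for post-mortem inspection
        conf.setKeepFailedTaskFiles(true);
    }
}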
Thanks
Yong
