In the last 4-5 days, the TaskTracker on one of my slave machines has gone down a couple of times. It had been working fine for the previous 4-5 months.

The cluster is a 4-machine cluster on AWS:
1 m2.xlarge master
3 m2.xlarge slaves

The cluster is dedicated to running Hive queries, with the data residing on S3.

The slave on which the TaskTracker went down had the following log:

*******************************************************************
2013-06-11 00:26:30,968 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60659, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005693_0, duration: 279198
2013-06-11 00:26:30,971 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.191.**.***:37605, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 193135
2013-06-11 00:26:30,971 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60630, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 192011
2013-06-11 00:26:30,972 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60656, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005693_0, duration: 178209
2013-06-11 00:26:30,973 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.8.***.**:45321, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005694_0, duration: 186452
2013-06-11 00:26:30,973 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60659, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005694_0, duration: 157360
2013-06-11 00:26:30,974 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.8.***.**:45321, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 157555
2013-06-11 00:26:30,991 INFO org.apache.hadoop.mapred.JvmManager: JVM Not killed jvm_201306071409_0151_m_-435659475 but just removed
2013-06-11 00:26:30,991 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201306071409_0151_m_-435659475 exited with exit code 0. Number of tasks it ran: 0
2013-06-11 00:26:30,991 ERROR org.apache.hadoop.mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker.
org.apache.hadoop.fs.FSError: java.io.IOException: Broken pipe
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:200)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:220)
    at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:315)
    at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:148)
    at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
    at java.io.BufferedWriter.close(BufferedWriter.java:265)
    at java.io.PrintWriter.close(PrintWriter.java:312)
    at org.apache.hadoop.mapred.TaskController.writeCommand(TaskController.java:231)
    at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:126)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:497)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:471)
Caused by: java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:297)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:198)
    ... 13 more
2013-06-11 00:26:31,007 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201306071409_0151_m_-495709221
2013-06-11 00:26:31,008 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60656, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005694_0, duration: 222430
2013-06-11 00:26:31,008 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60653, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005693_0, duration: 154027
2013-06-11 00:26:31,008 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60659, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 132067
2013-06-11 00:26:31,326 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201306071409_0151_m_-495709221 spawned.
2013-06-11 00:26:31,328 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /mnt/app/hadoop-tmp/ttprivate/taskTracker/piyushv/jobcache/job_201306071409_0151/attempt_201306071409_0151_m_005717_0/taskjvm.sh
2013-06-11 00:26:31,331 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60656, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 437236
2013-06-11 00:26:31,332 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TaskTracker at ip-10-191-**-***/10.191.**.***
************************************************************/

--
RAVI SHETYE