Re: Map/Reduce Tasks Fails
Sandeep,

Is the same DN 10.0.25.149 reported across all failures? And do you notice any machine patterns when observing the failed tasks (i.e. are they clumped on one or a few particular TTs repeatedly)?

On Tue, May 22, 2012 at 7:32 PM, Sandeep Reddy P <sandeepreddy.3...@gmail.com> wrote:

> Hi,
> We have a 5-node cdh3u4 cluster running. When I try to do teragen/terasort, some of the map tasks are failed/killed, and the logs show a similar error on all machines:
>
> 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.0.25.149:50010 java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835 remote=/10.0.25.149:50010]
> 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7260720956806950576_1825
> 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.0.25.149:50010
> 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent died. Exiting attempt_201205211504_0007_m_16_1.
>
> Are these kinds of errors common? At least one map task is failing for the above reason on every machine. We are using 24 mappers for teragen. It took us 3 hrs 44 min 17 sec to generate 50 GB of data with 24 mappers (17 failed / 8 killed task attempts), and 24 min 10 sec for 5 GB with 24 mappers (9 killed task attempts). The cluster works well for small datasets.

--
Harsh J
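The 69000 ms figure in the SocketTimeoutException corresponds to the DFSClient read timeout (dfs.socket.timeout, 60 s by default, plus a per-node pipeline extension). If the underlying cause turns out to be slow disks, one commonly tried workaround (a mitigation, not a root-cause fix) is to raise the socket timeouts in hdfs-site.xml. A hedged sketch — the property names are the Hadoop 0.20/CDH3-era ones; the values below are purely illustrative:

```xml
<!-- hdfs-site.xml: illustrative values only, not a recommendation -->
<property>
  <name>dfs.socket.timeout</name>
  <value>180000</value> <!-- client read timeout in ms (default 60000) -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value> <!-- datanode write timeout in ms (default 480000) -->
</property>
```

Note that longer timeouts only hide slow I/O; it is worth confirming disk and network health before reaching for this.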
Re: Map/Reduce Tasks Fails
What kind of storage is attached to the data nodes? This kind of error can happen when the CPU is really busy with I/O or interrupts. Can you run top or dstat on some of the data nodes to see how the system is performing?

Raj

From: Sandeep Reddy P <sandeepreddy.3...@gmail.com>
To: common-user@hadoop.apache.org
Sent: Tuesday, May 22, 2012 7:23 AM
Subject: Re: Map/Reduce Tasks Fails

Task Trackers:

tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 (host hadoop2.liaisondevqa.local, http://hadoop2.liaisondevqa.local:50060/)
  Running tasks: 0 | Max map tasks: 6 | Max reduce tasks: 2 | Task failures: 22 | Directory failures: 0 | Node health status: N/A | Seconds since node last healthy: 0
  Tasks since start: 93 total / 60 succeeded | Last day: 59 / 28 | Last hour: 64 / 38 | Seconds since heartbeat: 0

tracker_hadoop4.liaisondevqa.local:localhost/127.0.0.1:40363 (host hadoop4.liaisondevqa.local, http://hadoop4.liaisondevqa.local:50060/)
  Running tasks: 0 | Max map tasks: 6 | Max reduce tasks: 2 | Task failures: 19 | Directory failures: 0 | Node health status: N/A | Seconds since node last healthy: 0
  Tasks since start: 91 total / 59 succeeded | Last day: 65 / 33 | Last hour: 36 / 33 | Seconds since heartbeat: 0

tracker_hadoop5.liaisondevqa.local:localhost/127.0.0.1:46605 (host hadoop5.liaisondevqa.local, http://hadoop5.liaisondevqa.local:50060/)
  Running tasks: 1 | Max map tasks: 6 | Max reduce tasks: 2 | Task failures: 21 | Directory failures: 0 | Node health status: N/A | Seconds since node last healthy: 0
  Tasks since start: 83 total / 47 succeeded | Last day: 69 / 35 | Last hour: 45 / 19 | Seconds since heartbeat: 0

tracker_hadoop3.liaisondevqa.local:localhost/127.0.0.1:37305 (host hadoop3.liaisondevqa.local, http://hadoop3.liaisondevqa.local:50060/)
  Running tasks: 0 | Max map tasks: 6 | Max reduce tasks: 2 | Task failures: 18 | Directory failures: 0 | Node health status: N/A | Seconds since node last healthy: 0
  Tasks since start: 87 total / 55 succeeded | Last day: 55 / 28 | Last hour: 57 / 34 | Seconds since heartbeat: 0

Highest Failures: tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 with 22 failures
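Harsh's earlier question — whether the failures are clumped on one or a few TaskTrackers — can be answered from the tracker table above. A quick sketch (failure counts transcribed from the table; the interpretation threshold is arbitrary):

```python
# Task failures per TaskTracker, transcribed from the tracker table above.
failures = {
    "hadoop2": 22,
    "hadoop4": 19,
    "hadoop5": 21,
    "hadoop3": 18,
}

total = sum(failures.values())
# Each node's share of all failures; a single bad node would dominate.
shares = {node: count / total for node, count in failures.items()}
for node, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {failures[node]} failures ({share:.0%})")
```

The shares come out almost even (roughly 22-28% per node), so the failures are not clumped on one TaskTracker — consistent with a cluster-wide cause such as slow storage rather than one bad machine.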
Re: Map/Reduce Tasks Fails
Seems like a question better suited for the Cloudera lists...

On May 22, 2012, at 7:02 AM, Sandeep Reddy P wrote:

> Hi,
> We have a 5-node cdh3u4 cluster running. When I try to do teragen/terasort, some of the map tasks are failed/killed, and the logs show a similar error on all machines:
>
> 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.0.25.149:50010 java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835 remote=/10.0.25.149:50010]
> 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7260720956806950576_1825
> 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.0.25.149:50010
> 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent died. Exiting attempt_201205211504_0007_m_16_1.
>
> Are these kinds of errors common? At least one map task is failing for the above reason on every machine. We are using 24 mappers for teragen. It took us 3 hrs 44 min 17 sec to generate 50 GB of data with 24 mappers (17 failed / 8 killed task attempts), and 24 min 10 sec for 5 GB with 24 mappers (9 killed task attempts). The cluster works well for small datasets.

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Re: Map/Reduce Tasks Fails
I got similar errors with Apache Hadoop 1.0.0.

Thanks,
Sandeep.
Re: Map/Reduce Tasks Fails
Raj,

Top output from one datanode at the time I get the error from that machine:

top - 14:10:15 up 23:12,  1 user,  load average: 13.45, 12.91, 8.31
Tasks: 187 total,   1 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.7%us,  0.4%sy,  0.0%ni,  0.0%id, 98.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8061608k total,  7927124k used,   134484k free,    19316k buffers
Swap:  2097144k total,      384k used,  2096760k free,  6694656k cached

  PID USER    PR NI  VIRT  RES SHR S %CPU %MEM    TIME+ COMMAND
 1622 hdfs    20  0 1619m 157m 11m S  2.0  2.0 33:55.42 java
14712 mapred  20  0  709m 119m 11m S  1.3  1.5  0:10.06 java
 1706 mapred  20  0 1588m 126m 11m S  1.0  1.6 24:51.69 java
14663 mapred  20  0  708m  89m 11m S  1.0  1.1  0:11.23 java
14686 mapred  20  0  714m 106m 11m S  0.7  1.4  0:11.53 java
14762 mapred  20  0  710m  89m 11m S  0.7  1.1  0:10.05 java
14640 mapred  20  0  704m 119m 11m S  0.3  1.5  0:11.36 java

Error message:

12/05/22 14:09:52 INFO mapred.JobClient: Task Id : attempt_201205211504_0009_m_02_0, Status : FAILED
java.io.IOException: All datanodes 10.0.24.175:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3181)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2720)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2892)
attempt_201205211504_0009_m_02_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201205211504_0009_m_02_0: log4j:WARN Please initialize the log4j system properly.

But other map tasks are running on the same datanode.

Thanks,
Sandeep.
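The key number in that top output is 98.9%wa (I/O wait) against roughly 1% in us/sy: the CPUs are almost entirely idle waiting on disk, which matches Raj's I/O hypothesis and explains the socket timeouts. A small sketch that extracts the fields from a top "Cpu(s):" line like the one above:

```python
import re

# "Cpu(s):" line copied from the top output above.
cpu_line = ("Cpu(s): 0.7%us, 0.4%sy, 0.0%ni, 0.0%id, "
            "98.9%wa, 0.0%hi, 0.1%si, 0.0%st")

def parse_cpu(line):
    """Return {'us': 0.7, 'wa': 98.9, ...} from a top Cpu(s) line."""
    return {name: float(val)
            for val, name in re.findall(r"([\d.]+)%(\w+)", line)}

stats = parse_cpu(cpu_line)
print(f"iowait: {stats['wa']}%  user+sys: {stats['us'] + stats['sy']:.1f}%")
if stats["wa"] > 50:
    print("Node is I/O-bound: CPUs are stalled on disk, not computing.")
```

With iowait this high and a load average of 13 on the box, the datanode cannot service block writes within the 69 s timeout, so the client marks it bad even though other tasks still run there.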