Re: Map/Reduce Tasks Fails

2012-05-22 Thread Harsh J
Sandeep,

Is the same DN 10.0.25.149 reported across all failures? And do you
notice any machine patterns when observing the failed tasks (i.e. are
they clumped on any one or a few particular TTs repeatedly)?

On Tue, May 22, 2012 at 7:32 PM, Sandeep Reddy P
sandeepreddy.3...@gmail.com wrote:
 Hi,
 We have a 5-node cdh3u4 cluster running. When I try to run teragen/terasort,
 some of the map tasks fail or get killed, and the logs show a similar error on
 all machines.

 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
 Exception in createBlockOutputStream 10.0.25.149:50010
 java.net.SocketTimeoutException: 69000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
 remote=/10.0.25.149:50010]
 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
 Abandoning block blk_7260720956806950576_1825
 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
 Excluding datanode 10.0.25.149:50010
 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
 died.  Exiting attempt_201205211504_0007_m_16_1.
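
 The 69000 ms in the first log line looks like a derived value rather than a round setting. A worked check, assuming Hadoop 1.x defaults (a `dfs.socket.timeout` of 60000 ms, to which the client adds 3000 ms per datanode in the write pipeline) and a three-datanode pipeline for replication factor 3; neither value is stated in the logs:

$$
t_{\text{read}} = t_{\text{socket}} + n_{\text{pipeline}} \cdot t_{\text{ext}} = 60000 + 3 \times 3000 = 69000\ \text{ms}
$$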



 Are these kinds of errors common? At least one map task is failing for the
 above reason on every machine. We are using 24 mappers for teragen.
 It took us 3 hr 44 min 17 sec to generate 50 GB of data with 24 mappers,
 with 17 failed and 8 killed task attempts.

 For 5 GB of data it took 24 min 10 sec with 24 mappers and 9 killed task attempts.
 The cluster works well for small datasets.
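
 For scale, the two teragen timings imply roughly the same aggregate write rate. A worked check, assuming decimal units (1 GB = 10^9 bytes) and counting each logical byte once (HDFS replication multiplies the physical writes):

$$
\frac{50 \times 10^{9}\ \text{B}}{(3 \cdot 3600 + 44 \cdot 60 + 17)\ \text{s}} = \frac{50 \times 10^{9}\ \text{B}}{13457\ \text{s}} \approx 3.7\ \text{MB/s},
\qquad
\frac{5 \times 10^{9}\ \text{B}}{(24 \cdot 60 + 10)\ \text{s}} = \frac{5 \times 10^{9}\ \text{B}}{1450\ \text{s}} \approx 3.4\ \text{MB/s}
$$

 So per byte, the 5 GB run was not faster than the 50 GB run.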



-- 
Harsh J


Re: Map/Reduce Tasks Fails

2012-05-22 Thread Raj Vishwanathan

Re: Map/Reduce Tasks Fails

2012-05-22 Thread Raj Vishwanathan
What kind of storage is attached to the data nodes? This kind of error can
happen when the CPU is very busy with I/O or interrupts.

Can you run top or dstat on some of the data nodes to see how the system is
performing?

Raj
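
The same iowait figure can also be sampled directly. A minimal Python sketch, assuming a Linux host where the first line of /proc/stat holds the aggregate `cpu` counters in the order given by proc(5) (user, nice, system, idle, iowait, irq, softirq, steal); all function names are hypothetical. It computes the share of CPU time spent in each state between two samples, roughly the quantity top reports as %wa.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat, per proc(5).
# Later kernels append guest/guest_nice, which are already counted in
# user/nice, so only the first eight columns are summed here.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return the first eight counters of an aggregate 'cpu' line as a dict."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels report fewer columns; treat missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of CPU time spent in each state between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {k: b[k] - a[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no CPU time elapsed between the samples")
    return {k: 100.0 * v / total for k, v in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first (aggregate) line of /proc/stat."""
    with open(path) as f:
        return f.readline()


def sample_cpu_percentages(interval=1.0, path="/proc/stat"):
    """Sample /proc/stat twice, `interval` seconds apart (Linux only)."""
    first = read_cpu_line(path)
    time.sleep(interval)
    second = read_cpu_line(path)
    return cpu_percentages(first, second)
```

On a datanode, `sample_cpu_percentages()["iowait"]` staying near 100 means the cores are mostly idle while waiting on disk I/O, rather than busy computing.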




 From: Sandeep Reddy P sandeepreddy.3...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Tuesday, May 22, 2012 7:23 AM
Subject: Re: Map/Reduce Tasks Fails
 
*Task Trackers*

Name                                                          Web UI
tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225  http://hadoop2.liaisondevqa.local:50060/
tracker_hadoop4.liaisondevqa.local:localhost/127.0.0.1:40363  http://hadoop4.liaisondevqa.local:50060/
tracker_hadoop5.liaisondevqa.local:localhost/127.0.0.1:46605  http://hadoop5.liaisondevqa.local:50060/
tracker_hadoop3.liaisondevqa.local:localhost/127.0.0.1:37305  http://hadoop3.liaisondevqa.local:50060/

Host (.liaisondevqa.local)        hadoop2  hadoop4  hadoop5  hadoop3
# running tasks                         0        0        1        0
Max map tasks                           6        6        6        6
Max reduce tasks                        2        2        2        2
Task failures                          22       19       21       18
Directory failures                      0        0        0        0
Node health status                    N/A      N/A      N/A      N/A
Seconds since node last healthy         0        0        0        0
Total tasks since start                93       91       83       87
Succeeded tasks since start            60       59       47       55
Total tasks last day                   59       65       69       55
Succeeded tasks last day               28       33       35       28
Total tasks last hour                  64       36       45       57
Succeeded tasks last hour              38       33       19       34
Seconds since heartbeat                 0        0        0        0

Highest failures: tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 with 22
failures
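
Harsh's question was whether failures clump on particular TaskTrackers. A minimal Python sketch that tabulates each tracker's share of the task failures and its succeeded/total ratio since start, using the counts as read from the Task Trackers table above (the digit boundaries in the flattened web-page dump are an editorial reading, and the helper names are hypothetical):

```python
# Counts read from the Task Trackers table in this thread, per host:
# (task failures, total tasks since start, succeeded tasks since start).
# The digit boundaries are an editorial reading of the flattened page dump.
TRACKERS = {
    "hadoop2": (22, 93, 60),
    "hadoop4": (19, 91, 59),
    "hadoop5": (21, 83, 47),
    "hadoop3": (18, 87, 55),
}


def failure_shares(trackers):
    """Each tracker's percentage of all task failures in the table."""
    total = sum(failures for failures, _, _ in trackers.values())
    return {host: 100.0 * failures / total
            for host, (failures, _, _) in trackers.items()}


def succeeded_fraction(trackers):
    """Succeeded tasks since start as a percentage of total tasks since start."""
    return {host: 100.0 * succeeded / total
            for host, (_, total, succeeded) in trackers.items()}


if __name__ == "__main__":
    shares = failure_shares(TRACKERS)
    succeeded = succeeded_fraction(TRACKERS)
    for host in TRACKERS:
        print(f"{host}: {shares[host]:.1f}% of failures, "
              f"{succeeded[host]:.1f}% of tasks succeeded")
```

With these counts the failure shares range from 22.5% (hadoop3) to 27.5% (hadoop2).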




Re: Map/Reduce Tasks Fails

2012-05-22 Thread Arun C Murthy
Seems like a question better suited for Cloudera lists...

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Map/Reduce Tasks Fails

2012-05-22 Thread Sandeep Reddy P
I got similar errors with Apache Hadoop 1.0.0.
Thanks,
Sandeep.


Re: Map/Reduce Tasks Fails

2012-05-22 Thread Sandeep Reddy P
Raj,
Here is top output from one datanode, taken when I got an error from that machine:

top - 14:10:15 up 23:12,  1 user,  load average: 13.45, 12.91, 8.31
Tasks: 187 total,   1 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.7%us,  0.4%sy,  0.0%ni,  0.0%id, 98.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8061608k total,  7927124k used,   134484k free,    19316k buffers
Swap:  2097144k total,      384k used,  2096760k free,  6694656k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1622 hdfs      20   0 1619m 157m  11m S  2.0  2.0  33:55.42 java
14712 mapred    20   0  709m 119m  11m S  1.3  1.5   0:10.06 java
 1706 mapred    20   0 1588m 126m  11m S  1.0  1.6  24:51.69 java
14663 mapred    20   0  708m  89m  11m S  1.0  1.1   0:11.23 java
14686 mapred    20   0  714m 106m  11m S  0.7  1.4   0:11.53 java
14762 mapred    20   0  710m  89m  11m S  0.7  1.1   0:10.05 java
14640 mapred    20   0  704m 119m  11m S  0.3  1.5   0:11.36 java

Error message:
12/05/22 14:09:52 INFO mapred.JobClient: Task Id : attempt_201205211504_0009_m_02_0, Status : FAILED
java.io.IOException: All datanodes 10.0.24.175:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3181)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2720)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2892)

attempt_201205211504_0009_m_02_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201205211504_0009_m_02_0: log4j:WARN Please initialize the log4j system properly.

But other map tasks are running on the same datanode.

Thanks,
Sandeep.