I even tried reducing the number of jobs, but it didn't help. This is what I see:

datanode logs:

Initializing secure datanode resources
Successfully obtained privileged resources (streaming port = ServerSocket[addr=/0.0.0.0,localport=50010] ) (http listener port = sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:50075])
Starting regular datanode initialization
26/04/2012 17:06:51 9858 jsvc.exec error: Service exit with a return value of 143
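
If I understand the convention right, a return value of 143 is 128 + 15, i.e. jsvc was killed by SIGTERM rather than crashing on its own, so something (an operator, an init script, or the OS) stopped the DataNode. A small JDK-only sketch of that convention (illustrative, assumes a Unix-like OS with a sleep binary):

public class ExitCodeDemo {
    public static void main(String[] args) throws Exception {
        // Start a long-running process, then terminate it the way an
        // operator or init script would terminate jsvc.
        Process p = new ProcessBuilder("sleep", "60").start();
        p.destroy(); // delivers SIGTERM on Unix
        // The JVM reports signal deaths as 128 + signal number, so this
        // prints 143 (128 + 15 for SIGTERM) on Linux.
        System.out.println(p.waitFor());
    }
}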

userlogs:

2012-04-26 19:35:22,801 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available
2012-04-26 19:35:22,801 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library loaded
2012-04-26 19:35:22,808 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2012-04-26 19:35:22,903 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /125.18.62.197:50010, add to deadNodes and continue
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClient.java:1664)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2383)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2056)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:97)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
        at java.io.InputStream.read(InputStream.java:85)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
        at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-04-26 19:35:22,906 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /125.18.62.204:50010, add to deadNodes and continue
java.io.EOFException
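
The EOFException at the top of that trace comes from DataInputStream.readShort(): the client asked the DataNode for a block, and the connection was closed before even the two-byte response header arrived, which would fit the DataNode process having been killed as above. A minimal JDK-only illustration of that failure mode (not Hadoop code, just the same readShort() call):

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class ShortReadDemo {
    public static void main(String[] args) throws IOException {
        // An empty stream stands in for a DataNode that closed the
        // connection before writing its two-byte response header.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(new byte[0]));
        try {
            in.readShort(); // the same call that fails at DFSClient.java:1664
        } catch (EOFException e) {
            System.out.println("EOFException, as in the task logs above");
        }
    }
}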

jobtracker logs:

2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Job job_201204261140_0244 added successfully for user 'hadoop' to queue 'default'
2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201204261140_0244
2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.AuditLogger: USER=hadoop  IP=125.18.62.196        OPERATION=SUBMIT_JOB    TARGET=job_201204261140_0244    RESULT=SUCCESS
2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201204261140_0244
2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 125.18.62.198:50010 java.io.IOException: Bad connect ack with firstBadLink as 125.18.62.197:50010
2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2499580289951080275_22499
2012-04-26 16:12:53,582 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 125.18.62.197:50010
2012-04-26 16:12:53,594 INFO org.apache.hadoop.mapred.JobInProgress: jobToken generated and stored with users keys in /data/hadoop/mapreduce/job_201204261140_0244/jobToken
2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201204261140_0244 = 73808305. Number of splits = 1
2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201204261140_0244_m_000000 has split on node:/default-rack/dsdb4.corp.intuit.net
2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201204261140_0244_m_000000 has split on node:/default-rack/dsdb5.corp.intuit.net
2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: job_201204261140_0244 LOCALITY_WAIT_FACTOR=0.4
2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201204261140_0244 initialized successfully with 1 map tasks and 0 reduce tasks.
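
As I understand it, "Bad connect ack with firstBadLink as 125.18.62.197:50010" means the first DataNode in the write pipeline (125.18.62.198) never got a valid ack from the next one (125.18.62.197), so the client abandoned the block and excluded that node. Likely causes would be that DataNode being down (consistent with the jsvc exit above), a firewall blocking port 50010 between nodes, or exhausted transceiver threads (the dfs.datanode.max.xcievers setting in hdfs-site.xml). A quick reachability probe between nodes, plain JDK sockets, nothing Hadoop-specific:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ProbeDataNode {
    public static void main(String[] args) {
        // Defaults point at the node reported as firstBadLink above;
        // pass a different host/port on the command line to test others.
        String host = args.length > 0 ? args[0] : "125.18.62.197";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 50010;
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 5000); // 5 s timeout
            System.out.println("Reachable: " + host + ":" + port);
        } catch (IOException e) {
            System.out.println("NOT reachable: " + host + ":" + port + " (" + e + ")");
        }
    }
}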

On Fri, Apr 27, 2012 at 7:50 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

>
>
>  On Thu, Apr 26, 2012 at 10:24 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Is only the same IP printed in all such messages? Can you check the DN
>> log in that machine to see if it reports any form of issues?
>>
> All IPs were logged with this message.
>
>
>> Also, did your jobs fail or kept going despite these hiccups? I notice
>> you're threading your clients though (?), but I can't tell if that may
>> cause this without further information.
>>
> It started with this error message, and slowly all the jobs died with
> "shortRead" errors.
> I am not sure about threading. I am using a Pig script to read the .gz files.
>
>
>> On Fri, Apr 27, 2012 at 5:19 AM, Mohit Anchlia <mohitanch...@gmail.com>
>> wrote:
>> > I had 20 mappers running in parallel, each reading one of 20 .gz files
>> > of around 30-40 MB, over 5 Hadoop nodes, and then writing to the
>> > analytics database. About midway through, they started getting this error:
>> >
>> >
>> > 2012-04-26 16:13:53,723 [Thread-8] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 17.18.62.192:50010 java.io.IOException: Bad connect ack with firstBadLink as 17.18.62.191:50010
>> >
>> > I am trying to look at the logs, but they don't say much. What could be
>> > the reason? We are on a fairly closed, reliable network and all machines
>> > are up.
>>
>>
>>
>> --
>> Harsh J
>>
>
>
