We are narrowing this down. The last few times it hung, we found a 'du -sk'
process for each of our HDFS disks among the top CPU consumers, and those
processes were also taking a very long time to finish.

Searching around, I found one report of someone hitting a similar issue with
'du -sk', but they tied it to XFS. We are using ext3.

Since the hang appears to be related to the 'du' processes not coming back,
does anyone have any other ideas? Note that running the same command by hand
finishes in a few seconds.
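
For anyone who wants to compare: as far as I can tell the DataNode itself
shells out to 'du -sk' on each data directory (org.apache.hadoop.fs.DU), so
the rough sketch below just times the same command from a separate JVM. The
directory paths are placeholders, not our real dfs.data.dir entries.

// Sketch only: time `du -sk` against each data directory the way the
// DataNode's DU helper does, to compare against the long-running du
// processes we see while the cluster is hung.
// The paths below are placeholders for our actual dfs.data.dir values.
import java.util.Arrays;
import java.util.List;

public class DuTimer {
    public static void main(String[] args) throws Exception {
        List<String> dataDirs = Arrays.asList("/data/1/dfs/data", "/data/2/dfs/data");
        for (String dir : dataDirs) {
            long start = System.currentTimeMillis();
            Process p = new ProcessBuilder("du", "-sk", dir)
                    .redirectErrorStream(true)
                    .start();
            // Drain stdout so du cannot block on a full pipe buffer.
            byte[] buf = new byte[4096];
            while (p.getInputStream().read(buf) != -1) { /* discard */ }
            int exit = p.waitFor();
            long elapsedMs = System.currentTimeMillis() - start;
            System.out.println(dir + ": exit=" + exit + ", " + elapsedMs + " ms");
        }
    }
}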

Thanks,

Chris

On Wed, Mar 16, 2011 at 9:41 AM, Chris Curtin <curtin.ch...@gmail.com> wrote:

> Caught something today I missed before:
>
> 11/03/16 09:32:49 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink 10.120.41.105:50010
> 11/03/16 09:32:49 INFO hdfs.DFSClient: Abandoning block
> blk_-517003810449127046_10039793
> 11/03/16 09:32:49 INFO hdfs.DFSClient: Waiting to find target node:
> 10.120.41.103:50010
> 11/03/16 09:34:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/10.120.41.85:34323 remote=/10.120.41.105:50010]
> 11/03/16 09:34:04 INFO hdfs.DFSClient: Abandoning block
> blk_2153189599588075377_10039793
> 11/03/16 09:34:04 INFO hdfs.DFSClient: Waiting to find target node:
> 10.120.41.105:50010
> 11/03/16 09:34:55 INFO hdfs.DFSClient: Could not complete file
> /tmp/hadoop/mapred/system/job_201103160851_0014/job.jar retrying...
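>
> If I'm reading DFSClient right, the 69000 ms here looks like the 60-second
> dfs.socket.timeout plus a small per-datanode extension for the three-node
> pipeline, so it may point at a slow or overloaded datanode rather than a
> dead one. As an experiment (not a fix), a client-side sketch like the one
> below should lengthen those timeouts; the values are arbitrary and only
> meant to show which keys are involved.
>
> // Sketch only: lengthen the DFS client socket timeouts to test whether
> // the "Bad connect ack" / SocketTimeoutException errors come from slow
> // datanodes rather than dead ones. Timeout values are arbitrary examples.
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
>
> public class SlowPipelineTest {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration(); // picks up *-site.xml from the classpath
>         conf.setInt("dfs.socket.timeout", 180000);                // read side, default 60000 ms
>         conf.setInt("dfs.datanode.socket.write.timeout", 600000); // write side
>         FileSystem fs = FileSystem.get(conf);
>         System.out.println("Connected to " + fs.getUri() + " with longer socket timeouts");
>         fs.close();
>     }
> }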
>
>
>
> On Wed, Mar 16, 2011 at 9:00 AM, Chris Curtin <curtin.ch...@gmail.com> wrote:
>
>> Thanks. I spent a lot of time looking at the logs and found nothing on the
>> reducers until they start complaining about 'could not complete'.
>>
>> Found this in the jobtracker log file:
>>
>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient:
>> DFSOutputStream ResponseProcessor exception  for block
>> blk_3829493505250917008_9959810java.io.IOException: Bad response 1 for block
>> blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
>>         at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_3829493505250917008_9959810 bad datanode[2]
>> 10.120.41.103:50010
>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_3829493505250917008_9959810 in pipeline
>> 10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad
>> datanode 10.120.41.103:50010
>> 2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> complete file
>> /var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T
>> retrying...
>>
>> Looking at the logs from the various times this happens, the 'from
>> datanode' in the first message points to a different data node each time
>> (the failures are split roughly equally among them), so I don't think one
>> specific node is having problems.
>> Any other ideas?
>>
>> Thanks,
>>
>> Chris
>> On Sun, Mar 13, 2011 at 3:45 AM, icebergs <hkm...@gmail.com> wrote:
>>
>>> You should check the bad reducers' logs carefully. There may be more
>>> information about the failure there.
>>>
>>> 2011/3/10 Chris Curtin <curtin.ch...@gmail.com>
>>>
>>> > Hi,
>>> >
>>> > Over the last couple of days we have been seeing tens of thousands of
>>> > these errors in the logs:
>>> >
>>> >  INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
>>> >
>>> >
>>> /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003
>>> > retrying...
>>> > When this is going on, the reducer in question is always the last
>>> > reducer in the job.
>>> >
>>> > Sometimes the reducer recovers. Sometimes Hadoop kills that reducer,
>>> > runs another, and it succeeds. Sometimes Hadoop kills the reducer and
>>> > the new one also fails, so it gets killed too and the cluster goes into
>>> > a kill/launch/kill loop.
>>> >
>>> > At first we thought it was related to the size of the data being
>>> > evaluated (4+ GB), but we've seen it several times today on less than
>>> > 100 MB.
>>> >
>>> > Searching the list archives and online doesn't turn up much about what
>>> > this error means or how to fix it.
>>> >
>>> > We are running Hadoop 0.20.2, r911707.
>>> >
>>> > Any suggestions?
>>> >
>>> >
>>> > Thanks,
>>> >
>>> > Chris
>>> >
>>>
>>
>>
>
