Hi Chris,

You may be hitting HDFS-2379.

Can you grep your DN logs for the string "BlockReport" and see if you
see any taking more than 30000ms or so?
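
For example, something along these lines should surface the slow reports (the
log path and the exact message wording vary by install, so treat this as a
sketch rather than the exact command):

  grep "BlockReport" /var/log/hadoop/hadoop-*-datanode-*.log | egrep "[0-9]{5,} msec"

Anything that pattern catches took 10 seconds or more; reports up in the
30000ms range are the symptom I'm talking about.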

-Todd

On Fri, Oct 7, 2011 at 6:31 AM, Chris Curtin <curtin.ch...@gmail.com> wrote:
> Sorry to bring this back from the dead, but we're having the issues again.
>
> This is on a NEW cluster, using Cloudera 0.20.2-cdh3u0 (old was stock Apache
> 0.20.2). Nothing carried over from the old cluster except data in HDFS
> (copied from old cluster). Bigger/more machines, more RAM, faster disks etc.
> And it is back.
>
> Confirmed that all the disks set up for HDFS are using 'deadline'.
>
> Runs fine for a few days, then hangs again with the 'Could not complete' error
> in the JobTracker log until we kill the cluster.
>
> 2011-09-09 08:04:32,429 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> complete file
> /log/hadoop/tmp/flow_BYVMTA_family_BYVMTA_72751_8284775/_logs/history/10.120.55.2_1311201333949_job_201107201835_13900_deliv_flow_BYVMTA%2Bflow_BYVMTA*family_B%5B%284%2F5%29+...UNCED%27%2C+
> retrying...
>
> Found HDFS-148 (https://issues.apache.org/jira/browse/HDFS-148) which looks
> like what could be happening to us. Anyone found a good workaround?
> Any other ideas?
>
> Also, does the HDFS system try to do 'du' on disks not assigned to it? The
> HDFS disks are separate from the root and OS disks. Those disks are NOT
> set to 'deadline'. Should that matter?
>
> Thanks,
>
> Chris
>
> On Tue, Mar 29, 2011 at 7:53 PM, Brian Bockelman <bbock...@cse.unl.edu>wrote:
>
>> Hi Chris,
>>
>> One thing we've found helpful on ext3 is examining your I/O scheduler.
>> Make sure it's set to "deadline", not "CFQ". This will help prevent nodes
>> from being overloaded; when "du -sk" runs on a node that is already
>> overloaded, things quickly roll downhill.
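>>
>> If you want to double-check or change it, something like this works on most
>> Linux distros (the device names below are only placeholders; cover every
>> disk that backs a dfs.data.dir directory):
>>
>>   cat /sys/block/sdb/queue/scheduler                # active scheduler shown in brackets
>>   echo deadline > /sys/block/sdb/queue/scheduler    # as root; repeat per data disk
>>
>> Note the setting does not survive a reboot, so add elevator=deadline to the
>> kernel boot line or an rc.local entry if you want it to stick.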
>>
>> Brian
>>
>> On Mar 29, 2011, at 11:44 AM, Chris Curtin wrote:
>>
>> > We are narrowing this down. The last few times it hung we found a 'du -sk'
>> > process for each of our HDFS disks as the top users of CPU. They are also
>> > taking a really long time.
>> >
>> > Searching around I find one example of someone reporting a similar issue
>> > with du -sk, but they tied it to XFS. We are using Ext3.
>> >
>> > Anyone have any other ideas, since it appears to be related to the 'du'
>> > not coming back? Note that running the command directly finishes in a few
>> > seconds.
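>> >
>> > By "directly" I mean timing it by hand against each of the dfs.data.dir
>> > directories, roughly like this (the paths are just examples of our layout):
>> >
>> >   for d in /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn; do
>> >     echo "$d"; time du -sk "$d"
>> >   done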
>> >
>> > Thanks,
>> >
>> > Chris
>> >
>> > On Wed, Mar 16, 2011 at 9:41 AM, Chris Curtin <curtin.ch...@gmail.com
>> >wrote:
>> >
>> >> Caught something today I missed before:
>> >>
>> >> 11/03/16 09:32:49 INFO hdfs.DFSClient: Exception in
>> createBlockOutputStream
>> >> java.io.IOException: Bad connect ack with firstBadLink
>> 10.120.41.105:50010
>> >> 11/03/16 09:32:49 INFO hdfs.DFSClient: Abandoning block
>> >> blk_-517003810449127046_10039793
>> >> 11/03/16 09:32:49 INFO hdfs.DFSClient: Waiting to find target node:
>> >> 10.120.41.103:50010
>> >> 11/03/16 09:34:04 INFO hdfs.DFSClient: Exception in
>> createBlockOutputStream
>> >> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
>> >> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected
>> >> local=/10.120.41.85:34323 remote=/10.120.41.105:50010]
>> >> 11/03/16 09:34:04 INFO hdfs.DFSClient: Abandoning block
>> >> blk_2153189599588075377_10039793
>> >> 11/03/16 09:34:04 INFO hdfs.DFSClient: Waiting to find target node:
>> >> 10.120.41.105:50010
>> >> 11/03/16 09:34:55 INFO hdfs.DFSClient: Could not complete file
>> >> /tmp/hadoop/mapred/system/job_201103160851_0014/job.jar retrying...
>> >>
>> >>
>> >>
>> >> On Wed, Mar 16, 2011 at 9:00 AM, Chris Curtin <curtin.ch...@gmail.com
>> >wrote:
>> >>
>> >>> Thanks. Spent a lot of time looking at the logs and found nothing on the
>> >>> reducers until they start complaining about 'could not complete'.
>> >>>
>> >>> Found this in the jobtracker log file:
>> >>>
>> >>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient:
>> >>> DFSOutputStream ResponseProcessor exception  for block
>> >>> blk_3829493505250917008_9959810java.io.IOException: Bad response 1 for
>> block
>> >>> blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
>> >>>        at
>> >>>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
>> >>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> >>> Recovery for block blk_3829493505250917008_9959810 bad datanode[2]
>> >>> 10.120.41.103:50010
>> >>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> >>> Recovery for block blk_3829493505250917008_9959810 in pipeline
>> >>> 10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad
>> >>> datanode 10.120.41.103:50010
>> >>> 2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could
>> not
>> >>> complete file
>> >>>
>> /var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T
>> >>> retrying...
>> >>>
>> >>> Looking at the logs from the various times this happens, the 'from
>> >>> datanode' in the first message is any of the data nodes (roughly equal in
>> >>> # of times it fails), so I don't think it is one specific node having
>> >>> problems.
>> >>> Any other ideas?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Chris
>> >>>  On Sun, Mar 13, 2011 at 3:45 AM, icebergs <hkm...@gmail.com> wrote:
>> >>>
>> >>>> You should check the bad reducers' logs carefully. There may be more
>> >>>> information about it.
>> >>>>
>> >>>> 2011/3/10 Chris Curtin <curtin.ch...@gmail.com>
>> >>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> The last couple of days we have been seeing tens of thousands of these
>> >>>>> errors in the logs:
>> >>>>>
>> >>>>> INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
>> >>>>>
>> >>>>>
>> >>>>
>> /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003
>> >>>>> retrying...
>> >>>>> When this is going on, the reducer in question is always the last
>> >>>>> reducer in a job.
>> >>>>>
>> >>>>> Sometimes the reducer recovers. Sometimes hadoop kills that reducer, runs
>> >>>>> another and it succeeds. Sometimes hadoop kills the reducer and the new
>> >>>>> one also fails, so it gets killed and the cluster goes into a loop of
>> >>>>> kill/launch/kill.
>> >>>>>
>> >>>>> At first we thought it was related to the size of the data being
>> >>>>> evaluated (4+GB), but we've seen it several times today on < 100 MB.
>> >>>>>
>> >>>>> Searching here or online doesn't show a lot about what this error means
>> >>>>> and how to fix it.
>> >>>>>
>> >>>>> We are running 0.20.2, r911707
>> >>>>>
>> >>>>> Any suggestions?
>> >>>>>
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> Chris
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>
>>
>>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
