Hi Chris,

You may be hitting HDFS-2379.

Can you grep your DN logs for the string "BlockReport" and see if you
see any taking more than 30000ms or so?
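Something like this will surface the slow ones (a rough sketch -- it
assumes the 0.20-era datanode line "BlockReport of N blocks got
processed in M msecs"; adjust the regex to whatever your lines
actually look like):

    # slow_reports.py -- print DN block reports over a threshold.
    # Assumes a log line like:
    #   "BlockReport of 120000 blocks got processed in 45123 msecs"
    # Adjust the pattern if your log format differs.
    import re
    import sys

    THRESHOLD_MS = 30000
    pat = re.compile(r'BlockReport of \d+ blocks got processed in (\d+) msecs')

    for line in sys.stdin:
        m = pat.search(line)
        if m and int(m.group(1)) >= THRESHOLD_MS:
            sys.stdout.write(line)

e.g. cat hadoop-*-datanode-*.log | python slow_reports.py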
-Todd

On Fri, Oct 7, 2011 at 6:31 AM, Chris Curtin <curtin.ch...@gmail.com> wrote:

> Sorry to bring this back from the dead, but we're having the issues
> again.
>
> This is on a NEW cluster, running Cloudera 0.20.2-cdh3u0 (the old one
> was stock Apache 0.20.2). Nothing carried over from the old cluster
> except the data in HDFS (copied from the old cluster). Bigger/more
> machines, more RAM, faster disks, etc. And it is back.
>
> Confirmed that all the disks set up for HDFS are using 'deadline'.
>
> It runs fine for a few days, then hangs again with the 'Could not
> complete' error in the JobTracker log until we kill the cluster.
>
> 2011-09-09 08:04:32,429 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> complete file
> /log/hadoop/tmp/flow_BYVMTA_family_BYVMTA_72751_8284775/_logs/history/10.120.55.2_1311201333949_job_201107201835_13900_deliv_flow_BYVMTA%2Bflow_BYVMTA*family_B%5B%284%2F5%29+...UNCED%27%2C+
> retrying...
>
> Found HDFS-148 (https://issues.apache.org/jira/browse/HDFS-148), which
> looks like what could be happening to us. Has anyone found a good
> workaround? Any other ideas?
>
> Also, does HDFS try to run 'du' on disks not assigned to it? The HDFS
> disks are separate from the root and OS disks, and those disks are NOT
> set up to use 'deadline'. Should that matter?
>
> Thanks,
>
> Chris
>
> On Tue, Mar 29, 2011 at 7:53 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>
>> Hi Chris,
>>
>> One thing we've found that helps with ext3 is examining your I/O
>> scheduler. Make sure it's set to "deadline", not "CFQ". This helps
>> keep nodes from being overloaded; when a "du -sk" runs while the node
>> is already overloaded, things quickly roll downhill.
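>>
>> A quick way to check every device at once (a minimal sketch, assuming
>> the usual Linux sysfs layout -- the active scheduler is the bracketed
>> entry in each file):
>>
>>     # Print the active I/O scheduler for each block device.
>>     # Each sysfs file reads e.g. "noop anticipatory deadline [cfq]",
>>     # with the active scheduler shown in brackets.
>>     import glob
>>
>>     for path in glob.glob('/sys/block/*/queue/scheduler'):
>>         dev = path.split('/')[3]
>>         with open(path) as f:
>>             print("%s: %s" % (dev, f.read().strip()))
>>
>> You can switch a disk on the fly with
>> "echo deadline > /sys/block/<dev>/queue/scheduler".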
>>
>> Brian
>>
>> On Mar 29, 2011, at 11:44 AM, Chris Curtin wrote:
>>
>>> We are narrowing this down. The last few times it hung, we found a
>>> 'du -sk' process for each of our HDFS disks among the top users of
>>> CPU. They were also taking a really long time.
>>>
>>> Searching around, I found one report of a similar issue with 'du
>>> -sk', but it was tied to XFS. We are using ext3.
>>>
>>> Anyone have any other ideas, since it appears to be related to the
>>> 'du' not coming back? Note that running the command directly
>>> finishes in a few seconds.
>>>
>>> Thanks,
>>>
>>> Chris
>>>
>>> On Wed, Mar 16, 2011 at 9:41 AM, Chris Curtin <curtin.ch...@gmail.com> wrote:
>>>
>>>> Caught something today I missed before:
>>>>
>>>> 11/03/16 09:32:49 INFO hdfs.DFSClient: Exception in createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink 10.120.41.105:50010
>>>> 11/03/16 09:32:49 INFO hdfs.DFSClient: Abandoning block blk_-517003810449127046_10039793
>>>> 11/03/16 09:32:49 INFO hdfs.DFSClient: Waiting to find target node: 10.120.41.103:50010
>>>> 11/03/16 09:34:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream
>>>> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
>>>> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
>>>> local=/10.120.41.85:34323 remote=/10.120.41.105:50010]
>>>> 11/03/16 09:34:04 INFO hdfs.DFSClient: Abandoning block blk_2153189599588075377_10039793
>>>> 11/03/16 09:34:04 INFO hdfs.DFSClient: Waiting to find target node: 10.120.41.105:50010
>>>> 11/03/16 09:34:55 INFO hdfs.DFSClient: Could not complete file
>>>> /tmp/hadoop/mapred/system/job_201103160851_0014/job.jar retrying...
>>>>
>>>> On Wed, Mar 16, 2011 at 9:00 AM, Chris Curtin <curtin.ch...@gmail.com> wrote:
>>>>
>>>>> Thanks. I spent a lot of time looking at the logs and saw nothing on
>>>>> the reducers until they start complaining about 'could not complete'.
>>>>>
>>>>> Found this in the jobtracker log file:
>>>>>
>>>>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient:
>>>>> DFSOutputStream ResponseProcessor exception for block
>>>>> blk_3829493505250917008_9959810 java.io.IOException: Bad response 1 for
>>>>> block blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
>>>>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>>> Recovery for block blk_3829493505250917008_9959810 bad datanode[2]
>>>>> 10.120.41.103:50010
>>>>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>>> Recovery for block blk_3829493505250917008_9959810 in pipeline
>>>>> 10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad
>>>>> datanode 10.120.41.103:50010
>>>>> 2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>>>> complete file
>>>>> /var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T
>>>>> retrying...
>>>>>
>>>>> Looking at the logs from the various times this happens, the 'from
>>>>> datanode' in the first message can be any of the data nodes (they fail
>>>>> roughly equally often), so I don't think it is one specific node
>>>>> having problems.
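>>>>>
>>>>> Something like this makes the tally quick (a rough sketch -- it just
>>>>> counts the address after "bad datanode" in the DFSClient
>>>>> error-recovery lines; the pattern is a guess at the exact wording,
>>>>> adjust as needed):
>>>>>
>>>>>     # Count how often each datanode is flagged "bad datanode" in
>>>>>     # DFSClient error-recovery lines, to see if one node dominates.
>>>>>     import re
>>>>>     import sys
>>>>>     from collections import defaultdict
>>>>>
>>>>>     counts = defaultdict(int)
>>>>>     # Matches both "bad datanode[2] 10.0.0.1:50010" and
>>>>>     # "bad datanode 10.0.0.1:50010".
>>>>>     pat = re.compile(r'bad datanode(?:\[\d+\])?\s+([\d.]+:\d+)')
>>>>>
>>>>>     for line in sys.stdin:
>>>>>         m = pat.search(line)
>>>>>         if m:
>>>>>             counts[m.group(1)] += 1
>>>>>
>>>>>     for node, n in sorted(counts.items(), key=lambda kv: -kv[1]):
>>>>>         print("%6d  %s" % (n, node))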
>>>>>
>>>>> Any other ideas?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Chris
>>>>>
>>>>> On Sun, Mar 13, 2011 at 3:45 AM, icebergs <hkm...@gmail.com> wrote:
>>>>>
>>>>>> You should check the bad reducers' logs carefully. There may be
>>>>>> more information about it there.
>>>>>>
>>>>>> 2011/3/10 Chris Curtin <curtin.ch...@gmail.com>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The last couple of days we have been seeing tens of thousands of
>>>>>>> these errors in the logs:
>>>>>>>
>>>>>>> INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
>>>>>>> /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003
>>>>>>> retrying...
>>>>>>>
>>>>>>> When this is going on, the reducer in question is always the last
>>>>>>> reducer in a job.
>>>>>>>
>>>>>>> Sometimes the reducer recovers. Sometimes Hadoop kills that
>>>>>>> reducer, runs another, and it succeeds. Sometimes Hadoop kills the
>>>>>>> reducer and the new one also fails, so it gets killed too and the
>>>>>>> cluster goes into a kill/launch/kill loop.
>>>>>>>
>>>>>>> At first we thought it was related to the size of the data being
>>>>>>> evaluated (4+ GB), but we've seen it several times today on < 100 MB.
>>>>>>>
>>>>>>> Searching here or online doesn't show a lot about what this error
>>>>>>> means or how to fix it.
>>>>>>>
>>>>>>> We are running 0.20.2, r911707.
>>>>>>>
>>>>>>> Any suggestions?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chris

--
Todd Lipcon
Software Engineer, Cloudera