What's the block size? Also, are you experiencing any slowness in the network? I am guessing you are using EC2;
these issues normally come with network problems.

On Mon, May 28, 2012 at 3:57 PM, akshaymb <akshaybhara...@gmail.com> wrote:
>
> Hi,
>
> We are frequently observing the exception
>   java.io.IOException: DFSClient_attempt_201205232329_28133_r_000002_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_000002_0/part-r-00002. Giving up.
> on our cluster. The exception occurs while writing a file. We are using Hadoop 0.20.2. It is a ~250-node cluster, and on average one box goes down every 3 days.
>
> Detailed stack trace:
>
> 12/05/27 23:26:54 INFO mapred.JobClient: Task Id : attempt_201205232329_28133_r_000002_0, Status : FAILED
> java.io.IOException: DFSClient_attempt_201205232329_28133_r_000002_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_000002_0/part-r-00002. Giving up.
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3331)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3240)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
>         at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
>         at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Our investigation:
> We have the minimum replication factor set to 2. As mentioned at http://kazman.shidler.hawaii.edu/ArchDocDecomposition.html, "A call to complete() will not return true until all the file's blocks have been replicated the minimum number of times. Thus, DataNode failures may cause a client to call complete() several times before succeeding", so the client should retry complete() several times.
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() does call complete() and retries it 20 times, but despite that, the file's blocks are not replicated the minimum number of times. The retry count is not configurable. Changing the minimum replication factor to 1 is also not a good idea, since jobs run on our cluster continuously.
>
> Do we have any solution or workaround for this problem?
>
> What minimum replication factor is generally used in industry?
>
> Let me know if any further inputs are required.
>
> Thanks,
> -Akshay
>
> --
> View this message in context: http://old.nabble.com/Help-with-DFSClient-Exception.-tp33918949p33918949.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
Nitin Pawar
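
A possible client-side workaround, sketched below on the assumption that you are writing to HDFS directly rather than through TextOutputFormat: wrap the close() call in your own retry loop, so a transient shortage of replicas gets more chances beyond DFSClient's fixed 20 internal complete() attempts. The helper name closeWithRetries and the retry/sleep values are illustrative assumptions, not Hadoop APIs, and on some Hadoop versions a second close() on a half-closed stream may itself fail, so verify this against your release:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RetryingClose {

        // Hypothetical helper: retry close() a few times before giving up.
        // MAX_RETRIES and SLEEP_MS are illustrative values, not Hadoop settings.
        static void closeWithRetries(FSDataOutputStream out) throws IOException {
            final int MAX_RETRIES = 5;
            final long SLEEP_MS = 2000L;
            IOException last = null;
            for (int i = 0; i < MAX_RETRIES; i++) {
                try {
                    out.close();  // close() drives the complete() call to the NameNode
                    return;
                } catch (IOException e) {
                    last = e;     // e.g. "could not complete file ... Giving up."
                    try {
                        Thread.sleep(SLEEP_MS);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw last;
                    }
                }
            }
            throw last;
        }

        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"));
            out.writeBytes("hello\n");
            closeWithRetries(out);
            fs.close();
        }
    }

For the reducer path shown in the stack trace above this option is not available, since TextOutputFormat owns the stream; there the practical levers are the ones already discussed in the thread: keeping dfs.replication.min at 1 (which the poster has ruled out) or fixing the underlying DataNode instability so complete() can find enough replicas within the existing retries.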