What's the block size?
Also, are you experiencing any slowness in the network?

I am guessing you are using EC2.

These issues normally come with network problems.
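
If you are not sure what the block size is set to, something like this will
print it (just a sketch, assuming the 0.20-era dfs.block.size key and that
your hdfs-site.xml is on the classpath):

    import org.apache.hadoop.conf.Configuration;

    public class PrintBlockSize {
        public static void main(String[] args) {
            // Loads core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            // dfs.block.size is the pre-0.21 key; 64 MB is the stock default
            long blockSize = conf.getLong("dfs.block.size", 64L * 1024 * 1024);
            System.out.println("dfs.block.size = " + blockSize + " bytes");
        }
    }

You can also run "hadoop fsck <path> -files -blocks" on one of the output
files to see the actual blocks on the cluster.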

On Mon, May 28, 2012 at 3:57 PM, akshaymb <akshaybhara...@gmail.com> wrote:

>
> Hi,
>
> We are frequently observing the exception
> java.io.IOException: DFSClient_attempt_201205232329_28133_r_000002_0 could
> not complete file
>
> /output/tmp/test/_temporary/_attempt_201205232329_28133_r_000002_0/part-r-00002.
> Giving up.
> on our cluster.  The exception occurs while writing a file.  We are using
> Hadoop 0.20.2.  It is a ~250-node cluster, and on average one box goes down
> every 3 days.
>
> Detailed stack trace:
> 12/05/27 23:26:54 INFO mapred.JobClient: Task Id : attempt_201205232329_28133_r_000002_0, Status : FAILED
> java.io.IOException: DFSClient_attempt_201205232329_28133_r_000002_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_000002_0/part-r-00002. Giving up.
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3331)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3240)
>        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
>        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
>        at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
>        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Our investigation:
> We have the min replication factor set to 2.  As mentioned here
> (http://kazman.shidler.hawaii.edu/ArchDocDecomposition.html), "A call to
> complete() will not return true until all the file's blocks have been
> replicated the minimum number of times.  Thus, DataNode failures may cause a
> client to call complete() several times before succeeding", so the client
> should retry complete() several times.
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() does call
> complete() and retries it 20 times, but in spite of that the file's blocks
> are not replicated the minimum number of times.  The retry count is not
> configurable.  Changing the min replication factor to 1 is also not a good
> idea, since jobs are continuously running on our cluster.
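>
> Roughly, the retry looks like the sketch below (a paraphrase based on our
> reading of the behaviour, not the exact 0.20.2 source; the sleep interval
> is illustrative):
>
>     import java.io.IOException;
>     import org.apache.hadoop.hdfs.protocol.ClientProtocol;
>
>     class CompleteRetrySketch {
>         static void completeWithRetries(ClientProtocol namenode, String src,
>                 String clientName) throws IOException, InterruptedException {
>             int retriesLeft = 20;   // hard-coded limit, not configurable
>             boolean fileComplete = namenode.complete(src, clientName);
>             while (!fileComplete) {
>                 if (--retriesLeft == 0) {
>                     throw new IOException("Could not complete file " + src
>                             + ". Giving up.");
>                 }
>                 Thread.sleep(400);  // short pause before asking the NameNode again
>                 fileComplete = namenode.complete(src, clientName);
>             }
>         }
>     }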
>
> Do we have any solution / workaround for this problem?
>
> What min replication factor is generally used in industry?
>
> Let me know if any further input is required.
>
> Thanks,
> -Akshay
>
>
>
>
>


-- 
Nitin Pawar
