Hi Harsh,

Thanks for the input.

Yes, when I went through the Hadoop-0.20.1 and Hadoop-0.21.0 code I got the same impression. But since there are a lot of changes in 0.21, I thought I would still use 0.20.1.

To use the splittable feature of bzip2, I tried extending FileInputFormat. It appeared to work fine for 500 MB bzip2 files, but not for ~2 GB bzip2 files, where the block size is 64 MB. I think there are a few more dependencies which I have not modified.

When it fails, it does not actually say the task has failed; instead the reducer keeps retrying again and again. During the retries it actually fails in the shuffle phase. Message pasted below:

INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_201104102321_0019_m_000022_0
java.io.IOException: Premature EOF
    at sun.net.www.http.ChunkedInputStream.readAheadBlocking(ChunkedInputStream.java:538)
    at sun.net.www.http.ChunkedInputStream.readAhead(ChunkedInputStream.java:582)
    at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:669)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2446)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleToDisk(ReduceTask.java:1624)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1416)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

Does anybody have a patch to apply to Hadoop-0.20.1 to support splittable bzip2? It would be really helpful.

Thanks & regards,
- Deepak Diwakar

On 20 April 2011 00:37, Harsh J <ha...@cloudera.com> wrote:
> Hello Deepak,
>
> On Tue, Apr 19, 2011 at 9:33 PM, Deepak Diwakar <ddeepa...@gmail.com> wrote:
> > Hi,
> >
> > I am using hadoop-0.20.1
> > But when I use my own InputFormat say SafeInputFormat( extends
> > FileInputFormat ) and allow isSplitable true. It executes multiple mappers,
> > but fails when reducers reaches 33% for the large size(of order of 2 GB) of
> > bzip2 files.
>
> BZip2 splitting support was added to Apache Hadoop 0.21.0 release, and
> isn't available in the Apache Hadoop 0.20.x. Was the 0.20.1 version a
> typo?
> Also, what reason/trace does the reducer throw up when it fails?
>
> --
> Harsh J
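
A note on why returning true from isSplitable() is probably not enough by itself: a bzip2 stream can only be decoded starting at a block boundary, and blocks begin at arbitrary bit offsets, not byte offsets. A reader handed a split in the middle of the file therefore has to scan forward for the 48-bit block marker 0x314159265359 before it can decompress anything. The sketch below only illustrates that scan in plain Java. The class and method names (Bzip2BlockScan, findBlockMagic, putBits) are hypothetical and are not taken from Hadoop 0.20.1 or 0.21.0.

```java
public class Bzip2BlockScan {
    // 48-bit marker that starts every bzip2 compressed block.
    public static final long BLOCK_MAGIC = 0x314159265359L;
    private static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /**
     * Returns the bit offset of the first block marker that starts at or
     * after fromBit, or -1 if none is found. Bits are read most significant
     * bit first within each byte, which is how bzip2 packs them.
     */
    public static long findBlockMagic(byte[] data, long fromBit) {
        long totalBits = (long) data.length * 8;
        long window = 0; // sliding window over the last 48 bits read
        int seen = 0;    // number of bits read so far
        for (long bit = fromBit; bit < totalBits; bit++) {
            int b = (data[(int) (bit >>> 3)] >>> (7 - (int) (bit & 7))) & 1;
            window = ((window << 1) | b) & MASK_48;
            seen++;
            if (seen >= 48 && window == BLOCK_MAGIC) {
                return bit - 47; // offset of the marker's first bit
            }
        }
        return -1;
    }

    /** Demo helper (hypothetical): writes the low nbits of value at bitOffset, MSB first. */
    public static void putBits(byte[] data, long bitOffset, long value, int nbits) {
        for (int i = 0; i < nbits; i++) {
            long pos = bitOffset + i;
            if (((value >>> (nbits - 1 - i)) & 1L) == 1L) {
                data[(int) (pos >>> 3)] |= (byte) (0x80 >>> (int) (pos & 7));
            }
        }
    }

    public static void main(String[] args) {
        // Fake "file": one block marker placed at a non-byte-aligned offset.
        byte[] data = new byte[16];
        putBits(data, 13, BLOCK_MAGIC, 48);
        // Pretend a split starts at byte 1 (bit 8) and look for the next block.
        long start = findBlockMagic(data, 8);
        System.out.println("next block starts at bit " + start);
    }
}
```

A record reader for a split would do something like this at the start of its split, decode whole blocks from there, and stop at the first block that begins past the end of the split. The codec and record reader need to cooperate on this, which may be why overriding isSplitable() alone on 0.20.1 does not give correct results.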