Hi all, I think I solved the problem already.
The default OutputFormat used by Hadoop calls toString() on each value to produce its output. Appending a large amount of data into a single String is very expensive, which explains why Hadoop takes forever to write the output of a large array of values. To implement a faster array writer, modify the writeObject method of the TextOutputFormat class and take advantage of the String[] toStrings() method of the ArrayWritable class, writing each element directly to disk as follows:

********* original **********

private void writeObject(Object o) throws IOException {
  if (o instanceof Text) {
    Text to = (Text) o;
    out.write(to.getBytes(), 0, to.getLength());
  } else {
    out.write(o.toString().getBytes(utf8));
  }
}

********* Modified **********

private void writeObject(Object o) throws IOException {
  if (o instanceof Text) {
    Text to = (Text) o;
    out.write(to.getBytes(), 0, to.getLength());
  } else if (o instanceof VLongArrayWritable) {
    // Write each element directly instead of building one huge String.
    String[] myOut = ((VLongArrayWritable) o).toStrings();
    for (int i = 0; i < myOut.length; i++) {
      out.write((myOut[i] + " ").getBytes(utf8));
    }
  } else if (o instanceof LongArrayWritable) {
    String[] myOut = ((LongArrayWritable) o).toStrings();
    for (int i = 0; i < myOut.length; i++) {
      out.write((myOut[i] + " ").getBytes(utf8));
    }
  } else {
    out.write(o.toString().getBytes(utf8));
  }
}

Regards,
Zuhair Khayyat

On Sat, May 5, 2012 at 5:55 PM, Zuhair Khayyat <zuhair.khay...@kaust.edu.sa> wrote:

> Thanks for the fast response. I think it is a good idea; however, the
> application becomes too slow with large output arrays. I would be more
> interested in a solution that helps speed up the "context.write()" itself.
>
>
> On Sat, May 5, 2012 at 5:36 PM, Zizon Qiu <zzd...@gmail.com> wrote:
>
>> For the timeout problem, you can use a background thread that invokes
>> context.progress() periodically, which acts as a "keep-alive" for the forked
>> Child (mapper/combiner/reducer)...
>> It is tricky but it works.
>>
>>
>> On Sat, May 5, 2012 at 10:05 PM, Zuhair Khayyat <
>> zuhair.khay...@kaust.edu.sa> wrote:
>>
>>> Hi,
>>>
>>> I am building a MapReduce application that constructs the adjacency list
>>> of a graph from an input edge list. I noticed that my Reduce phase always
>>> hangs (and eventually times out) when it calls the function
>>> context.write(Key_x, Value_x) and Value_x is a very large ArrayWritable
>>> (around 4M elements). I have increased both "mapred.task.timeout" and the
>>> reducers' memory but no luck; the reducer does not finish the job. Is there
>>> any other data format that supports a large amount of data, or should I use
>>> my own "OutputFormat" class to optimize writing the large amount of data?
>>>
>>>
>>> Thank you.
>>> Zuhair Khayyat
>>>
>>
>>
>
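Note: the VLongArrayWritable and LongArrayWritable types used in the modified writeObject above are not shipped with Hadoop; they are assumed to be custom ArrayWritable subclasses. A minimal sketch of what such a class might look like, assuming LongWritable elements (the VLong variant would use VLongWritable instead):

********* LongArrayWritable (sketch) *********

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.LongWritable;

// Hypothetical custom array type. The subclass is only needed so that
// ArrayWritable knows the element class when deserializing, and so that
// toStrings() yields one String per element.
public class LongArrayWritable extends ArrayWritable {

  public LongArrayWritable() {
    super(LongWritable.class);              // element type for deserialization
  }

  public LongArrayWritable(LongWritable[] values) {
    super(LongWritable.class, values);      // wrap an existing element array
  }
}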
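For the timeout side of the problem, the keep-alive workaround Zizon suggested in the quoted thread (which, as noted there, does not speed up the write itself, only prevents the task from being killed) could be sketched roughly as below. This is only an illustration, assuming the new org.apache.hadoop.mapreduce API and the hypothetical LongArrayWritable above; the reducer name is made up:

********* keep-alive reducer (sketch) *********

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer showing the keep-alive trick: a background thread
// calls context.progress() so the task is not killed by mapred.task.timeout
// while a very long context.write() is running.
public class AdjacencyListReducer
    extends Reducer<LongWritable, LongWritable, LongWritable, LongArrayWritable> {

  @Override
  protected void reduce(LongWritable key, Iterable<LongWritable> values,
                        final Context context)
      throws IOException, InterruptedException {
    // Collect the neighbour ids; Hadoop reuses the value object, so copy it.
    List<LongWritable> neighbours = new ArrayList<LongWritable>();
    for (LongWritable v : values) {
      neighbours.add(new LongWritable(v.get()));
    }
    LongArrayWritable adjacency =
        new LongArrayWritable(neighbours.toArray(new LongWritable[neighbours.size()]));

    // Keep-alive thread: report progress every 30 seconds (well below the timeout).
    Thread keepAlive = new Thread(new Runnable() {
      public void run() {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            context.progress();       // tell the framework the task is still alive
            Thread.sleep(30 * 1000);
          }
        } catch (InterruptedException e) {
          // the write finished; stop reporting
        }
      }
    });
    keepAlive.setDaemon(true);
    keepAlive.start();
    try {
      context.write(key, adjacency);  // the slow write of the large array
    } finally {
      keepAlive.interrupt();          // always stop the keep-alive thread
    }
  }
}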