Hi all,

I think I solved the problem already.

The default OutputFormat used by Hadoop calls "toString()" on each value to
produce its output. Appending a large amount of data into a single "String"
is very expensive, which explains why Hadoop takes forever to write the
output of a large array of values. To implement a faster array writer,
modify the "writeObject" method of the "TextOutputFormat" class and take
advantage of the "toStrings()" method of the "ArrayWritable" class (which
returns a String[]), writing each element directly to disk as follows:

********* Original *********
private void writeObject(Object o) throws IOException {
  if (o instanceof Text) {
    Text to = (Text) o;
    out.write(to.getBytes(), 0, to.getLength());
  } else {
    out.write(o.toString().getBytes(utf8));
  }
}

********* Modified *********
private void writeObject(Object o) throws IOException {
  if (o instanceof Text) {
    Text to = (Text) o;
    out.write(to.getBytes(), 0, to.getLength());
  } else if (o instanceof VLongArrayWritable) {
    // Write each element straight to the stream instead of building one huge String.
    String[] myOut = ((VLongArrayWritable) o).toStrings();
    for (int i = 0; i < myOut.length; i++) {
      out.write((myOut[i] + " ").getBytes(utf8));
    }
  } else if (o instanceof LongArrayWritable) {
    String[] myOut = ((LongArrayWritable) o).toStrings();
    for (int i = 0; i < myOut.length; i++) {
      out.write((myOut[i] + " ").getBytes(utf8));
    }
  } else {
    out.write(o.toString().getBytes(utf8));
  }
}
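
VLongArrayWritable and LongArrayWritable above are my own thin ArrayWritable
subclasses, not classes that ship with Hadoop; ArrayWritable.toStrings()
already returns the per-element strings, so all the subclass has to do is fix
the element type. A rough sketch of the LongWritable variant (the constructor
taking an array is just a convenience) could look like this:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.LongWritable;

// Minimal ArrayWritable subclass for arrays of LongWritable; the no-arg
// constructor is required so Hadoop can instantiate it during deserialization.
public class LongArrayWritable extends ArrayWritable {

  public LongArrayWritable() {
    super(LongWritable.class);
  }

  public LongArrayWritable(LongWritable[] values) {
    super(LongWritable.class, values);
  }
}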



Regards,
Zuhair Khayyat

On Sat, May 5, 2012 at 5:55 PM, Zuhair Khayyat
<zuhair.khay...@kaust.edu.sa> wrote:

> Thanks for the fast response. I think it is a good idea; however, the
> application becomes too slow with large output arrays. I would be more
> interested in a solution that helps speed up the "context.write()"
> itself.
>
>
> On Sat, May 5, 2012 at 5:36 PM, Zizon Qiu <zzd...@gmail.com> wrote:
>
>> For the timeout problem, you can use a background thread that invokes
>> context.progress() periodically, which acts as a "keep-alive" for the
>> forked Child (mapper/combiner/reducer)...
>> It is tricky, but it works.
>>
>>
>> On Sat, May 5, 2012 at 10:05 PM, Zuhair Khayyat <
>> zuhair.khay...@kaust.edu.sa> wrote:
>>
>>> Hi,
>>>
>>> I am building a MapReduce application that constructs the adjacency list
>>> of a graph from an input edge list. I noticed that my Reduce phase always
>>> hangs (and eventually times out) when it calls
>>> context.write(Key_x, Value_x) and Value_x is a very large ArrayWritable
>>> (around 4M elements). I have increased both "mapred.task.timeout" and the
>>> reducers' memory, but no luck; the reducer does not finish the job. Is
>>> there any other data format that supports a large amount of data, or
>>> should I write my own "OutputFormat" class to optimize writing the large
>>> amount of data?
>>>
>>>
>>> Thank you.
>>> Zuhair Khayyat
>>>
>>
>>
>
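
P.S. For anyone who only needs to get past the timeout (rather than speed up
the write itself), the keep-alive trick Zizon describes above can be sketched
roughly as below. The KeepAliveReducer name, the one-minute interval and the
setup()/cleanup() wiring are just illustrative choices, not the only way to
do it:

import org.apache.hadoop.mapreduce.Reducer;

// Rough sketch: a daemon thread that calls context.progress() periodically so
// the task is not killed by mapred.task.timeout while a long write is running.
public class KeepAliveReducer<K, V> extends Reducer<K, V, K, V> {

  private Thread reporter;
  private volatile boolean running = true;

  @Override
  protected void setup(final Context context) {
    reporter = new Thread(new Runnable() {
      public void run() {
        while (running) {
          context.progress();          // tell the framework this task is alive
          try {
            Thread.sleep(60 * 1000);   // report once a minute (arbitrary)
          } catch (InterruptedException e) {
            return;
          }
        }
      }
    });
    reporter.setDaemon(true);
    reporter.start();
  }

  @Override
  protected void cleanup(Context context) {
    running = false;
    reporter.interrupt();
  }
}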
