Re: MultipleOutputs or MultipleTextOutputFormat?

Ankur Goel Thu, 28 May 2009 22:16:30 -0700

One way of doing what you need is to extend MultipleTextOutputFormat and
override the following methods:


- generateFileNameForKeyValue()
- generateActualKey()
- generateActualValue()

You will need to prefix the directory and file name of your choice to the
key or value, depending upon your needs. Assuming the key and value types
are Text, here is some sample code for reference:

  // Needs org.apache.hadoop.fs.Path and org.apache.hadoop.io.Text; both methods
  // go in a class that extends MultipleTextOutputFormat<Text, Text>.

  @Override
  public String generateFileNameForKeyValue(Text key, Text v, String name) {
    /*
     * split the default name (e.g. part-00000 into ['part', '00000'])
     */
    String[] nameparts = name.split("-");

    String keyStr = key.toString();
    /*
     * assuming the desired filename is prefixed to the key and separated from
     * the actual key contents by '\t'
     */
    int idx = keyStr.indexOf("\t");
    if (idx < 0) {
      // no prefix in this key: keep the default file name instead of failing
      // in substring(0, -1)
      return name;
    }

    /*
     * get the file name
     */
    name = keyStr.substring(0, idx);

    /*
     * return a path of the form 'fileName/fileName-00000'.
     * This makes sure that a fileName dir is created under the job's output
     * dir and all the keys with that prefix go into reducer-specific files
     * under that dir.
     */
    return new Path(name, name + "-" + nameparts[1]).toString();
  }

  @Override
  public Text generateActualKey(Text key, Text value) {
    // strip the 'fileName\t' prefix so that only the real key is written
    String keyStr = key.toString();
    int idx = keyStr.indexOf("\t") + 1;
    return new Text(keyStr.substring(idx));
  }
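
To try the key-splitting logic without a Hadoop classpath, here is a minimal
sketch that mirrors the sample above with plain Strings in place of Text. The
class name KeyPrefixRouting and the method names fileNameFor and actualKey are
made up for illustration and are not part of Hadoop. It also assumes that a key
without a tab-separated prefix should simply keep the default part file name.

```java
/** Plain-String mirror of the key-prefix routing idea (no Hadoop types). */
final class KeyPrefixRouting {
    // Assumption: the file-name prefix is separated from the real key by a tab.
    private static final char SEP = '\t';

    /** Returns "prefix/prefix-NNNNN", or defaultName when the key has no prefix. */
    static String fileNameFor(String key, String defaultName) {
        int idx = key.indexOf(SEP);
        if (idx <= 0) {
            return defaultName; // no usable prefix: keep e.g. part-00000
        }
        String prefix = key.substring(0, idx);
        // Everything after the last '-' in the default name is the partition number.
        String partition = defaultName.substring(defaultName.lastIndexOf('-') + 1);
        return prefix + "/" + prefix + "-" + partition;
    }

    /** Returns the key with the "prefix<TAB>" part removed, if there is one. */
    static String actualKey(String key) {
        int idx = key.indexOf(SEP);
        return idx < 0 ? key : key.substring(idx + 1);
    }

    public static void main(String[] args) {
        String key = "200904\tdoc1";
        System.out.println(fileNameFor(key, "part-00000"));
        System.out.println(actualKey(key));
    }
}
```

With the key "200904\tdoc1" and the default name part-00000, fileNameFor gives
200904/200904-00000 and actualKey gives doc1, so the record would land in a
200904 directory under the job's output dir with the prefix stripped from the
key.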

Hope that helps.

-Ankur

----- Original Message -----
From: "Kevin Peterson" <kpeter...@biz360.com>
To: core-user@hadoop.apache.org
Sent: Friday, May 29, 2009 4:55:22 AM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi
Subject: MultipleOutputs or MultipleTextOutputFormat?

I am trying to figure out the best way to split output into different
directories. My goal is to have a directory structure allowing me to add the
content from each batch into the right bucket, like this:

...
/content/200904/batch_20090429
/content/200904/batch_20090430
/content/200904/batch_20090501
/content/200904/batch_20090502
/content/200905/batch_20090430
/content/200905/batch_20090501
/content/200905/batch_20090502
...

I would then run my nightly jobs to build the index on /content/200904/* for
the April index and /content/200905/* for the May index.

I'm not sure whether I would be better off using MultipleOutputs or
MultipleTextOutputFormat. I'm having trouble understanding how I set the
output path for these two classes. It seems like MultipleTextOutputFormat is
about partitioning data to different files within the same directory on the
key, rather than into different directories as I need. Could I get the
behavior I want by specifying date/batch as my filename, set output path to
some temporary work directory, then move /work/* to /content?

MultipleOutputs seems to be more about outputting all the data in different
formats, but it's supposed to be simpler to use. Reading it, it seems to be
better documented and the API makes more sense (choosing the output
explicitly in the map or reduce, rather than hiding this decision in the
output format), but I don't see any way to set a file name. If I am using
TextOutputFormat, I see no way to put these into different directories.
