[jira] [Updated] (TEZ-3215) Support for MultipleOutputs

Ming Ma (JIRA) Thu, 15 Sep 2016 15:27:47 -0700

     [ https://issues.apache.org/jira/browse/TEZ-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ming Ma updated TEZ-3215:
-------------------------
    Attachment: TEZ-3215-4.patch

Thanks [~aplusplus]!

bq. LAZY_OUTPUTFORMAT_OUTPUTFORMAT looks like a config for mapreduce
In the MR implementation, LAZY_OUTPUTFORMAT_OUTPUTFORMAT is used by both the 
mapreduce and the mapred versions of LazyOutputFormat. I am not sure exactly why.

bq. Using this method requires user to know what's the current value of 
'mapreduce.output.basename'
This configuration comes from 
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.BASE_OUTPUT_NAME. MR 
applications can set this value to override the default prefix. Tez's MROutput 
also supports this.

{noformat}
    if (useNewApi) {
      // set the output part name to have a unique prefix
      if (jobConf.get("mapreduce.output.basename") == null) {
        jobConf.set("mapreduce.output.basename", getOutputFileNamePrefix());
      }
    }
{noformat}
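
For illustration, here is a minimal sketch of the "default unless already set" behaviour shown above. It uses java.util.Properties as a stand-in for JobConf so that it runs without Hadoop or Tez on the classpath. The class name, the method resolveBasename and the prefix "part-v001-o000" are made up for this example and are not taken from the patch.

```java
import java.util.Properties;

// Stand-in sketch: java.util.Properties plays the role of JobConf here.
// This is NOT the Hadoop or Tez API; all names below are hypothetical.
final class BasenameDefaultSketch {
    static final String BASE_OUTPUT_NAME = "mapreduce.output.basename";

    // Apply the framework-generated prefix only when the application
    // has not already chosen its own base output name.
    static String resolveBasename(Properties conf, String frameworkPrefix) {
        if (conf.getProperty(BASE_OUTPUT_NAME) == null) {
            conf.setProperty(BASE_OUTPUT_NAME, frameworkPrefix);
        }
        return conf.getProperty(BASE_OUTPUT_NAME);
    }

    public static void main(String[] args) {
        // Nothing set by the application: the framework prefix is used.
        Properties untouched = new Properties();
        System.out.println(resolveBasename(untouched, "part-v001-o000"));

        // Application override: the framework prefix is ignored.
        Properties overridden = new Properties();
        overridden.setProperty(BASE_OUTPUT_NAME, "events");
        System.out.println(resolveBasename(overridden, "part-v001-o000"));
    }
}
```

The point of the null check is that an application-supplied value always wins, so a caller never needs to know the framework's default prefix in order to override it.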

Your suggestion that MROutputs should not support write(key, value) makes sense. 
Here is the updated patch, which also defines mapreduce.output.basename in 
MRJobConfig.

> Support for MultipleOutputs
> ---------------------------
>
>                 Key: TEZ-3215
>                 URL: https://issues.apache.org/jira/browse/TEZ-3215
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: TEZ-3215-2.patch, TEZ-3215-3.patch, TEZ-3215-4.patch, 
> TEZ-3215.patch
>
>
> Here is the use case. A reducer might write its output to more than one file. 
> The file name will be based on the mapper key. We don't know all possible 
> keys ahead of time. In MR, MultipleOutputs provides such support. I couldn't 
> find anything readily available in Tez.
> * Setting up one DataSink per file ahead of time won't work, as we don't know all 
> possible keys.
> * Use MR MultipleOutputs directly from the Tez application processor. It 
> isn't clear how to pass TaskInputOutputContext to MultipleOutputs.
> * Tez MROutput can create a DataSink based on the specified outputFormat. But 
> it can't take MR MultipleOutputs.
> I ended up modifying Tez MROutput with a HashMap {{recordWriters}} to achieve 
> this. If this is a solved problem, can anyone explain how to do it?
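
The HashMap {{recordWriters}} idea in the description can be sketched without Hadoop or Tez as follows. This is only an illustration of lazily opening one writer per output name when the names are not known ahead of time. It writes plain local files through java.nio, and the class PerKeyWriters and its methods are hypothetical names, not the code in the attached patches.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one lazily created writer per output file name.
final class PerKeyWriters implements AutoCloseable {
    private final Path outputDir;
    private final Map<String, BufferedWriter> recordWriters = new HashMap<>();

    PerKeyWriters(Path outputDir) {
        this.outputDir = outputDir;
    }

    // Route a record to the file named baseName, opening it on first use.
    void write(String baseName, String key, String value) {
        BufferedWriter writer = recordWriters.computeIfAbsent(baseName, this::open);
        try {
            writer.write(key + "\t" + value);
            writer.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private BufferedWriter open(String baseName) {
        try {
            return Files.newBufferedWriter(outputDir.resolve(baseName), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int openWriterCount() {
        return recordWriters.size();
    }

    // Close every writer that was opened during the task.
    @Override
    public void close() {
        for (BufferedWriter writer : recordWriters.values()) {
            try {
                writer.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        recordWriters.clear();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("per-key-writers");
        try (PerKeyWriters out = new PerKeyWriters(dir)) {
            out.write("fruit", "apple", "1");
            out.write("veg", "leek", "2");
            out.write("fruit", "pear", "3");
        }
        System.out.println(Files.readAllLines(dir.resolve("fruit")));
    }
}
```

In a real output the map would hold RecordWriter instances created from the configured OutputFormat rather than BufferedWriter, and the writers would need to be closed and committed with the task. Those parts are left out of this sketch.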



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
