[ https://issues.apache.org/jira/browse/TEZ-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ming Ma updated TEZ-3215:
-------------------------
    Attachment: TEZ-3215-6.patch

Thanks [~sseth]. Regarding OutputCommitter, it should be fine as long as the basePath passed to the write method is a relative path, because all files being written will then be under the OutputCommitter's work path. OutputCommitter#commitTask handles all files under its work path. However, if basePath is an absolute path, the temporary files will end up outside the OutputCommitter's work path. I have added a check for that. All other issues you raised have been fixed.

> Support for MultipleOutputs
> ---------------------------
>
>                 Key: TEZ-3215
>                 URL: https://issues.apache.org/jira/browse/TEZ-3215
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: TEZ-3215-2.patch, TEZ-3215-3.patch, TEZ-3215-4.patch,
> TEZ-3215-5.patch, TEZ-3215-6.patch, TEZ-3215.patch
>
>
> Here is the use case. A reducer might write its output to more than one file.
> The file name will be based on the mapper key. We don't know all possible
> keys ahead of time. In MR, MultipleOutputs provides such support. I couldn't
> find anything readily available in Tez.
> * Setting up one DataSink per file ahead of time won't work, as we don't know
> all possible keys.
> * Using MR MultipleOutputs directly from the Tez application processor. It
> isn't clear how to pass TaskInputOutputContext to MultipleOutputs.
> * Tez MROutput can create a DataSink based on the specified outputFormat, but
> it can't take MR MultipleOutputs.
> I ended up modifying Tez MROutput with a HashMap {{recordWriters}} to achieve
> this. If this is a solved problem, can anyone explain how to do it?
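
For illustration only, the two ideas discussed in this thread (one writer per output path, created on first use and kept in a map, plus rejecting an absolute basePath) might look roughly like the sketch below. It is not the code in TEZ-3215-6.patch and uses no Tez or Hadoop classes. The names `MultiOutputSketch`, `InMemoryWriter`, and `checkRelative` are hypothetical, the in-memory writer stands in for a real record writer, and the leading-slash test is a simplified assumption about what counts as an absolute path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-path writers created on demand, relative paths only.
class MultiOutputSketch {

    // Stand-in for a real record writer; it just buffers tab-separated records.
    static final class InMemoryWriter {
        final StringBuilder buf = new StringBuilder();

        void write(String key, String value) {
            buf.append(key).append('\t').append(value).append('\n');
        }
    }

    // One writer per relative base path, created the first time the path is seen.
    private final Map<String, InMemoryWriter> recordWriters = new HashMap<>();

    // Simplified assumption: a path starting with "/" is absolute. A real check
    // would also need to consider URIs with a scheme (for example hdfs://...).
    static void checkRelative(String basePath) {
        if (basePath == null || basePath.isEmpty() || basePath.startsWith("/")) {
            throw new IllegalArgumentException(
                "basePath must be a relative path: " + basePath);
        }
    }

    void write(String basePath, String key, String value) {
        checkRelative(basePath);
        recordWriters
            .computeIfAbsent(basePath, p -> new InMemoryWriter())
            .write(key, value);
    }

    int openWriters() {
        return recordWriters.size();
    }

    // Returns everything written under basePath, or "" if nothing was written.
    String contents(String basePath) {
        InMemoryWriter w = recordWriters.get(basePath);
        return w == null ? "" : w.buf.toString();
    }
}
```

Because writers are created lazily, the set of output paths does not have to be known ahead of time, which is the constraint described in the issue. Keeping every path relative means all files stay under a single work directory that a committer can handle as a unit.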