[
https://issues.apache.org/jira/browse/TEZ-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908905#comment-13908905
]
Bikas Saha commented on TEZ-873:
--------------------------------
This jira, as it stands, is invalid. InputSplit is an MR concept, so in general
it should not move into Tez APIs.
We have a design choice in Tez to make Inputs and Outputs independent of
processors as far as the framework is concerned. That prevents any
framework-induced binding on the input/output/processor code. Compatibility of
inputs/outputs/processors is currently left to the user (e.g. do all of them use
KeyValues) but may later be statically checked at compile time by the framework
via annotations.
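For illustration only, here is a minimal sketch of how an input and a processor
could each declare the data format they work with, so that a mismatch can be
detected without either class referring to the other. Every name below
(Produces, Consumes, the Sketch* classes, compatible) is hypothetical and not
part of Tez. The check here runs at runtime via reflection to keep the example
short; a real compile-time check would need an annotation processor.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotations, invented for this sketch.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Produces { String value(); }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Consumes { String value(); }

// Stand-in input and processor classes; none of them references another.
@Produces("KeyValues")
class SketchKeyValueInput {}

@Produces("RawBytes")
class SketchRawBytesInput {}

@Consumes("KeyValues")
class SketchKeyValueProcessor {}

class CompatibilitySketch {
    // True only when both sides declare a format and the formats match.
    static boolean compatible(Class<?> input, Class<?> processor) {
        Produces produced = input.getAnnotation(Produces.class);
        Consumes consumed = processor.getAnnotation(Consumes.class);
        return produced != null
                && consumed != null
                && produced.value().equals(consumed.value());
    }
}
```

The binding lives only in the declared format strings, so the input and the
processor stay independent of each other in code.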
Getting the file name via splits is probably a hack which we did not want to
continue to support in MRInput, so we created MRInputLegacy to support it. If
you feel that additions are needed to make this work with TezGroupedSplits,
then please go ahead and make the improvement to MRInputLegacy and change the
jira title to reflect this.
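As a rough sketch of the kind of improvement meant here, the code below shows
how a processor could collect file names from a legacy input whose split may be
a grouped split. It is self-contained and uses stand-in types only: the
Sketch* classes merely mimic the shape of InputSplit, FileSplit,
TezGroupedSplit and MRInputLegacy, and getWrappedSplits() is a hypothetical
accessor standing in for the one the issue proposes. Only the name
getNewInputSplit() is taken from the issue text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in types invented for this sketch; they are not the Tez or MR classes.
interface SketchSplit {}

class SketchFileSplit implements SketchSplit {
    private final String path;
    SketchFileSplit(String path) { this.path = path; }
    String getPath() { return path; }
}

class SketchGroupedSplit implements SketchSplit {
    private final List<SketchSplit> wrapped;
    SketchGroupedSplit(List<SketchSplit> wrapped) {
        this.wrapped = new ArrayList<>(wrapped);
    }
    // Hypothetical accessor: exposes the splits that were grouped together.
    List<SketchSplit> getWrappedSplits() {
        return Collections.unmodifiableList(wrapped);
    }
}

class SketchLegacyInput {
    private final SketchSplit split;
    SketchLegacyInput(SketchSplit split) { this.split = split; }
    // Named after getNewInputSplit() as mentioned in the issue.
    SketchSplit getNewInputSplit() { return split; }
}

class SplitFileNameSketch {
    // Returns the file name of every file split reachable from the input.
    static List<String> fileNames(SketchLegacyInput input) {
        List<String> names = new ArrayList<>();
        collect(input.getNewInputSplit(), names);
        return names;
    }

    // Unwraps grouped splits recursively; other split types are ignored.
    private static void collect(SketchSplit split, List<String> out) {
        if (split instanceof SketchGroupedSplit) {
            for (SketchSplit inner : ((SketchGroupedSplit) split).getWrappedSplits()) {
                collect(inner, out);
            }
        } else if (split instanceof SketchFileSplit) {
            String path = ((SketchFileSplit) split).getPath();
            out.add(path.substring(path.lastIndexOf('/') + 1));
        }
    }
}
```

A grouped split can cover several files, so the helper returns a list rather
than a single name. All of the split handling stays on the legacy input side,
and the processor only calls the helper.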
> Allow Tez processor access to meta-data through input split
> -----------------------------------------------------------
>
> Key: TEZ-873
> URL: https://issues.apache.org/jira/browse/TEZ-873
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Mohammad Kamrul Islam
> Assignee: Mohammad Kamrul Islam
>
> Currently there is no way of getting the InputSplit from a TezProcessor. In
> the current MR framework, there is a way to find out the filename through
> FileSplit. For example, one common use is to get the filename in map:
> String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
> There is other meta-data in InputSplit that could be used by existing MR
> users.
> This JIRA is to expose the InputSplit by adding two APIs:
> TezGroupedSplit.getWrapperSplit() and MRInput.getInputSplit().
> Although MRInputLegacy provides an API to get the InputSplit, it has a few
> issues:
> * Without TezGroupedSplit.getWrapperSplit() it is unusable.
> * Since it is used in various use cases, I propose to move it from
> MRInputLegacy to MRInput.
> * Currently the APIs are named getNewInputSplit() and getOldInputSplit().
> These should be merged into one: getInputSplit(). The new/old API should be
> handled internally.
> Please give your feedback.