[ https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021119#comment-17021119 ]
Ahmed Hussein commented on TEZ-3391:
------------------------------------

I agree with [~rohini] that the implementation is not efficient. The ideal fix is to read the object array {{TaskSplitMetaInfo[]}} only once, do all the validation in the AM, and then pass {{TaskSplitMetaInfo[index]}} to the task initializer. This may require significant code changes.

The existing code also has significant space overhead: each task creates its own array of split meta info, so the overall space complexity is {{n^2}}. The patch reduces the space complexity, but each task still needs to go through the entire meta file.

[~jeagles], can you please take a look at the patch and merge it at your convenience?

> MR split file validation should be done in the AM
> -------------------------------------------------
>
>                 Key: TEZ-3391
>                 URL: https://issues.apache.org/jira/browse/TEZ-3391
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Ahmed Hussein
>            Priority: Major
>         Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
>
> We had a case where Split metadata size exceeded 10000000. Instead of job
> failing from validation during initialization in AM like mapreduce, each of
> the tasks failed doing that validation during initialization.
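
For illustration only, the "ideal fix" described in the comment (validate the split metadata once in the AM, then hand each task only its own entry) might be sketched roughly as below. This is not the Tez code and not the attached patch: {{SplitMeta}}, {{SplitValidationSketch}}, and every method name are hypothetical stand-ins, and the limit constant simply reuses the 10000000 figure quoted in the issue description.

```java
// Hypothetical stand-in for TaskSplitMetaInfo; NOT the real Tez/MapReduce class.
final class SplitMeta {
    final long startOffset;
    final long length;

    SplitMeta(long startOffset, long length) {
        this.startOffset = startOffset;
        this.length = length;
    }
}

// Hypothetical sketch of "validate once in the AM, pass one entry per task".
class SplitValidationSketch {
    // Assumption: the limit mirrors the 10000000 value mentioned in the issue.
    static final long MAX_META_BYTES = 10_000_000L;

    // Intended to run once, in the AM: fail the whole job early if the
    // split metadata is too large, instead of failing in every task.
    static SplitMeta[] validateOnce(SplitMeta[] all, long metaFileBytes) {
        if (metaFileBytes > MAX_META_BYTES) {
            throw new IllegalStateException(
                "Split metadata size exceeded " + MAX_META_BYTES + ": " + metaFileBytes);
        }
        return all;
    }

    // Each task initializer receives only its own entry, not the full array.
    static SplitMeta forTask(SplitMeta[] validated, int taskIndex) {
        return validated[taskIndex];
    }

    // Entries held across the job when each of n tasks builds its own n-entry array.
    static long entriesHeldWithPerTaskArrays(int n) {
        return (long) n * n;
    }

    // Entries held across the job when each of n tasks gets a single entry.
    static long entriesHeldWithSingleEntry(int n) {
        return n;
    }

    public static void main(String[] args) {
        SplitMeta[] all = { new SplitMeta(0, 128), new SplitMeta(128, 128) };
        SplitMeta[] validated = validateOnce(all, 4096L);
        SplitMeta mine = forTask(validated, 1);
        System.out.println("task 1 split starts at " + mine.startOffset);
        System.out.println("per-task arrays, n=1000: " + entriesHeldWithPerTaskArrays(1000));
        System.out.println("single entry,   n=1000: " + entriesHeldWithSingleEntry(1000));
    }
}
```

The two counting helpers only restate the {{n^2}} argument from the comment: with per-task arrays, n tasks each hold n entries, whereas passing a single entry per task keeps the total at n.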