[ https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ahmed Hussein updated TEZ-3391:
-------------------------------
    Description:
During initialization, each task creates an array of objects {{TaskSplitMetaInfo[]}}. This is unnecessary space and time overhead, because each task needs only its own split object. Besides the {{n^2}} space complexity of the current implementation (each of the n tasks builds an array of n entries), it also leaks the input stream.

We need to optimize that implementation by returning only a single object instead of an entire array.

[~rohini] suggested the following:
{quote}
In the vertex construct TaskSplitMetaInfo only for the split of that task instead of constructing for all splits. i.e. change
public static TaskSplitMetaInfo[] readSplitMetaInfo(Configuration conf, FileSystem fs)
to
public static TaskSplitMetaInfo getSplitMetaInfo(Configuration conf, FileSystem fs, int index)
and skip reading splits below the index. If there are 1000 splits, the first task will read 1 split, second task will read 2 splits and so on instead of each task reading all the 1000 splits as is happening now.
{quote}

  was:
We had a case where Split metadata size exceeded 10000000. Instead of job failing from validation during initialization in AM like mapreduce, each of the tasks failed doing that validation during initialization.

        Summary: Optimize single split MR split reader  (was: MR split file validation should be done in the AM)

> Optimize single split MR split reader
> -------------------------------------
>
>                 Key: TEZ-3391
>                 URL: https://issues.apache.org/jira/browse/TEZ-3391
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Ahmed Hussein
>            Priority: Major
>         Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
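
The change proposed in the description (return one split's metadata by index, skip the entries before it, and close the stream) could look roughly like the sketch below. It is only an illustration and is not taken from the attached patches. The class {{SingleSplitReaderSketch}}, the {{SplitMeta}} type, the record layout (a split count, then a location string and a start offset per split), and the {{InputStream}} parameter are all assumptions made so that the example runs on the plain JDK; the real method would take {{Configuration}} and {{FileSystem}} as quoted above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class SingleSplitReaderSketch {

    /** Hypothetical stand-in for TaskSplitMetaInfo. */
    public static final class SplitMeta {
        public final String location;
        public final long startOffset;

        public SplitMeta(String location, long startOffset) {
            this.location = location;
            this.startOffset = startOffset;
        }
    }

    /** Writes a toy meta file: the split count, then one record per split. */
    public static byte[] writeMetaFile(SplitMeta[] splits) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(splits.length);
            for (SplitMeta s : splits) {
                out.writeUTF(s.location);
                out.writeLong(s.startOffset);
            }
        }
        return bytes.toByteArray();
    }

    /** Reads one record from the current stream position. */
    private static SplitMeta readRecord(DataInputStream in) throws IOException {
        String location = in.readUTF();
        long startOffset = in.readLong();
        return new SplitMeta(location, startOffset);
    }

    /**
     * Returns only the entry at the given index. Records before the index are
     * parsed and discarded (they are variable-length in this toy format), and
     * records after it are never read. try-with-resources closes the stream
     * on every path, including the error path.
     */
    public static SplitMeta getSplitMetaInfo(InputStream raw, int index) throws IOException {
        try (DataInputStream in = new DataInputStream(raw)) {
            int numSplits = in.readInt();
            if (index < 0 || index >= numSplits) {
                throw new IOException(
                    "Split index " + index + " out of range [0, " + numSplits + ")");
            }
            for (int i = 0; i < index; i++) {
                readRecord(in); // skipped entry, not retained
            }
            return readRecord(in);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] file = writeMetaFile(new SplitMeta[] {
            new SplitMeta("host0", 0L),
            new SplitMeta("host1", 100L),
            new SplitMeta("host2", 200L),
        });
        SplitMeta meta = getSplitMetaInfo(new ByteArrayInputStream(file), 2);
        System.out.println(meta.location + " @ " + meta.startOffset);
    }
}
```

With this shape a task holds a single object instead of an n-element array, and the task for index k reads k + 1 records, which matches the "first task will read 1 split, second task will read 2 splits" behaviour described in the quote.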