[ 
https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-3391:
-------------------------------
    Description: 
During initialization, each task creates a {{TaskSplitMetaInfo[]}} array that 
covers every split. This is unnecessary space and time overhead, because each 
task needs only its own split object. Besides giving the current 
implementation {{n^2}} space complexity overall (n tasks, each holding n 
split objects), this code path also leaks the input stream.

We need to optimize that implementation by returning only a single object 
instead of an entire array. 

[~rohini] suggested the following:
{quote}
In the vertex, construct TaskSplitMetaInfo only for the split of that task 
instead of constructing it for all splits, i.e. change 
{{public static TaskSplitMetaInfo[] readSplitMetaInfo(Configuration conf, 
FileSystem fs)}} to {{public static TaskSplitMetaInfo 
getSplitMetaInfo(Configuration conf, FileSystem fs, int index)}} and skip 
reading the splits that come after the index. If there are 1000 splits, the 
first task will read 1 split, the second task will read 2 splits, and so on, 
instead of each task reading all 1000 splits as happens now. 
{quote}
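
The suggestion above could be sketched roughly as follows. This is only an 
illustration, not the attached patch: it does not use the Tez or Hadoop APIs, 
and the class {{SplitMetaSketch}}, the type {{SplitMeta}}, the methods 
{{writeAll}} and {{readOne}}, and the record layout (an int count followed by 
one UTF string and two longs per split) are all assumptions made for the 
example. It shows reading only up to the requested index and closing the 
stream on every path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class SplitMetaSketch {

    /** Hypothetical stand-in for TaskSplitMetaInfo. */
    static final class SplitMeta {
        final String location;
        final long startOffset;
        final long inputDataLength;

        SplitMeta(String location, long startOffset, long inputDataLength) {
            this.location = location;
            this.startOffset = startOffset;
            this.inputDataLength = inputDataLength;
        }
    }

    /** Writes the assumed layout: int count, then (UTF, long, long) per split. */
    static byte[] writeAll(SplitMeta[] splits) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(splits.length);
            for (SplitMeta s : splits) {
                out.writeUTF(s.location);
                out.writeLong(s.startOffset);
                out.writeLong(s.inputDataLength);
            }
        }
        return bytes.toByteArray();
    }

    /**
     * Returns only the record at the given index. Records before it are
     * parsed and discarded (they are variable-length, so they cannot be
     * jumped over blindly); records after it are never read. The
     * try-with-resources block closes the stream even when an exception
     * is thrown.
     */
    static SplitMeta readOne(InputStream raw, int index) throws IOException {
        try (DataInputStream in = new DataInputStream(raw)) {
            int count = in.readInt();
            if (index < 0 || index >= count) {
                throw new IOException(
                    "split index " + index + " out of range [0, " + count + ")");
            }
            for (int i = 0; i < index; i++) {
                in.readUTF();
                in.readLong();
                in.readLong();
            }
            return new SplitMeta(in.readUTF(), in.readLong(), in.readLong());
        }
    }

    public static void main(String[] args) throws IOException {
        SplitMeta[] all = {
            new SplitMeta("host-a", 0L, 100L),
            new SplitMeta("host-b", 100L, 50L),
            new SplitMeta("host-c", 150L, 75L),
        };
        byte[] file = writeAll(all);
        // The task with index 1 materializes a single object, not an array.
        SplitMeta mine = readOne(new ByteArrayInputStream(file), 1);
        System.out.println(mine.location + " " + mine.startOffset
            + " " + mine.inputDataLength);
    }
}
```

With this shape each task holds one split object rather than n, so the 
memory held across all tasks grows linearly with the number of splits, 
although a task with a high index still parses the records before its own.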

  was:
  We had a case where the split metadata size exceeded 10000000. Instead of the 
job failing validation during initialization in the AM, as it does in MapReduce, 
each of the tasks failed that validation during its own initialization.

  

        Summary: Optimize single split MR split reader  (was: MR split file 
validation should be done in the AM)

> Optimize single split MR split reader
> -------------------------------------
>
>                 Key: TEZ-3391
>                 URL: https://issues.apache.org/jira/browse/TEZ-3391
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Ahmed Hussein
>            Priority: Major
>         Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
>
> During initialization, each task creates a {{TaskSplitMetaInfo[]}} array that 
> covers every split. This is unnecessary space and time overhead, because each 
> task needs only its own split object. Besides giving the current 
> implementation {{n^2}} space complexity overall (n tasks, each holding n 
> split objects), this code path also leaks the input stream.
> We need to optimize that implementation by returning only a single object 
> instead of an entire array. 
> [~rohini] suggested the following:
> {quote}
> In the vertex, construct TaskSplitMetaInfo only for the split of that task 
> instead of constructing it for all splits, i.e. change 
> {{public static TaskSplitMetaInfo[] readSplitMetaInfo(Configuration conf, 
> FileSystem fs)}} to {{public static TaskSplitMetaInfo 
> getSplitMetaInfo(Configuration conf, FileSystem fs, int index)}} and skip 
> reading the splits that come after the index. If there are 1000 splits, the 
> first task will read 1 split, the second task will read 2 splits, and so on, 
> instead of each task reading all 1000 splits as happens now. 
> {quote}



