[jira] [Updated] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-3291:
    Fix Version/s: 0.8.4
                   0.9.0

> Optimize splits grouping when locality information is not available
> ---
>
>                 Key: TEZ-3291
>                 URL: https://issues.apache.org/jira/browse/TEZ-3291
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>             Fix For: 0.9.0, 0.8.4
>
>         Attachments: TEZ-3291.004.patch, TEZ-3291.2.patch, TEZ-3291.3.patch,
> TEZ-3291.4.patch, TEZ-3291.5.patch, TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain the location details. S3
> is an example, where all splits would have "localhost" for the location
> details. In such cases, the current split computation does not go through the
> rack-local and allow-small-groups optimizations and ends up creating a small
> number of splits. Depending on the cluster, this can end up creating
> long-running map jobs.
> Example with Hive:
> ==
> 1. The inventory table in the TPC-DS dataset is partitioned and is relatively
> small.
> 2. With query-22, Hive requests grouping with an original split count of 52,
> and the overall length of the splits is around 12061817 bytes.
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In Tez splits grouping, this ends up creating a single split, with 52+
> files to be processed in that split. In clusters with split locations, this
> would have ended up with multiple splits, since {{allowSmallGroups}} would
> have kicked in.
> But in S3, since everything would have "localhost", all splits get added to
> a single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed by a single task
> sequentially. Had they been processed by multiple tasks, response time
> would have been drastically reduced.
> E.g. log details
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] |split.TezMapredSplitsGrouper|: Desired splits: 110 too large. Desired splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total length: 12061817 Original splits: 52
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: 12061817 numOriginalSplits: 52 . Grouping by length: true count: false
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 splitsProcessed: 52
> {noformat}
> Alternate options:
> ==
> 1. Force Hadoop to provide bogus locations for S3. But not sure, if that
> would be accepted anytime soon. Ref: HADOOP-12878
> 2. Set {{tez.grouping.min-size}} to very very low value. But should the end
> user always be doing this on query to query basis?
> 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute
> desiredNumSplits only when number of distinct locations in the splits is > 1.
> This would force more number of splits to be generated.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
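The numbers in the log above can be reproduced with a short sketch. This is only an illustration under stated assumptions: the class and method names are hypothetical, the 1 GB maximum group size is an assumed value that the log does not show, and the `totalLength / bound + 1` rule is inferred from the logged values rather than taken from any of the attached patches.

```java
// Hypothetical sketch of the desired-split adjustment seen in the log.
// Names and the max bound are assumptions, not Tez's actual API.
final class GroupingSketch {

    /** Clamps the requested split count so each group stays within [min, max] bytes. */
    static int adjustDesiredSplits(long totalLength, int desiredSplits,
                                   long minLengthPerGroup, long maxLengthPerGroup) {
        long lengthPerGroup = totalLength / desiredSplits;
        if (lengthPerGroup > maxLengthPerGroup) {
            // Groups would be too large: ask for more splits.
            return (int) Math.min(Integer.MAX_VALUE, totalLength / maxLengthPerGroup + 1);
        }
        if (lengthPerGroup < minLengthPerGroup) {
            // Groups would be too small: ask for fewer splits.
            return (int) Math.min(Integer.MAX_VALUE, totalLength / minLengthPerGroup + 1);
        }
        return desiredSplits;
    }

    public static void main(String[] args) {
        long totalLength = 12061817L;          // "Total length" from the log
        long minLength = 16L * 1024 * 1024;    // tez.grouping.min-size = 16 MB
        long maxLength = 1024L * 1024 * 1024;  // assumed 1 GB upper bound
        // 12061817 / 110 = 109652, matching "Desired splitLength" in the log.
        System.out.println("splitLength = " + (totalLength / 110));
        // 12061817 / 16777216 = 0, so the adjusted count is 0 + 1 = 1.
        System.out.println("new desired = "
                + adjustDesiredSplits(totalLength, 110, minLength, maxLength));
    }
}
```

Because the whole input (about 12 MB) is smaller than a single minimum-size group (16 MB), the adjusted count collapses to 1 regardless of how many original splits exist, which is the behaviour the issue describes.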
[jira] [Updated] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-3291:
----------------------------------
    Attachment: TEZ-3291.004.patch
[jira] [Updated] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-3291:
----------------------------------
    Attachment: TEZ-3291.5.patch

Modified the patch to do the following.

Tez does not need to handle special cases like localhost. Instead, if none of the splits has any location information and the overall length of the splits is less than the min split size, Tez can skip computing a new desired split count (i.e. by saying that Tez does not have enough information to compute new desired splits). Higher-level apps can handle the localhost scenario and remove the locations if needed. For instance, Tez already exposes SplitLocationProvider, which higher-level apps can use by passing the appropriate location provider when calling getGroupedSplits.
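The rule described for TEZ-3291.5.patch can be sketched roughly as follows. This is a simplified illustration, not the patch itself: the class and method names are hypothetical, split locations are modelled as plain `String[]` arrays, and "overall splits is less than the min-split size" is read here as total length below the minimum group size.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: skip the min-size recomputation when no split
// carries any location information. Not Tez's actual implementation.
final class NoLocalitySketch {

    /** True when every split has a null or empty location array. */
    static boolean hasNoLocations(List<String[]> splitLocations) {
        for (String[] locations : splitLocations) {
            if (locations != null && locations.length > 0) {
                return false;
            }
        }
        return true;
    }

    static int desiredSplits(long totalLength, int desiredSplits,
                             long minLengthPerGroup, List<String[]> splitLocations) {
        long lengthPerGroup = totalLength / desiredSplits;
        if (lengthPerGroup < minLengthPerGroup) {
            if (totalLength < minLengthPerGroup && hasNoLocations(splitLocations)) {
                // Not enough information to justify collapsing the splits.
                return desiredSplits;
            }
            return (int) Math.min(Integer.MAX_VALUE, totalLength / minLengthPerGroup + 1);
        }
        return desiredSplits;
    }

    public static void main(String[] args) {
        // 52 splits with no locations, as a higher-level app might pass them.
        List<String[]> noLocations = Collections.nCopies(52, new String[0]);
        System.out.println(desiredSplits(12061817L, 110, 16777216L, noLocations));
    }
}
```

With this shape, the decision to strip "localhost" stays with the caller, which matches the intent stated above that Tez itself should not special-case it.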
[jira] [Updated] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-3291:
----------------------------------
    Attachment: TEZ-3291.4.patch

Attaching the patch, which explicitly checks whether all splits have "localhost" as their location (for S3). Added an additional test case.
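A check of the kind described for TEZ-3291.4.patch might look like the sketch below. It is an assumption-laden illustration: the class and method names are hypothetical, and locations are modelled as `String[]` arrays rather than real split objects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an "all splits report localhost" check.
final class LocalhostCheckSketch {

    /** True only when every split has at least one location and all of them are "localhost". */
    static boolean allLocalhost(List<String[]> splitLocations) {
        if (splitLocations.isEmpty()) {
            return false;
        }
        for (String[] locations : splitLocations) {
            if (locations == null || locations.length == 0) {
                return false;
            }
            for (String location : locations) {
                if (!"localhost".equals(location)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String[]> s3Like = Arrays.asList(
                new String[] {"localhost"}, new String[] {"localhost"});
        List<String[]> hdfsLike = Arrays.asList(
                new String[] {"node1", "node2"}, new String[] {"localhost"});
        System.out.println(allLocalhost(s3Like));
        System.out.println(allLocalhost(hdfsLike));
    }
}
```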
[jira] [Updated] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-3291:
----------------------------------
    Attachment: TEZ-3291.3.patch

Attaching patch to address review comments. S3 URLs can be explicitly checked in splits (by casting to FileSplit and checking getPath). But not sure if we need to restrict this only to FileSplits in future.
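The idea of checking a split's path for an S3 URL can be illustrated without Hadoop on the classpath. The sketch below uses `java.net.URI` as a stand-in for the path that `FileSplit.getPath()` would return. The class and method names are hypothetical, and the scheme list (`s3`, `s3n`, `s3a`) is an assumption about which filesystems should count as S3.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: decide from a path's URI scheme whether it points at S3.
final class S3PathSketch {

    // Assumed set of S3-backed schemes; adjust for the deployment.
    private static final Set<String> S3_SCHEMES =
            new HashSet<>(Arrays.asList("s3", "s3n", "s3a"));

    static boolean isS3Path(String path) {
        String scheme = URI.create(path).getScheme();
        // Paths without a scheme (e.g. "/tmp/data") are treated as non-S3.
        return scheme != null && S3_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isS3Path("s3a://bucket/inventory/part-00000"));
        System.out.println(isS3Path("hdfs://namenode:8020/warehouse/inventory"));
    }
}
```

As the comment above notes, a path-based check only works for file-backed splits, which is one reason a location-based rule may be preferable.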
[jira] [Updated] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-3291:
----------------------------------
    Attachment: TEZ-3291.2.patch

Thanks for the review [~bikassaha]. Attaching the revised patch.

It is hard to find out whether a split is coming from localhost or from S3. The number of nodes in the cluster could serve as a hint (to rule out a single-node cluster), but that information would not be available in the split grouper.

When {{lengthPerGroup > maxLengthPerGroup}}, it goes via the normal code path and gets more splits. It is also possible that a couple of groups end up with more splits than others. But enforcing a maximum number of splits per group when "tez.grouping.by-length" is turned on would be tricky.
[jira] [Updated] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-3291:
----------------------------------
    Attachment: TEZ-3291.WIP.patch

E.g. TPC-DS query_22 @ 200GB scale:

*Without Patch: 291.59 seconds*
{noformat}
Status: Running (Executing on YARN cluster with App id application_1464530613845_5129)

--
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--
Map 1 ..         container     SUCCEEDED      1          1        0        0       0       0
Map 4 ..         container     SUCCEEDED      1          1        0        0       0       0
Map 5 ..         container     SUCCEEDED      1          1        0        0       0       0
Map 6 ..         container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ..     container     SUCCEEDED      4          4        0        0       0       0
Reducer 3 ..     container     SUCCEEDED      1          1        0        0       0       0
--
VERTICES: 06/06  [==>>] 100%  ELAPSED TIME: 291.59 s
--
Status: DAG finished successfully in 291.59 seconds
{noformat}

*With Patch: 38.65 seconds*
{noformat}
--
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--
Map 1 ..         container     SUCCEEDED     27         27        0        0       0       0
Map 4 ..         container     SUCCEEDED      1          1        0        0       0       0
Map 5 ..         container     SUCCEEDED      1          1        0        0       0       0
Map 6 ..         container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ..     container     SUCCEEDED      4          4        0        0       0       0
Reducer 3 ..     container     SUCCEEDED      1          1        0        0       0       0
--
VERTICES: 06/06  [==>>] 100%  ELAPSED TIME: 38.65 s
--
Status: DAG finished successfully in 38.65 seconds
{noformat}

Map 1 is the vertex which scans the inventory table. With the patch, it ends up creating 27 tasks, which helps in improving overall response time.

> Optimize splits grouping when locality information is not available
> ---
>
>                 Key: TEZ-3291
>                 URL: https://issues.apache.org/jira/browse/TEZ-3291
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>         Attachments: TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain the location details. S3
> is an example, where all splits would have "localhost" for the location
> details.
> In such cases, the current split computation does not go through the
> rack-local and allow-small-groups optimizations and ends up creating a small
> number of splits. Depending on the cluster, this can end up creating
> long-running map jobs.
> Example with Hive:
> ==
> 1. The inventory table in the TPC-DS dataset is partitioned and is relatively
> small.
> 2. With query-22, Hive requests grouping with an original split count of 52,
> and the overall length of the splits is around 12061817 bytes.
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In Tez splits grouping, this ends up creating a single split, with 52+
> files to be processed in that split. In clusters with split locations, this
> would have ended up with multiple splits, since {{allowSmallGroups}} would
> have kicked in.
> But in S3, since everything would have "localhost", all splits get added to
> a single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed by a single task
> sequentially. Had they been processed by multiple tasks, response time
> would have been drastically reduced.
> E.g. log details
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer