[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-3291:
----------------------------------
    Attachment: TEZ-3291.4.patch

Attaching the patch, which explicitly checks whether all splits have "localhost" as their location (as with S3). Added an additional test case.

> Optimize splits grouping when locality information is not available
> -------------------------------------------------------------------
>
>                 Key: TEZ-3291
>                 URL: https://issues.apache.org/jira/browse/TEZ-3291
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>         Attachments: TEZ-3291.2.patch, TEZ-3291.3.patch, TEZ-3291.4.patch,
> TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain the location details. S3
> is an example, where all splits would have "localhost" for the location
> details. In such cases, the current split computation does not go through the
> rack-local and allow-small-groups optimizations and ends up creating a small
> number of splits. Depending on the cluster, this can end up creating
> long-running map jobs.
> Example with hive:
> ==============
> 1. The inventory table in the tpc-ds dataset is partitioned and is a
> relatively small table.
> 2. With query-22, hive requests grouping with an original split count of 52,
> and the overall length of the splits is around 12061817 bytes.
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In tez splits grouping, this ends up creating a single split with 52+
> files to be processed in that split. In clusters with split locations, this
> would have ended up as multiple splits, since {{allowSmallGroups}} would
> have kicked in.
> But in S3, since everything would have "localhost", all splits get added to a
> single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed sequentially by a
> single task.
> Had the files been processed by multiple tasks, the response time would have
> been drastically lower.
> E.g. log details
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] |split.TezMapredSplitsGrouper|: Desired splits: 110 too large. Desired splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total length: 12061817 Original splits: 52
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: 12061817 numOriginalSplits: 52 . Grouping by length: true count: false
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 splitsProcessed: 52
> {noformat}
> Alternate options:
> ==================
> 1. Force Hadoop to provide bogus locations for S3. But not sure if that
> would be accepted anytime soon. Ref: HADOOP-12878
> 2. Set {{tez.grouping.min-size}} to a very low value. But should the end
> user always be doing this on a query-by-query basis?
> 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute
> desiredNumSplits only when the number of distinct locations in the splits is > 1.
> This would force a larger number of splits to be generated.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
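
For reference, alternate option 3 from the description can be sketched roughly as below, using the numbers from the quoted log (total length 12061817, 110 desired splits, min-size 16777216). This is a minimal illustration, not the code in the attached patch or in TezMapredSplitsGrouper: the class and method names are hypothetical, and the override formula `totalLength / minLengthPerGroup + 1` is an assumption chosen because it reproduces the "New desired splits: 1" value in the log.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and formula are assumptions, not the Tez implementation.
class SplitGroupingSketch {

    // Counts the distinct location strings reported across all splits.
    static int distinctLocations(List<String> splitLocations) {
        Set<String> distinct = new HashSet<>(splitLocations);
        return distinct.size();
    }

    // Behaviour visible in the log: when the per-group length would fall
    // below min-size, the desired split count is overridden by size.
    static int currentDesiredSplits(long totalLength, int desiredNumSplits,
                                    long minLengthPerGroup) {
        long lengthPerGroup = totalLength / desiredNumSplits;
        if (lengthPerGroup < minLengthPerGroup) {
            // Assumed formula, consistent with "New desired splits: 1".
            return (int) (totalLength / minLengthPerGroup) + 1;
        }
        return desiredNumSplits;
    }

    // Option 3: apply the min-size override only when the splits report
    // more than one distinct location; otherwise keep the original request.
    static int proposedDesiredSplits(long totalLength, int desiredNumSplits,
                                     long minLengthPerGroup,
                                     int numDistinctLocations) {
        if (numDistinctLocations > 1) {
            return currentDesiredSplits(totalLength, desiredNumSplits,
                                        minLengthPerGroup);
        }
        return desiredNumSplits;
    }
}
```

With the logged values, `currentDesiredSplits(12061817L, 110, 16777216L)` yields 1, while `proposedDesiredSplits(12061817L, 110, 16777216L, 1)` leaves the request at 110, so the 52 original splits would no longer be forced into a single group. How the grouper then reconciles a request of 110 with only 52 original splits is outside this sketch.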