[ https://issues.apache.org/jira/browse/HIVE-18572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
gurmukh singh updated HIVE-18572:
---------------------------------
    Description:
The record reader needs to be fixed for Tez: it generates only 1 split because the MRv2 CombineInputFormat broke the rule at the HiveInputFormat line linked below. This has been fixed in MR but not in Tez.

====== Issue ====

I am seeing strange behaviour in Tez: under Hive it sees all the data as a single split, whereas MR sees all 79 files. As a result, all the data goes to a single map task.

TEZ Processing

INFO : Partition trusted.usage{ds=20180126, periode=1200} stats: [numFiles=1, numRows=79575067, totalSize=3.164.605.993, rawDataSize=112439569671]
ELAPSED TIME: 1958.99 s

MR Processing

Partition trusted.usage{ds=20180126, periode=1200} stats: [numFiles=79, numRows=79575067, totalSize=3172280778, rawDataSize=112418416260]
ELAPSED TIME: 65 s

Log Tez

2018-01-29 16:50:04,825 [INFO] [InputInitializer {Map 1} #0] |split.TezMapredSplitsGrouper|: Desired splits: 381 too large. Desired splitLength: 8311476 Min splitLength: 50331648 New desired splits: 381 Final desired splits: 381 All splits have localhost: false Total length: 19166265870 Original splits: 1
2018-01-29 16:50:04,825 [INFO] [InputInitializer {Map 1} #0] |split.TezMapredSplitsGrouper|: Using original number of splits: 1 desired splits: 381
2018-01-29 16:50:04,826 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Original split size is 1 grouped split size is 1, for bucket: 1
2018-01-29 16:50:04,827 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of grouped splits: 1
2018-01-29 16:50:04,846 [INFO] [InputInitializer {Map 1} #0] |dag.RootInputInitializerManager|: Succeeded InputInitializer for Input: usage on vertex vertex_1517207496169_0085_1_00 [Map 1]
2018-01-29 16:50:04,848 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Cannot init vertex: vertex_1517207496169_0085_1_00 [Map 1] numTasks: -1 numUnitializedEdges: 0 numInitializedInputs: 1 initWaitsForRootInitializers: true
2018-01-29 16:50:04,848 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Got updated RootInputsSpecs: {usage=forAllWorkUnits=true, update=[1]}
2018-01-29 16:50:04,859 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Vertex vertex_1517207496169_0085_1_00 [Map 1] parallelism set to 1

As per discussion with Gopal Vijayaraghavan:

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L494]
that line, right there

MRv2 CombineInputFormat broke that rule, so the record readers had to be fixed to handle it

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java#L312]

> The record readers; InputFormat needs to be fixed for Tez as it generates 1 split
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-18572
>                 URL: https://issues.apache.org/jira/browse/HIVE-18572
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 2.1.0
>            Reporter: gurmukh singh
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
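
The figures in the TezMapredSplitsGrouper log lines of this report are consistent with a simple model of split grouping, sketched below. This is an illustration only, not Tez's actual code: the function and parameter names are hypothetical, and the starting count of 2306 is inferred from the logged "Desired splitLength" rather than stated anywhere in the log. The sketch suggests why a desired count of 381 still ends in a parallelism of 1: grouping can merge splits but cannot divide the single split that the InputFormat returned.

```python
# Simplified model of the split-count arithmetic visible in the Tez log.
# NOT Tez's TezMapredSplitsGrouper implementation; all names are hypothetical.

def final_split_count(total_length, original_splits, desired_splits, min_split_length):
    """Return (adjusted_desired, final) split counts under the simplified model."""
    length_per_group = total_length // desired_splits
    if length_per_group < min_split_length:
        # Groups would be smaller than the minimum, so lower the desired
        # count until each group holds at least min_split_length bytes.
        desired_splits = total_length // min_split_length + 1
    # Grouping only merges splits, so the result is capped by the number
    # of splits the InputFormat produced in the first place.
    return desired_splits, min(original_splits, desired_splits)


TOTAL_LENGTH = 19166265870    # "Total length" from the log
MIN_SPLIT_LENGTH = 50331648   # "Min splitLength" from the log (48 MiB)
ORIGINAL_SPLITS = 1           # "Original splits" from the log

# Assumption: 2306 is an inferred starting value, chosen because
# TOTAL_LENGTH // 2306 equals the logged "Desired splitLength" of 8311476.
adjusted, final = final_split_count(TOTAL_LENGTH, ORIGINAL_SPLITS, 2306, MIN_SPLIT_LENGTH)
print(adjusted, final)  # prints "381 1"
```

Under the same model, passing 79 original splits (the file count MR reports) instead of 1 would give a final count of 79, which matches the reporter's point that the loss of parallelism comes from the single original split and not from the grouping thresholds.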