[ 
https://issues.apache.org/jira/browse/HIVE-18572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gurmukh singh updated HIVE-18572:
---------------------------------
    Description: 
The record reader needs to be fixed for Tez: it generates only 1 split because 
MRv2 CombineInputFormat broke that rule (see the discussion with Gopal 
Vijayaraghavan below).

This has been fixed in MR but not in Tez.

====== Issue ======

I am seeing strange behaviour in Tez: under Hive it sees all the data as a 
single split, whereas MR sees all 79 files. As a result, all the data goes to 
a single map task.

TEZ Processing
 INFO  : Partition trusted.usage\{ds=20180126, periode=1200} stats: 
[numFiles=1, numRows=79575067, totalSize=3.164.605.993, 
rawDataSize=112439569671]
 ELAPSED TIME: 1958.99 s

MR Processing
 Partition trusted.usage\{ds=20180126, periode=1200} stats: [numFiles=79, 
numRows=79575067, totalSize=3172280778, rawDataSize=112418416260]
 ELAPSED TIME: 65 s

Log Tez
 2018-01-29 16:50:04,825 [INFO] [InputInitializer \{Map 1} #0|#0] 
|split.TezMapredSplitsGrouper|: Desired splits: 381 too large.  Desired 
splitLength: 8311476 Min splitLength: 50331648 New desired splits: 381 Final 
desired splits: 381 All splits have localhost: false Total length: 19166265870 
Original splits: 1
 2018-01-29 16:50:04,825 [INFO] [InputInitializer \{Map 1} #0|#0] 
|split.TezMapredSplitsGrouper|: Using original number of splits: 1 desired 
splits: 381
 2018-01-29 16:50:04,826 [INFO] [InputInitializer \{Map 1} #0|#0] 
|tez.SplitGrouper|: Original split size is 1 grouped split size is 1, for 
bucket: 1
 2018-01-29 16:50:04,827 [INFO] [InputInitializer \{Map 1} #0|#0] 
|tez.HiveSplitGenerator|: Number of grouped splits: 1
 2018-01-29 16:50:04,846 [INFO] [InputInitializer \{Map 1} #0|#0] 
|dag.RootInputInitializerManager|: Succeeded InputInitializer for Input: usage 
on vertex vertex_1517207496169_0085_1_00 [Map 1]
 2018-01-29 16:50:04,848 [INFO] [App Shared Pool - #0|#0] |impl.VertexImpl|: 
Cannot init vertex: vertex_1517207496169_0085_1_00 [Map 1] numTasks: -1 
numUnitializedEdges: 0 numInitializedInputs: 1 initWaitsForRootInitializers: 
true
 2018-01-29 16:50:04,848 [INFO] [App Shared Pool - #0|#0] |impl.VertexImpl|: 
Got updated RootInputsSpecs: {usage=forAllWorkUnits=true, update=[1]}
 2018-01-29 16:50:04,859 [INFO] [App Shared Pool - #0|#0] |impl.VertexImpl|: 
Vertex vertex_1517207496169_0085_1_00 [Map 1] parallelism set to 1
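
The numbers in the log above can be reproduced with a short sketch. This is a simplified model inferred from the log lines, not the actual TezMapredSplitsGrouper code; the function names are hypothetical, and the 79-split case is an assumption (it supposes each of the 79 files would yield one split). It illustrates why grouping cannot help here: the grouper merges splits but never divides one, so a single original split stays a single split no matter how many are desired.

```python
def desired_splits_for_min_length(total_length: int, min_split_length: int) -> int:
    """Split count implied by the minimum split length (hypothetical helper)."""
    return total_length // min_split_length + 1


def grouped_split_count(original_splits: int, desired_splits: int) -> int:
    """Simplified model: grouping only merges splits, it never divides them."""
    if desired_splits == 0 or original_splits <= desired_splits:
        # Matches "Using original number of splits: 1 desired splits: 381".
        return original_splits
    # Assumption: otherwise the grouper merges down to the desired count.
    return desired_splits


total_length = 19166265870      # "Total length" from the log
min_split_length = 50331648     # "Min splitLength" from the log

desired = desired_splits_for_min_length(total_length, min_split_length)
print(desired)                           # 381, as in "New desired splits: 381"
print(grouped_split_count(1, desired))   # 1, the parallelism Tez ends up with
print(grouped_split_count(79, desired))  # 79, if one split per file were produced
```

Under this model, the fix has to happen before grouping, where the original splits are generated.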

As per discussion with Gopal Vijayaraghavan:

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L494]
  At that line, MRv2 CombineInputFormat broke that rule, so the record 
readers had to be fixed to handle it:
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java#L312]


> The record readers; InputFormat needs to be fixed for Tez as it generates 1 
> split
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-18572
>                 URL: https://issues.apache.org/jira/browse/HIVE-18572
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 2.1.0
>            Reporter: gurmukh singh
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
