This is an automated email from the ASF dual-hosted git repository.
wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.2 by this push:
new f42cc10 [SPARK-36269][SQL] Fix only set data columns to Hive column names config
f42cc10 is described below
commit f42cc105129456126a7148cecfa937eedc06d5d1
Author: Cheng Su <[email protected]>
AuthorDate: Mon Jul 26 18:48:06 2021 +0800
[SPARK-36269][SQL] Fix only set data columns to Hive column names config
### What changes were proposed in this pull request?
When reading a Hive table, we set the Hive column id and column name configs
(`hive.io.file.readcolumn.ids` and `hive.io.file.readcolumn.names`). We should
set only non-partition columns (data columns) in both configs, as Spark always
[appends partition columns in its own Hive
reader](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L240).
Previously the column id config contained only non-partition columns, but the
column name config contained both partition and non-partition columns. This PR
filters the column names the same way as the column ids, so both configs end up
with data columns only.
### Why are the changes needed?
Makes the code logic consistent: both Hive column configs now receive only data
(non-partition) columns.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing Hive tests.
Closes #33489 from c21/hive-col.
Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e5616e32eecb516a6b46ae9bc5c2c850c18210a2)
Signed-off-by: Wenchen Fan <[email protected]>
---
.../scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala
index 6eeead5..936cca4 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala
@@ -117,8 +117,9 @@ case class HiveTableScanExec(
     // Specifies needed column IDs for those non-partitioning columns.
     val columnOrdinals = AttributeMap(relation.dataCols.zipWithIndex)
     val neededColumnIDs = output.flatMap(columnOrdinals.get).map(o => o: Integer)
+    val neededColumnNames = output.filter(columnOrdinals.contains).map(_.name)
 
-    HiveShim.appendReadColumns(hiveConf, neededColumnIDs, output.map(_.name))
+    HiveShim.appendReadColumns(hiveConf, neededColumnIDs, neededColumnNames)
 
     val deserializer = tableDesc.getDeserializerClass.getConstructor().newInstance()
     deserializer.initialize(hiveConf, tableDesc.getProperties)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]