xiaoli created SPARK-35274:
------------------------------

             Summary: all columns of an old Hive table are read even when column pruning applies in Spark 3.0
                 Key: SPARK-35274
                 URL: https://issues.apache.org/jira/browse/SPARK-35274
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 3.0.0
         Environment: Spark 3.0
            Reporter: xiaoli
I asked this question [before|https://issues.apache.org/jira/browse/SPARK-35190], but perhaps I did not state the question clearly, so I got no answer. This time I will show an example to illustrate it.

{code:java}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Hive support is needed for "STORED AS ORC" tables and for
// spark.sql.hive.convertMetastoreOrc to apply.
val spark = SparkSession.builder().appName("OrcTest").enableHiveSupport().getOrCreate()

// Accumulate the bytes read by every task, so the two scans can be compared.
var inputBytes = 0L
spark.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    inputBytes += taskEnd.taskMetrics.inputMetrics.bytesRead
  }
})

// Columns named _col0/_col1/_col2 simulate a table whose ORC files, like
// those written before HIVE-4243, carry no real field names.
spark.sql("create table orc_table_old_schema (_col0 int, _col1 string, _col2 double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_old_schema select 1, 'name1', 1000.05")

inputBytes = 0L
spark.sql("select _col2 from orc_table_old_schema").show()
print("input bytes for old schema table: " + inputBytes) // prints 1655

spark.sql("create table orc_table_new_schema (id int, name string, value double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_new_schema select 1, 'name1', 1000.05")

inputBytes = 0L
spark.sql("select value from orc_table_new_schema").show()
print("input bytes for new schema table: " + inputBytes) // prints 1641
{code}

In this example I create the table orc_table_old_schema with columns named _col0/_col1/_col2, so that its schema has no real field names, like tables written before HIVE-4243; this triggers the issue. You can see that the input bytes for orc_table_old_schema are 14 more than for orc_table_new_schema. The reason is that Spark 3.0 reads all columns of the old-schema table but only the pruned columns of the new-schema table. (This behavior is under the flags spark.sql.hive.convertMetastoreOrc=true and spark.sql.orc.impl=native, both of which are Spark 3.0's default values.)

Then I dug into the Spark code and found this: [https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L149]

It looks like reading all columns for the old Hive schema (which has no field names) is by design in Spark 3.0.

My questions are:

#1 Do you have a plan to support column pruning when reading ORC tables with the old Hive schema?

#2 If the answer to #1 is no: is there some potential issue that would arise if the code were fixed to support column pruning?
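For question #2, here is a minimal sketch of the ordinal-based mapping I have in mind; the helper name prunedColumnIdsByOrdinal is mine, not Spark's, and this only illustrates the idea, not the actual OrcUtils code:

{code:java}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Hypothetical helper: resolve the pruned (required) columns to their ordinal
// positions in the Metastore schema. For pre-HIVE-4243 ORC files the physical
// schema only has positional names (_col0, _col1, ...), so matching the file
// by name is impossible, but an ordinal still identifies the right physical
// column as long as the file and the Metastore schema line up by position --
// the same assumption OrcUtils already makes when it maps the full dataSchema
// by index.
def prunedColumnIdsByOrdinal(
    dataSchema: StructType,      // full table schema from the Metastore
    requiredSchema: StructType   // columns that survived pruning
): Array[Int] = {
  requiredSchema.fieldNames.map(name => dataSchema.fieldIndex(name))
}

// For the example table above, "select _col2 from orc_table_old_schema"
// would request only ordinal 2 instead of all three columns:
val dataSchema = StructType(Seq(
  StructField("_col0", IntegerType),
  StructField("_col1", StringType),
  StructField("_col2", DoubleType)))
val requiredSchema = StructType(Seq(StructField("_col2", DoubleType)))
println(prunedColumnIdsByOrdinal(dataSchema, requiredSchema).mkString(",")) // prints 2
{code}

If I read OrcUtils correctly, the risk that presumably motivates the current behavior is schema evolution: once columns have been dropped or reordered, position-based ids can silently point at the wrong physical column, whereas reading the entire dataSchema keeps the by-index mapping unambiguous. That trade-off is exactly what I would like clarified in #1/#2.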