[jira] [Commented] (SPARK-37034) What's the progress of vectorized execution for spark?

2021-11-02 Thread xiaoli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437213#comment-17437213
 ] 

xiaoli commented on SPARK-37034:


[~cloud_fan] Thanks very much. The hint in your answer helped me find the 
ApplyColumnarRulesAndInsertTransitions class in Spark; I can do more 
investigation now.
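
(Aside, not from the commenter: for anyone tracing the same code path, the public hook into these columnar rules is SparkSessionExtensions.injectColumnar, whose rules ApplyColumnarRulesAndInsertTransitions runs when it inserts the row/columnar transitions. The sketch below only shows that wiring; the extension and rule names are made up for illustration and the rules do nothing.)
{code:java}
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

// Hypothetical no-op rule: a real plugin would swap row-based operators for
// columnar-capable ones here.
case class NoopColumnarPlanRule(session: SparkSession) extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan
}

// Hypothetical extension class; register it via
// --conf spark.sql.extensions=my.pkg.MyColumnarExtensions
class MyColumnarExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectColumnar { session =>
      new ColumnarRule {
        // applied before the row<->columnar transitions are inserted
        override def preColumnarTransitions: Rule[SparkPlan] = NoopColumnarPlanRule(session)
        // applied after the transitions have been inserted
        override def postColumnarTransitions: Rule[SparkPlan] = NoopColumnarPlanRule(session)
      }
    }
  }
}
{code}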

> What's the progress of vectorized execution for spark?
> --
>
> Key: SPARK-37034
> URL: https://issues.apache.org/jira/browse/SPARK-37034
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: xiaoli
>Priority: Major
>
> Spark has supported vectorized reads for ORC and Parquet. What is the progress of 
> other vectorized execution, e.g. vectorized writes, joins, aggregations, and simple 
> operators (string functions, math functions)?
> Hive has supported vectorized execution since an early version 
> (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution). 
> As we know, Spark is a replacement for Hive. I guess the reason Spark does not 
> support vectorized execution may be that the design or implementation is harder in 
> Spark. What is the main issue blocking vectorized execution in Spark?






[jira] [Commented] (SPARK-37034) What's the progress of vectorized execution for spark?

2021-11-01 Thread xiaoli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436825#comment-17436825
 ] 

xiaoli commented on SPARK-37034:


 

[~dongjoon] [~yumwang] [~cloud_fan] Sorry to ping you, but there have been no 
answers or comments on this question for a week. All of you have contributed to 
Spark's vectorized read path, so I guess you may know something about it. Could 
you take a look at this question?

> What's the progress of vectorized execution for spark?
> --
>
> Key: SPARK-37034
> URL: https://issues.apache.org/jira/browse/SPARK-37034
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: xiaoli
>Priority: Major
>
> Spark has supported vectorized reads for ORC and Parquet. What is the progress of 
> other vectorized execution, e.g. vectorized writes, joins, aggregations, and simple 
> operators (string functions, math functions)?
> Hive has supported vectorized execution since an early version 
> (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution). 
> As we know, Spark is a replacement for Hive. I guess the reason Spark does not 
> support vectorized execution may be that the design or implementation is harder in 
> Spark. What is the main issue blocking vectorized execution in Spark?






[jira] [Updated] (SPARK-37034) What's the progress of vectorized execution for spark?

2021-10-17 Thread xiaoli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaoli updated SPARK-37034:
---
Description: 
Spark has supported vectorized reads for ORC and Parquet. What is the progress of 
other vectorized execution, e.g. vectorized writes, joins, aggregations, and simple 
operators (string functions, math functions)?

Hive has supported vectorized execution since an early version 
(https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution). 
As we know, Spark is a replacement for Hive. I guess the reason Spark does not 
support vectorized execution may be that the design or implementation is harder in 
Spark. What is the main issue blocking vectorized execution in Spark?

  was:
Spark has supported vectorized reads for ORC and Parquet. What is the progress of 
other vectorized execution, e.g. vectorized writes, joins, aggregations, and simple 
operators (string functions, math functions)?

Hive has supported vectorized execution since an [early version|https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution]. 
As we know, Spark is a replacement for Hive. I guess the reason Spark does not 
support vectorized execution may be that the design or implementation is harder in 
Spark. What is the main issue blocking vectorized execution in Spark?


> What's the progress of vectorized execution for spark?
> --
>
> Key: SPARK-37034
> URL: https://issues.apache.org/jira/browse/SPARK-37034
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: xiaoli
>Priority: Major
>
> Spark has supported vectorized reads for ORC and Parquet. What is the progress of 
> other vectorized execution, e.g. vectorized writes, joins, aggregations, and simple 
> operators (string functions, math functions)?
> Hive has supported vectorized execution since an early version 
> (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution). 
> As we know, Spark is a replacement for Hive. I guess the reason Spark does not 
> support vectorized execution may be that the design or implementation is harder in 
> Spark. What is the main issue blocking vectorized execution in Spark?






[jira] [Created] (SPARK-37034) What's the progress of vectorized execution for spark?

2021-10-17 Thread xiaoli (Jira)
xiaoli created SPARK-37034:
--

 Summary: What's the progress of vectorized execution for spark?
 Key: SPARK-37034
 URL: https://issues.apache.org/jira/browse/SPARK-37034
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: xiaoli


Spark has supported vectorized reads for ORC and Parquet. What is the progress of 
other vectorized execution, e.g. vectorized writes, joins, aggregations, and simple 
operators (string functions, math functions)?

Hive has supported vectorized execution since an [early version|https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution]. 
As we know, Spark is a replacement for Hive. I guess the reason Spark does not 
support vectorized execution may be that the design or implementation is harder in 
Spark. What is the main issue blocking vectorized execution in Spark?
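
(Aside, not from the reporter: the ORC/Parquet vectorized reads mentioned above are controlled by the configs below. A minimal sketch; I believe both default to true in recent releases, but treat the values as an assumption and check the documentation of the version in use.)
{code:java}
// Sketch: toggles for the existing vectorized *read* paths referenced above.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
// There are no corresponding switches for vectorized writes, joins, or
// aggregations, which is the gap this ticket asks about.
{code}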






[jira] [Updated] (SPARK-35274) old hive table's all columns are read when column pruning applies in spark3.0

2021-05-01 Thread xiaoli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaoli updated SPARK-35274:
---
External issue ID: HIVE-4243
   Labels: hive orc  (was: )

> old hive table's all columns are read when column pruning applies in spark3.0
> -
>
> Key: SPARK-35274
> URL: https://issues.apache.org/jira/browse/SPARK-35274
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark3.0
>Reporter: xiaoli
>Priority: Major
>  Labels: hive, orc
>
> I asked this question [before|https://issues.apache.org/jira/browse/SPARK-35190], 
> but perhaps I did not state the question clearly, so I did not get an answer. This 
> time I will show an example to illustrate it clearly.
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
>
> val spark = SparkSession.builder().appName("OrcTest").getOrCreate()
>
> var inputBytes = 0L
> spark.sparkContext.addSparkListener(new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     val metrics = taskEnd.taskMetrics
>     inputBytes += metrics.inputMetrics.bytesRead
>   }
> })
>
> spark.sql("create table orc_table_old_schema (_col0 int, _col1 string, _col2 double) STORED AS ORC")
> spark.sql("insert overwrite table orc_table_old_schema select 1, 'name1', 1000.05")
> inputBytes = 0L
> spark.sql("select _col2 from orc_table_old_schema").show()
> print("input bytes for old schema table: " + inputBytes) // prints 1655
>
> spark.sql("create table orc_table_new_schema (id int, name string, value double) STORED AS ORC")
> spark.sql("insert overwrite table orc_table_new_schema select 1, 'name1', 1000.05")
> inputBytes = 0L
> spark.sql("select value from orc_table_new_schema").show()
> print("input bytes for new schema table: " + inputBytes) // prints 1641
> {code}
> This example is run on Spark 3.0 with default flags. In this example, I create the 
> ORC table orc_table_old_schema, whose schema has no real field names (mimicking 
> tables written before HIVE-4243), to trigger this issue. You can see that the input 
> bytes for table orc_table_old_schema are 14 bytes more than for table 
> orc_table_new_schema.
> The reason is that Spark 3.0 by default uses the native reader rather than the Hive 
> reader to read ORC tables, and the native reader reads all columns for an old Hive 
> schema table but only the pruned columns (in this example, only the column 'value') 
> for a new Hive schema table.
> The following flags enable the native reader: set 
> spark.sql.hive.convertMetastoreOrc=true; set spark.sql.orc.impl=native; both are 
> Spark 3.0's default values.
> Then I dug into the Spark code and found this: 
> https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L149
> It looks like reading all columns for an old Hive schema (which has no field names) 
> is by design in Spark 3.0.
> In my company's data, some tables use the old Hive schema while others use the new 
> one. The performance of queries reading old Hive tables decreases a lot when I 
> enable the native reader in Spark 3.0. This is the main blocker for us to switch 
> from the Hive reader to the native reader in Spark 3.0.
> My questions are:
> #1 Do you have a plan to support column pruning for the old Hive schema in the 
> native ORC reader?
> #2 If the answer to question #1 is no, is there some potential issue if the code is 
> fixed to support column pruning?
>  
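
(Aside, not from the reporter: since the problem above appears only with the native reader, and the linked SPARK-35190 notes that the Hive reader does prune these old-schema tables, the interim workaround is to route such queries back through the Hive reader. A minimal sketch using the same flags quoted above; whether they can be flipped per session or must be set at startup may depend on the deployment, so treat that as an assumption.)
{code:java}
// Sketch of the fallback described above: disable the native ORC path so the
// old positional-schema table is read by the Hive reader, which does prune
// columns for it (at the cost of the native/vectorized scan).
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.conf.set("spark.sql.orc.impl", "hive")
spark.sql("select _col2 from orc_table_old_schema").show()
{code}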






[jira] [Updated] (SPARK-35274) old hive table's all columns are read when column pruning applies in spark3.0

2021-04-29 Thread xiaoli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaoli updated SPARK-35274:
---
Description: 
I asked this question [before|https://issues.apache.org/jira/browse/SPARK-35190], 
but perhaps I did not state the question clearly, so I did not get an answer. This 
time I will show an example to illustrate it clearly.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val spark = SparkSession.builder().appName("OrcTest").getOrCreate()

var inputBytes = 0L
spark.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    inputBytes += metrics.inputMetrics.bytesRead
  }
})

spark.sql("create table orc_table_old_schema (_col0 int, _col1 string, _col2 double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_old_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select _col2 from orc_table_old_schema").show()
print("input bytes for old schema table: " + inputBytes) // prints 1655

spark.sql("create table orc_table_new_schema (id int, name string, value double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_new_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select value from orc_table_new_schema").show()
print("input bytes for new schema table: " + inputBytes) // prints 1641
{code}
This example is run on Spark 3.0 with default flags. In this example, I create the 
ORC table orc_table_old_schema, whose schema has no real field names (mimicking 
tables written before HIVE-4243), to trigger this issue. You can see that the input 
bytes for table orc_table_old_schema are 14 bytes more than for table 
orc_table_new_schema.

The reason is that Spark 3.0 by default uses the native reader rather than the Hive 
reader to read ORC tables, and the native reader reads all columns for an old Hive 
schema table but only the pruned columns (in this example, only the column 'value') 
for a new Hive schema table.

The following flags enable the native reader: set 
spark.sql.hive.convertMetastoreOrc=true; set spark.sql.orc.impl=native; both are 
Spark 3.0's default values.

Then I dug into the Spark code and found this: 
https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L149

It looks like reading all columns for an old Hive schema (which has no field names) 
is by design in Spark 3.0.

In my company's data, some tables use the old Hive schema while others use the new 
one. The performance of queries reading old Hive tables decreases a lot when I 
enable the native reader in Spark 3.0. This is the main blocker for us to switch 
from the Hive reader to the native reader in Spark 3.0.

My questions are:

#1 Do you have a plan to support column pruning for the old Hive schema in the 
native ORC reader?

#2 If the answer to question #1 is no, is there some potential issue if the code is 
fixed to support column pruning?

 

  was:
I asked this question [before|https://issues.apache.org/jira/browse/SPARK-35190], 
but perhaps I did not state the question clearly, so I did not get an answer. This 
time I will show an example to illustrate it clearly.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val spark = SparkSession.builder().appName("OrcTest").getOrCreate()

var inputBytes = 0L
spark.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    inputBytes += metrics.inputMetrics.bytesRead
  }
})

spark.sql("create table orc_table_old_schema (_col0 int, _col1 string, _col2 double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_old_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select _col2 from orc_table_old_schema").show()
print("input bytes for old schema table: " + inputBytes) // prints 1655

spark.sql("create table orc_table_new_schema (id int, name string, value double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_new_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select value from orc_table_new_schema").show()
print("input bytes for new schema table: " + inputBytes) // prints 1641
{code}
This example is run on Spark 3.0 with default flags. In this example, I create the 
ORC table orc_table_old_schema, whose schema has no real field names (mimicking 
tables written before HIVE-4243), to trigger this issue. You can see that the input 
bytes for table orc_table_old_schema are 14 bytes more than for table 
orc_table_new_schema.

The reason is that Spark 3.0 by default uses the native reader rather than the Hive 
reader to read ORC tables, and the native reader reads all columns for an old Hive 
schema table but only the pruned columns for a new Hive schema table.

[jira] [Updated] (SPARK-35274) old hive table's all columns are read when column pruning applies in spark3.0

2021-04-29 Thread xiaoli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaoli updated SPARK-35274:
---
Description: 
I asked this question [before|https://issues.apache.org/jira/browse/SPARK-35190], 
but perhaps I did not state the question clearly, so I did not get an answer. This 
time I will show an example to illustrate it clearly.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val spark = SparkSession.builder().appName("OrcTest").getOrCreate()

var inputBytes = 0L
spark.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    inputBytes += metrics.inputMetrics.bytesRead
  }
})

spark.sql("create table orc_table_old_schema (_col0 int, _col1 string, _col2 double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_old_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select _col2 from orc_table_old_schema").show()
print("input bytes for old schema table: " + inputBytes) // prints 1655

spark.sql("create table orc_table_new_schema (id int, name string, value double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_new_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select value from orc_table_new_schema").show()
print("input bytes for new schema table: " + inputBytes) // prints 1641
{code}
This example is run on Spark 3.0 with default flags. In this example, I create the 
ORC table orc_table_old_schema, whose schema has no real field names (mimicking 
tables written before HIVE-4243), to trigger this issue. You can see that the input 
bytes for table orc_table_old_schema are 14 bytes more than for table 
orc_table_new_schema.

The reason is that Spark 3.0 by default uses the native reader rather than the Hive 
reader to read ORC tables, and the native reader reads all columns for an old Hive 
schema table but only the pruned columns (in this example, only the column 'value') 
for a new Hive schema table.

The following flags enable the native reader: set 
spark.sql.hive.convertMetastoreOrc=true; set spark.sql.orc.impl=native; both are 
Spark 3.0's default values.

Then I dug into the Spark code and found this: 
https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L149

It looks like reading all columns for an old Hive schema (which has no field names) 
is by design in Spark 3.0.

In my company's data, some tables use the old Hive schema while others use the new 
one. The performance of queries reading old Hive tables decreases a lot when I 
enable the native reader in Spark 3.0. This is the main blocker for us to switch 
from the Hive reader to the native reader in Spark 3.0.

My questions are:

#1 Do you have a plan to support column pruning for the old Hive schema in the 
native ORC reader?

#2 If the answer to question #1 is no, is there some potential issue if the code is 
fixed to support column pruning?

 

  was:
I asked this question before, but perhaps I did not state the question clearly, so I 
did not get an answer. This time I will show an example to illustrate it clearly.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val spark = SparkSession.builder().appName("OrcTest").getOrCreate()

var inputBytes = 0L
spark.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    inputBytes += metrics.inputMetrics.bytesRead
  }
})

spark.sql("create table orc_table_old_schema (_col0 int, _col1 string, _col2 double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_old_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select _col2 from orc_table_old_schema").show()
print("input bytes for old schema table: " + inputBytes) // prints 1655

spark.sql("create table orc_table_new_schema (id int, name string, value double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_new_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select value from orc_table_new_schema").show()
print("input bytes for new schema table: " + inputBytes) // prints 1641
{code}
In this example, I create the ORC table orc_table_old_schema, whose schema has no 
real field names (mimicking tables written before HIVE-4243), to trigger this issue. 
You can see that the input bytes for table orc_table_old_schema are 14 bytes more 
than for table orc_table_new_schema.

The reason is that Spark 3.0 by default uses the native reader rather than the Hive 
reader to read ORC tables, and the native reader reads all columns for an old Hive 
schema table but only the pruned columns (in this example, only the column 'value' 
is read) for a new Hive schema table.

[jira] [Updated] (SPARK-35274) old hive table's all columns are read when column pruning applies in spark3.0

2021-04-29 Thread xiaoli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaoli updated SPARK-35274:
---
Description: 
I asked this question before, but perhaps I did not state the question clearly, so I 
did not get an answer. This time I will show an example to illustrate it clearly.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val spark = SparkSession.builder().appName("OrcTest").getOrCreate()

var inputBytes = 0L
spark.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    inputBytes += metrics.inputMetrics.bytesRead
  }
})

spark.sql("create table orc_table_old_schema (_col0 int, _col1 string, _col2 double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_old_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select _col2 from orc_table_old_schema").show()
print("input bytes for old schema table: " + inputBytes) // prints 1655

spark.sql("create table orc_table_new_schema (id int, name string, value double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_new_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select value from orc_table_new_schema").show()
print("input bytes for new schema table: " + inputBytes) // prints 1641
{code}
In this example, I create the ORC table orc_table_old_schema, whose schema has no 
real field names (mimicking tables written before HIVE-4243), to trigger this issue. 
You can see that the input bytes for table orc_table_old_schema are 14 bytes more 
than for table orc_table_new_schema.

The reason is that Spark 3.0 by default uses the native reader rather than the Hive 
reader to read ORC tables, and the native reader reads all columns for an old Hive 
schema table but only the pruned columns (in this example, only the column 'value') 
for a new Hive schema table.

The following flags enable the native reader: set 
spark.sql.hive.convertMetastoreOrc=true; set spark.sql.orc.impl=native; both are 
Spark 3.0's default values.

Then I dug into the Spark code and found this: 
https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L149

It looks like reading all columns for an old Hive schema (which has no field names) 
is by design in Spark 3.0.

In my company's data, some tables use the old Hive schema while others use the new 
one. The performance of queries reading old Hive tables decreases a lot when I 
enable the native reader in Spark 3.0. This is the main blocker for us to switch 
from the Hive reader to the native reader in Spark 3.0.

My questions are:

#1 Do you have a plan to support column pruning for the old Hive schema in the 
native ORC reader?

#2 If the answer to question #1 is no, is there some potential issue if the code is 
fixed to support column pruning?

 

  was:
I asked this question [before|https://issues.apache.org/jira/browse/SPARK-35190], 
but perhaps I did not state the question clearly, so I did not get an answer. This 
time I will show an example to illustrate it.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val spark = SparkSession.builder().appName("OrcTest").getOrCreate()

var inputBytes = 0L
spark.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    inputBytes += metrics.inputMetrics.bytesRead
  }
})

spark.sql("create table orc_table_old_schema (_col0 int, _col1 string, _col2 double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_old_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select _col2 from orc_table_old_schema").show()
print("input bytes for old schema table: " + inputBytes) // prints 1655

spark.sql("create table orc_table_new_schema (id int, name string, value double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_new_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select value from orc_table_new_schema").show()
print("input bytes for new schema table: " + inputBytes) // prints 1641
{code}
In this example, I create the ORC table orc_table_old_schema, whose schema has no 
real field names (mimicking tables written before HIVE-4243), to trigger this issue. 
You can see that the input bytes for table orc_table_old_schema are 14 bytes more 
than for table orc_table_new_schema. The reason is that Spark 3.0 reads all columns 
for an old Hive schema table but only the pruned columns for a new Hive schema 
table. (This behavior is under the flags set spark.sql.hive.convertMetastoreOrc=true 
and set spark.sql.orc.impl=native; both are Spark 3.0's default values.)

[jira] [Created] (SPARK-35274) old hive table's all columns are read when column pruning applies in spark3.0

2021-04-29 Thread xiaoli (Jira)
xiaoli created SPARK-35274:
--

 Summary: old hive table's all columns are read when column pruning 
applies in spark3.0
 Key: SPARK-35274
 URL: https://issues.apache.org/jira/browse/SPARK-35274
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 3.0.0
 Environment: spark3.0
Reporter: xiaoli


I asked this question [before|https://issues.apache.org/jira/browse/SPARK-35190], 
but perhaps I did not state the question clearly, so I did not get an answer. This 
time I will show an example to illustrate it.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val spark = SparkSession.builder().appName("OrcTest").getOrCreate()

var inputBytes = 0L
spark.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    inputBytes += metrics.inputMetrics.bytesRead
  }
})

spark.sql("create table orc_table_old_schema (_col0 int, _col1 string, _col2 double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_old_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select _col2 from orc_table_old_schema").show()
print("input bytes for old schema table: " + inputBytes) // prints 1655

spark.sql("create table orc_table_new_schema (id int, name string, value double) STORED AS ORC")
spark.sql("insert overwrite table orc_table_new_schema select 1, 'name1', 1000.05")
inputBytes = 0L
spark.sql("select value from orc_table_new_schema").show()
print("input bytes for new schema table: " + inputBytes) // prints 1641
{code}
 

 

In this example, I create the ORC table orc_table_old_schema, whose schema has no 
real field names (mimicking tables written before HIVE-4243), to trigger this issue. 
You can see that the input bytes for table orc_table_old_schema are 14 bytes more 
than for table orc_table_new_schema. The reason is that Spark 3.0 reads all columns 
for an old Hive schema table but only the pruned columns for a new Hive schema 
table. (This behavior is under the flags set spark.sql.hive.convertMetastoreOrc=true 
and set spark.sql.orc.impl=native; both are Spark 3.0's default values.)

Then I dug into the Spark code and found this: 
https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L149

It looks like reading all columns for an old Hive schema (which has no field names) 
is by design in Spark 3.0.

My questions are:

#1 Do you have a plan to support column pruning for the old Hive schema when reading 
ORC tables?

#2 If the answer to question #1 is no, is there some potential issue if the code is 
fixed to support column pruning?

 






[jira] [Commented] (SPARK-35190) all columns are read even if column pruning applies when spark3.0 read table written by spark2.2

2021-04-24 Thread xiaoli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331418#comment-17331418
 ] 

xiaoli commented on SPARK-35190:


Hi [~hyukjin.kwon], thanks for checking my question.

I read [SPARK-34897|https://issues.apache.org/jira/browse/SPARK-34897]; it is about 
an issue reading struct-type data, so it does not solve my problem.

Besides, [SPARK-35010|https://issues.apache.org/jira/browse/SPARK-35010] mentions 
that "column pruning in hive generated orc files is not supported". Is that true? Do 
you have a plan in the near future to make the Spark native reader support column 
pruning for Hive-generated ORC files, whose column schema is positional (by column 
index)?
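
(Aside, not from the commenter: a quick way to tell whether a given ORC table is in this positional, index-only layout is to look at the physical file schema directly; pre-HIVE-4243 / Hive-written files expose only _colN names, which is exactly what OrcUtils.requestedColumnIds keys on. A minimal sketch; the file path is hypothetical.)
{code:java}
// Sketch: read one data file of the table directly so Spark infers the schema
// from the ORC footer rather than from the metastore.
spark.read.orc("/warehouse/orc_table_old_schema/part-00000").printSchema()
// An old-layout file prints only positional names, e.g.:
// root
//  |-- _col0: integer (nullable = true)
//  |-- _col1: string (nullable = true)
//  |-- _col2: double (nullable = true)
{code}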

> all columns are read even if column pruning applies when spark3.0 read table 
> written by spark2.2
> 
>
> Key: SPARK-35190
> URL: https://issues.apache.org/jira/browse/SPARK-35190
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: spark3.0
> set spark.sql.hive.convertMetastoreOrc=true (default value in spark3.0)
> set spark.sql.orc.impl=native (default value in spark3.0)
>Reporter: xiaoli
>Priority: Major
>
> Before I describe this issue, some background: the current Spark version we use 
> is 2.2, and we plan to migrate to Spark 3.0 in the near future. Before migrating, 
> we tested some queries in both Spark 2.2 and Spark 3.0 to check for potential 
> issues. The data source tables of these queries are in ORC format, written by 
> Spark 2.2.
>
> I find that even if column pruning is applied, Spark 3.0's native reader will 
> read all columns.
>
> Then I did some remote debugging. OrcUtils.scala's requestedColumnIds method 
> checks whether the field names start with "_col". In my case they do ("_col1", 
> "_col2", ...), so pruneCols is not done. The code is below:
>
> if (orcFieldNames.forall(_.startsWith("_col"))) {
>   // This is a ORC file written by Hive, no field names in the physical schema, assume the
>   // physical schema maps to the data scheme by index.
>   assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
>     s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
>     "no idea which columns were dropped, fail to read.")
>   // for ORC file written by Hive, no field names
>   // in the physical schema, there is a need to send the
>   // entire dataSchema instead of required schema.
>   // So pruneCols is not done in this case
>   Some(requiredSchema.fieldNames.map { name =>
>     val index = dataSchema.fieldIndex(name)
>     if (index < orcFieldNames.length) {
>       index
>     } else {
>       -1
>     }
>   }, false)
>
> Although the code comment explains the reason, I still do not understand it. This 
> issue only happens in this case: Spark 3.0 uses the native reader to read a table 
> written by Spark 2.2.
>
> In other cases there is no such issue. I did another two tests:
> Test 1: use Spark 3.0's Hive reader (running with 
> spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read 
> the same table; it only reads the pruned columns.
> Test 2: use Spark 3.0 to write a table, then use Spark 3.0's native reader to 
> read this new table; it only reads the pruned columns.
>
> This issue is a blocker for us to adopt the native reader in Spark 3.0. Does 
> anyone know the underlying reason, or can anyone suggest a solution?






[jira] [Created] (SPARK-35191) all columns are read even if column pruning applies when spark3.0 read table written by spark2.2

2021-04-22 Thread xiaoli (Jira)
xiaoli created SPARK-35191:
--

 Summary: all columns are read even if column pruning applies when 
spark3.0 read table written by spark2.2
 Key: SPARK-35191
 URL: https://issues.apache.org/jira/browse/SPARK-35191
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 3.0.0
 Environment: spark3.0

spark.sql.hive.convertMetastoreOrc=true (default value in spark3.0)

spark.sql.orc.impl=native (default value in spark3.0)
Reporter: xiaoli


Before I describe this issue, some background: the current Spark version we use is 
2.2, and we plan to migrate to Spark 3.0 in the near future. Before migrating, we 
tested some queries in both Spark 2.2 and Spark 3.0 to check for potential issues. 
The data source tables of these queries are in ORC format, written by Spark 2.2.

I find that even if column pruning is applied, Spark 3.0's native reader will read 
all columns.

Then I did some remote debugging. OrcUtils.scala's requestedColumnIds method checks 
whether the field names start with "_col". In my case they do ("_col1", "_col2", 
...), so pruneCols is not done. The code is below:

 

{code:java}
if (orcFieldNames.forall(_.startsWith("_col"))) {
  // This is a ORC file written by Hive, no field names in the physical schema, assume the
  // physical schema maps to the data scheme by index.
  assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
    s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
    "no idea which columns were dropped, fail to read.")
  // for ORC file written by Hive, no field names
  // in the physical schema, there is a need to send the
  // entire dataSchema instead of required schema.
  // So pruneCols is not done in this case
  Some(requiredSchema.fieldNames.map { name =>
    val index = dataSchema.fieldIndex(name)
    if (index < orcFieldNames.length) {
      index
    } else {
      -1
    }
  }, false)
{code}
 
Although the code comment explains the reason, I still do not understand it. This 
issue only happens in this case: Spark 3.0 uses the native reader to read a table 
written by Spark 2.2.

In other cases there is no such issue. I did another two tests:

Test 1: use Spark 3.0's Hive reader (running with 
spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read the 
same table; it only reads the pruned columns.

Test 2: use Spark 3.0 to write a table, then use Spark 3.0's native reader to read 
this new table; it only reads the pruned columns.

This issue is a blocker for us to adopt the native reader in Spark 3.0. Does anyone 
know the underlying reason, or can anyone suggest a solution?






[jira] [Created] (SPARK-35190) all columns are read even if column pruning applies when spark3.0 read table written by spark2.2

2021-04-22 Thread xiaoli (Jira)
xiaoli created SPARK-35190:
--

 Summary: all columns are read even if column pruning applies when 
spark3.0 read table written by spark2.2
 Key: SPARK-35190
 URL: https://issues.apache.org/jira/browse/SPARK-35190
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 3.0.0
 Environment: spark3.0

set spark.sql.hive.convertMetastoreOrc=true (default value in spark3.0)

set spark.sql.orc.impl=native (default value in spark3.0)
Reporter: xiaoli


Before I describe this issue, some background: the current Spark version we use is 
2.2, and we plan to migrate to Spark 3.0 in the near future. Before migrating, we 
tested some queries in both Spark 2.2 and Spark 3.0 to check for potential issues. 
The data source tables of these queries are in ORC format, written by Spark 2.2.

I find that even if column pruning is applied, Spark 3.0's native reader will read 
all columns.

Then I did some remote debugging. OrcUtils.scala's requestedColumnIds method checks 
whether the field names start with "_col". In my case they do ("_col1", "_col2", 
...), so pruneCols is not done. The code is below:

 

{code:java}
if (orcFieldNames.forall(_.startsWith("_col"))) {
  // This is a ORC file written by Hive, no field names in the physical schema, assume the
  // physical schema maps to the data scheme by index.
  assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
    s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
    "no idea which columns were dropped, fail to read.")
  // for ORC file written by Hive, no field names
  // in the physical schema, there is a need to send the
  // entire dataSchema instead of required schema.
  // So pruneCols is not done in this case
  Some(requiredSchema.fieldNames.map { name =>
    val index = dataSchema.fieldIndex(name)
    if (index < orcFieldNames.length) {
      index
    } else {
      -1
    }
  }, false)
{code}
 
Although the code comment explains the reason, I still do not understand it. This 
issue only happens in this case: Spark 3.0 uses the native reader to read a table 
written by Spark 2.2.

In other cases there is no such issue. I did another two tests:

Test 1: use Spark 3.0's Hive reader (running with 
spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read the 
same table; it only reads the pruned columns.

Test 2: use Spark 3.0 to write a table, then use Spark 3.0's native reader to read 
this new table; it only reads the pruned columns.

This issue is a blocker for us to adopt the native reader in Spark 3.0. Does anyone 
know the underlying reason, or can anyone suggest a solution?
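
(Aside, not from the reporter: stripped of Spark internals, the quoted requestedColumnIds branch boils down to the mapping below. This is a simplified illustration of its behavior, not the real method; types and names are reduced for readability.)
{code:java}
// Simplified illustration: when every physical ORC field name is positional
// (_colN), each *required* column is mapped to its index in the full table
// schema, but the reader is still handed the entire dataSchema, so no physical
// pruning happens ("pruneCols is not done").
def positionalColumnIds(orcFieldNames: Seq[String],
                        dataSchema: Seq[String],
                        requiredSchema: Seq[String]): Option[Seq[Int]] = {
  if (orcFieldNames.forall(_.startsWith("_col"))) {
    Some(requiredSchema.map { name =>
      val index = dataSchema.indexOf(name)
      if (index < orcFieldNames.length) index else -1
    })
  } else {
    None // real field names present, so the normal name-based pruning path applies
  }
}

// e.g. positionalColumnIds(Seq("_col0", "_col1", "_col2"),
//                          Seq("id", "name", "value"),
//                          Seq("value"))   // returns Some(Seq(2))
{code}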



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org