[jira] [Created] (SPARK-41231) Built-in SQL Function Improvement

2022-11-22 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41231:
-

 Summary: Built-in SQL Function Improvement
 Key: SPARK-41231
 URL: https://issues.apache.org/jira/browse/SPARK-41231
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng








[jira] [Updated] (SPARK-41229) When using `db_name.temp_table_name`, an exception will be thrown

2022-11-22 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41229:

Description: 
SQL1:
```
with table_hive1 as (select * from db1.table_hive)
select * from db1.table_hive1;
```

It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
db1.table_hive1;`, but it works in Spark 2.4.3.

SQL2:
```
with table_hive1 as (select * from db1.table_hive)
select * from table_hive1;
```

It works well.
I'm a little confused: is this syntax with a database name not supported?
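For reference, a minimal PySpark reproduction of the two statements above (a 
sketch; it assumes a Hive-enabled session and that `db1.table_hive` already 
exists):

```
from pyspark.sql import SparkSession

# Assumes a Hive metastore is reachable and db1.table_hive exists.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# SQL2: referencing the CTE without a database qualifier works.
spark.sql("""
    with table_hive1 as (select * from db1.table_hive)
    select * from table_hive1
""").show()

# SQL1: qualifying the CTE name with a database raises
# AnalysisException ("Table or view not found: db1.table_hive1")
# on Spark 3.2.0, while Spark 2.4.3 resolved it.
spark.sql("""
    with table_hive1 as (select * from db1.table_hive)
    select * from db1.table_hive1
""").show()
```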



  was:
SQL1:
```
with table_hive1 as (select * from db1.table_hive)
select * from db1.table_hive1;
```

It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
bigdata_qa.zjx_hive1;`, but it works in Spark 2.4.3.

SQL2:
```
with table_hive1 as (select * from db1.table_hive)
select * from table_hive1;
```

It works well.
I'm a little confused: is this syntax with a database name not supported?




> When using `db_name.temp_table_name`, an exception will be thrown
> ---
>
> Key: SPARK-41229
> URL: https://issues.apache.org/jira/browse/SPARK-41229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive-ms 2.3.9
>Reporter: jingxiong zhong
>Priority: Blocker
>
> SQL1:
> ```
> with table_hive1 as (select * from db1.table_hive)
> select * from db1.table_hive1;
> ```
> It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
> db1.table_hive1;`, but it works in Spark 2.4.3.
> SQL2:
> ```
> with table_hive1 as (select * from db1.table_hive)
> select * from table_hive1;
> ```
> It works well.
> I'm a little confused: is this syntax with a database name not supported?





[jira] [Commented] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637600#comment-17637600
 ] 

Apache Spark commented on SPARK-41228:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/38769

> Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
> ---
>
> Key: SPARK-41228
> URL: https://issues.apache.org/jira/browse/SPARK-41228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The error class name is tricky, so we should fix the name properly.





[jira] [Assigned] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41228:


Assignee: (was: Apache Spark)

> Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
> ---
>
> Key: SPARK-41228
> URL: https://issues.apache.org/jira/browse/SPARK-41228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The error class name is tricky, so we should fix the name properly.





[jira] [Commented] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637599#comment-17637599
 ] 

Apache Spark commented on SPARK-41228:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/38769

> Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
> ---
>
> Key: SPARK-41228
> URL: https://issues.apache.org/jira/browse/SPARK-41228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The error class name is tricky, so we should fix the name properly.





[jira] [Assigned] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41228:


Assignee: Apache Spark

> Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
> ---
>
> Key: SPARK-41228
> URL: https://issues.apache.org/jira/browse/SPARK-41228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> The error class name is tricky, so we should fix the name properly.





[jira] [Assigned] (SPARK-41230) Remove `str` from Aggregate expression type

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41230:


Assignee: Apache Spark

> Remove `str` from Aggregate expression type
> ---
>
> Key: SPARK-41230
> URL: https://issues.apache.org/jira/browse/SPARK-41230
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>






[jira] [Commented] (SPARK-41230) Remove `str` from Aggregate expression type

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637597#comment-17637597
 ] 

Apache Spark commented on SPARK-41230:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38768

> Remove `str` from Aggregate expression type
> ---
>
> Key: SPARK-41230
> URL: https://issues.apache.org/jira/browse/SPARK-41230
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>






[jira] [Assigned] (SPARK-41230) Remove `str` from Aggregate expression type

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41230:


Assignee: (was: Apache Spark)

> Remove `str` from Aggregate expression type
> ---
>
> Key: SPARK-41230
> URL: https://issues.apache.org/jira/browse/SPARK-41230
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>






[jira] [Commented] (SPARK-41230) Remove `str` from Aggregate expression type

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637596#comment-17637596
 ] 

Apache Spark commented on SPARK-41230:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38768

> Remove `str` from Aggregate expression type
> ---
>
> Key: SPARK-41230
> URL: https://issues.apache.org/jira/browse/SPARK-41230
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>






[jira] [Commented] (SPARK-41229) When using `db_name.temp_table_name`, an exception will be thrown

2022-11-22 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637595#comment-17637595
 ] 

jingxiong zhong commented on SPARK-41229:
-

[~cloud_fan] Could you help me with this?

> When using `db_name.temp_table_name`, an exception will be thrown
> ---
>
> Key: SPARK-41229
> URL: https://issues.apache.org/jira/browse/SPARK-41229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive-ms 2.3.9
>Reporter: jingxiong zhong
>Priority: Blocker
>
> SQL1:
> ```
> with table_hive1 as (select * from db1.table_hive)
> select * from db1.table_hive1;
> ```
> It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
> bigdata_qa.zjx_hive1;`, but it works in Spark 2.4.3.
> SQL2:
> ```
> with table_hive1 as (select * from db1.table_hive)
> select * from table_hive1;
> ```
> It works well.
> I'm a little confused: is this syntax with a database name not supported?





[jira] [Updated] (SPARK-41230) Remove `str` from Aggregate expression type

2022-11-22 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-41230:
-
Summary: Remove `str` from Aggregate expression type  (was: Remove `str` 
from Aggregate)

> Remove `str` from Aggregate expression type
> ---
>
> Key: SPARK-41230
> URL: https://issues.apache.org/jira/browse/SPARK-41230
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>






[jira] [Updated] (SPARK-41230) Remove `str` from Aggregate

2022-11-22 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-41230:
-
Summary: Remove `str` from Aggregate  (was: Remove `str` from Class 
Aggregate in Plan.py)

> Remove `str` from Aggregate
> ---
>
> Key: SPARK-41230
> URL: https://issues.apache.org/jira/browse/SPARK-41230
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>






[jira] [Commented] (SPARK-41227) Implement `DataFrame.crossJoin`

2022-11-22 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637593#comment-17637593
 ] 

Rui Wang commented on SPARK-41227:
--

+1 to having this, to match the existing PySpark API.
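For context, a quick example of the existing PySpark `DataFrame.crossJoin` API 
that the Connect client would mirror (the column names are illustrative):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v"])
df2 = spark.createDataFrame([("x",), ("y",)], ["w"])

# crossJoin takes no join condition and returns the Cartesian
# product: 2 rows x 2 rows = 4 rows.
df1.crossJoin(df2).show()
```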

> Implement `DataFrame.crossJoin`
> ---
>
> Key: SPARK-41227
> URL: https://issues.apache.org/jira/browse/SPARK-41227
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>






[jira] [Commented] (SPARK-41183) Add an extension API to do plan normalization for caching

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637594#comment-17637594
 ] 

Apache Spark commented on SPARK-41183:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/38767

> Add an extension API to do plan normalization for caching
> -
>
> Key: SPARK-41183
> URL: https://issues.apache.org/jira/browse/SPARK-41183
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>






[jira] [Created] (SPARK-41230) Remove `str` from Class Aggregate in Plan.py

2022-11-22 Thread Rui Wang (Jira)
Rui Wang created SPARK-41230:


 Summary: Remove `str` from Class Aggregate in Plan.py
 Key: SPARK-41230
 URL: https://issues.apache.org/jira/browse/SPARK-41230
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang








[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637592#comment-17637592
 ] 

Apache Spark commented on SPARK-35531:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/38765

> Can not insert into hive bucket table if create table with upper case schema
> 
>
> Key: SPARK-35531
> URL: https://issues.apache.org/jira/browse/SPARK-35531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.1, 3.2.0
>Reporter: Hongyi Zhang
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0, 3.1.4
>
>
>  
>  
> create table TEST1(
>  V1 BIGINT,
>  S1 INT)
>  partitioned by (PK BIGINT)
>  clustered by (V1)
>  sorted by (S1)
>  into 200 buckets
>  STORED AS PARQUET;
>  
> insert into test1
>  select
>  * from values(1,1,1);
>  
>  
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]





[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637591#comment-17637591
 ] 

Apache Spark commented on SPARK-35531:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/38765

> Can not insert into hive bucket table if create table with upper case schema
> 
>
> Key: SPARK-35531
> URL: https://issues.apache.org/jira/browse/SPARK-35531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.1, 3.2.0
>Reporter: Hongyi Zhang
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0, 3.1.4
>
>
>  
>  
> create table TEST1(
>  V1 BIGINT,
>  S1 INT)
>  partitioned by (PK BIGINT)
>  clustered by (V1)
>  sorted by (S1)
>  into 200 buckets
>  STORED AS PARQUET;
>  
> insert into test1
>  select
>  * from values(1,1,1);
>  
>  
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]





[jira] [Updated] (SPARK-41229) When using `db_name.temp_table_name`, an exception will be thrown

2022-11-22 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41229:

Description: 
SQL1:
```
with table_hive1 as (select * from db1.table_hive)
select * from db1.table_hive1;
```

It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
bigdata_qa.zjx_hive1;`, but it works in Spark 2.4.3.

SQL2:
```
with table_hive1 as (select * from db1.table_hive)
select * from table_hive1;
```

It works well.
I'm a little confused: is this syntax with a database name not supported?



  was:
```
with table_hive1 as (select * from db1.table_hive)
select * from db1.table_hive1;
```
It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
bigdata_qa.zjx_hive1;`, but it works in Spark 2.4.3.
```
with table_hive1 as (select * from db1.table_hive)
select * from table_hive1;
```
It works well.
I'm a little confused: is this syntax with a database name not supported?




> When using `db_name.temp_table_name`, an exception will be thrown
> ---
>
> Key: SPARK-41229
> URL: https://issues.apache.org/jira/browse/SPARK-41229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive-ms 2.3.9
>Reporter: jingxiong zhong
>Priority: Blocker
>
> SQL1:
> ```
> with table_hive1 as (select * from db1.table_hive)
> select * from db1.table_hive1;
> ```
> It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
> bigdata_qa.zjx_hive1;`, but it works in Spark 2.4.3.
> SQL2:
> ```
> with table_hive1 as (select * from db1.table_hive)
> select * from table_hive1;
> ```
> It works well.
> I'm a little confused: is this syntax with a database name not supported?





[jira] [Created] (SPARK-41229) When using `db_name.temp_table_name`, an exception will be thrown

2022-11-22 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-41229:
---

 Summary: When using `db_name.temp_table_name`, an exception will be thrown
 Key: SPARK-41229
 URL: https://issues.apache.org/jira/browse/SPARK-41229
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
 Environment: spark3.2.0
hadoop2.7.3
hive-ms 2.3.9
Reporter: jingxiong zhong


```
with table_hive1 as (select * from db1.table_hive)
select * from db1.table_hive1;
```
It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
bigdata_qa.zjx_hive1;`, but it works in Spark 2.4.3.
```
with table_hive1 as (select * from db1.table_hive)
select * from table_hive1;
```
It works well.
I'm a little confused: is this syntax with a database name not supported?







[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread Raza Jafri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637587#comment-17637587
 ] 

Raza Jafri commented on SPARK-41219:


Thank you for looking into this issue. I have also noticed that 
`IntegralDivide` has the output dataType = LongType, so why is it also 
overriding `resultDecimalType`?

As far as I know it will never be called; it is only called from `dataType` in 
`BinaryArithmetic`.

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4's integral divide.
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>  
> While in Spark 3.3.0:
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
>  





[jira] [Resolved] (SPARK-40948) Introduce new error class: PATH_NOT_FOUND

2022-11-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40948.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38575
[https://github.com/apache/spark/pull/38575]

> Introduce new error class: PATH_NOT_FOUND
> -
>
> Key: SPARK-40948
> URL: https://issues.apache.org/jira/browse/SPARK-40948
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Recently we added many error classes named LEGACY_ERROR_TEMP_.
> We should update them to use proper error class names.





[jira] [Assigned] (SPARK-40948) Introduce new error class: PATH_NOT_FOUND

2022-11-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40948:


Assignee: Haejoon Lee

> Introduce new error class: PATH_NOT_FOUND
> -
>
> Key: SPARK-40948
> URL: https://issues.apache.org/jira/browse/SPARK-40948
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Recently we added many error classes named LEGACY_ERROR_TEMP_.
> We should update them to use proper error class names.





[jira] [Commented] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION

2022-11-22 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637567#comment-17637567
 ] 

Haejoon Lee commented on SPARK-41228:
-

I'm working on this

> Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
> ---
>
> Key: SPARK-41228
> URL: https://issues.apache.org/jira/browse/SPARK-41228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The error class name is tricky, so we should fix the name properly.





[jira] [Created] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION

2022-11-22 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-41228:
---

 Summary: Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to 
MISSING_AGGREGATION
 Key: SPARK-41228
 URL: https://issues.apache.org/jira/browse/SPARK-41228
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Haejoon Lee


The error class name is tricky, so we should fix the name properly.





[jira] [Commented] (SPARK-41206) Assign a name to the error class _LEGACY_ERROR_TEMP_1233

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637557#comment-17637557
 ] 

Apache Spark commented on SPARK-41206:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38764

> Assign a name to the error class _LEGACY_ERROR_TEMP_1233
> 
>
> Key: SPARK-41206
> URL: https://issues.apache.org/jira/browse/SPARK-41206
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Assign a proper name to the legacy error class _LEGACY_ERROR_TEMP_1233 and 
> make it visible to users.





[jira] [Created] (SPARK-41227) Implement `DataFrame.crossJoin`

2022-11-22 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-41227:


 Summary: Implement `DataFrame.crossJoin`
 Key: SPARK-41227
 URL: https://issues.apache.org/jira/browse/SPARK-41227
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng








[jira] [Commented] (SPARK-41201) Implement `DataFrame.SelectExpr` in Python client

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637521#comment-17637521
 ] 

Apache Spark commented on SPARK-41201:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38763

> Implement `DataFrame.SelectExpr` in Python client
> -
>
> Key: SPARK-41201
> URL: https://issues.apache.org/jira/browse/SPARK-41201
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>






[jira] [Resolved] (SPARK-41201) Implement `DataFrame.SelectExpr` in Python client

2022-11-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41201.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38723
[https://github.com/apache/spark/pull/38723]

> Implement `DataFrame.SelectExpr` in Python client
> -
>
> Key: SPARK-41201
> URL: https://issues.apache.org/jira/browse/SPARK-41201
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>






[jira] [Assigned] (SPARK-41201) Implement `DataFrame.SelectExpr` in Python client

2022-11-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41201:


Assignee: Rui Wang

> Implement `DataFrame.SelectExpr` in Python client
> -
>
> Key: SPARK-41201
> URL: https://issues.apache.org/jira/browse/SPARK-41201
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>






[jira] [Updated] (SPARK-41226) Refactor Spark types by introducing physical types

2022-11-22 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated SPARK-41226:

Description: 
I am creating this one for Desmond Cheong since he can't sign up for an account 
because of https://infra.apache.org/blog/jira-public-signup-disabled.html
 
His description for this improvement:
The Spark type system currently supports multiple data types with the same 
physical representation in memory. For example, {{DateType}} and 
{{YearMonthIntervalType}} are both implemented using {{IntegerType}}. 
Because of this, operations on data types often involve case matching where 
multiple data types map to the same effects. To simplify this case matching 
logic, we can introduce the notion of logical and physical data types, where 
multiple logical data types can be implemented with the same physical data 
type, and then perform case matching on physical data types. Some areas that 
can utilize this logical/physical type separation are:
 * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
 * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
 * {{getAccessor}} in {{InternalRow.scala}}
 * {{externalDataTypeFor}} in {{RowEncoder.scala}}
 * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
 * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
 * {{doValidate}} in {{literals.scala}}

  was:
I am creating this one for Desmond Cheong since he can't sign up for an account 
because of https://infra.apache.org/blog/jira-public-signup-disabled.html
 
His description for this improvement:
The Spark type system currently supports multiple data types with the same 
physical representation in memory. For example, {{DateType}} and 
{{YearMonthIntervalType}} are both implemented using {{IntegerType}}. 
Because of this, operations on data types often involve case matching where 
multiple data types map to the same effects. To simplify this case matching 
logic, we can introduce the notion of logical and physical data types, where 
multiple logical data types can be implemented with the same physical data 
type, and then perform case matching on physical data types. Some areas that 
can utilize this logical/physical type separation are: * 
{{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
 * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
 * {{getAccessor}} in {{InternalRow.scala}}
 * {{externalDataTypeFor}} in {{RowEncoder.scala}}
 * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
 * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
 * {{doValidate}} in {{literals.scala}}


> Refactor Spark types by introducing physical types
> --
>
> Key: SPARK-41226
> URL: https://issues.apache.org/jira/browse/SPARK-41226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> I am creating this one for Desmond Cheong since he can't sign up for an 
> account because of 
> https://infra.apache.org/blog/jira-public-signup-disabled.html
>  
> His description for this improvement:
> The Spark type system currently supports multiple data types with the same 
> physical representation in memory. For example, {{DateType}} and 
> {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. 
> Because of this, operations on data types often involve case matching where 
> multiple data types map to the same effects. To simplify this case matching 
> logic, we can introduce the notion of logical and physical data types, where 
> multiple logical data types can be implemented with the same physical data 
> type, and then perform case matching on physical data types. Some areas that 
> can utilize this logical/physical type separation are:
>  * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
>  * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
>  * {{getAccessor}} in {{InternalRow.scala}}
>  * {{externalDataTypeFor}} in {{RowEncoder.scala}}
>  * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
>  * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
>  * {{doValidate}} in {{literals.scala}}





[jira] [Updated] (SPARK-39591) SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-22 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-39591:
-
Shepherd: Jungtaek Lim

> SPIP: Asynchronous Offset Management in Structured Streaming
> 
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>  Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage. At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage. For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is sensitive, we can allow users 
> to configure offset commits to be written asynchronously; this commit 
> operation then does not contribute to the batch duration, effectively 
> lowering the overall latency of the pipeline.
>  
> SPIP Doc: 
>  
> https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing





[jira] [Assigned] (SPARK-41226) Refactor Spark types by introducing physical types

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41226:


Assignee: Apache Spark

> Refactor Spark types by introducing physical types
> --
>
> Key: SPARK-41226
> URL: https://issues.apache.org/jira/browse/SPARK-41226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> I am creating this one for Desmond Cheong since he can't sign up for an 
> account because of 
> https://infra.apache.org/blog/jira-public-signup-disabled.html
>  
> His description for this improvement:
> The Spark type system currently supports multiple data types with the same 
> physical representation in memory. For example, {{DateType}} and 
> {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. 
> Because of this, operations on data types often involve case matching where 
> multiple data types map to the same effects. To simplify this case matching 
> logic, we can introduce the notion of logical and physical data types, where 
> multiple logical data types can be implemented with the same physical data 
> type, and then perform case matching on physical data types. Some areas that 
> can utilize this logical/physical type separation are:
>  * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
>  * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
>  * {{getAccessor}} in {{InternalRow.scala}}
>  * {{externalDataTypeFor}} in {{RowEncoder.scala}}
>  * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
>  * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
>  * {{doValidate}} in {{literals.scala}}





[jira] [Assigned] (SPARK-41226) Refactor Spark types by introducing physical types

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41226:


Assignee: (was: Apache Spark)

> Refactor Spark types by introducing physical types
> --
>
> Key: SPARK-41226
> URL: https://issues.apache.org/jira/browse/SPARK-41226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> I am creating this one for Desmond Cheong since he can't sign up for an 
> account because of 
> https://infra.apache.org/blog/jira-public-signup-disabled.html
>  
> His description for this improvement:
> The Spark type system currently supports multiple data types with the same 
> physical representation in memory. For example, {{DateType}} and 
> {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. 
> Because of this, operations on data types often involve case matching where 
> multiple data types map to the same effects. To simplify this case matching 
> logic, we can introduce the notion of logical and physical data types, where 
> multiple logical data types can be implemented with the same physical data 
> type, and then perform case matching on physical data types. Some areas that 
> can utilize this logical/physical type separation are:
>  * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
>  * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
>  * {{getAccessor}} in {{InternalRow.scala}}
>  * {{externalDataTypeFor}} in {{RowEncoder.scala}}
>  * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
>  * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
>  * {{doValidate}} in {{literals.scala}}





[jira] [Commented] (SPARK-41226) Refactor Spark types by introducing physical types

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637464#comment-17637464
 ] 

Apache Spark commented on SPARK-41226:
--

User 'desmondcheongzx' has created a pull request for this issue:
https://github.com/apache/spark/pull/38750

> Refactor Spark types by introducing physical types
> --
>
> Key: SPARK-41226
> URL: https://issues.apache.org/jira/browse/SPARK-41226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> I am creating this one for Desmond Cheong since he can't sign up for an 
> account because of 
> https://infra.apache.org/blog/jira-public-signup-disabled.html
>  
> His description for this improvement:
> The Spark type system currently supports multiple data types with the same 
> physical representation in memory. For example, {{DateType}} and 
> {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. 
> Because of this, operations on data types often involve case matching where 
> multiple data types map to the same effects. To simplify this case matching 
> logic, we can introduce the notion of logical and physical data types, where 
> multiple logical data types can be implemented with the same physical data 
> type, and then perform case matching on physical data types. Some areas that 
> can utilize this logical/physical type separation are:
>  * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
>  * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
>  * {{getAccessor}} in {{InternalRow.scala}}
>  * {{externalDataTypeFor}} in {{RowEncoder.scala}}
>  * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
>  * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
>  * {{doValidate}} in {{literals.scala}}





[jira] [Commented] (SPARK-41226) Refactor Spark types by introducing physical types

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637463#comment-17637463
 ] 

Apache Spark commented on SPARK-41226:
--

User 'desmondcheongzx' has created a pull request for this issue:
https://github.com/apache/spark/pull/38750

> Refactor Spark types by introducing physical types
> --
>
> Key: SPARK-41226
> URL: https://issues.apache.org/jira/browse/SPARK-41226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> I am creating this one for Desmond Cheong since he can't sign up for an 
> account because of 
> https://infra.apache.org/blog/jira-public-signup-disabled.html
>  
> His description for this improvement:
> The Spark type system currently supports multiple data types with the same 
> physical representation in memory. For example, {{DateType}} and 
> {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. 
> Because of this, operations on data types often involve case matching where 
> multiple data types map to the same effects. To simplify this case matching 
> logic, we can introduce the notion of logical and physical data types, where 
> multiple logical data types can be implemented with the same physical data 
> type, and then perform case matching on physical data types. Some areas that 
> can utilize this logical/physical type separation are:
>  * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
>  * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
>  * {{getAccessor}} in {{InternalRow.scala}}
>  * {{externalDataTypeFor}} in {{RowEncoder.scala}}
>  * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
>  * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
>  * {{doValidate}} in {{literals.scala}}





[jira] [Created] (SPARK-41226) Refactor Spark types by introducing physical types

2022-11-22 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41226:
--

 Summary: Refactor Spark types by introducing physical types
 Key: SPARK-41226
 URL: https://issues.apache.org/jira/browse/SPARK-41226
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang


I am creating this one for Desmond Cheong since he can't sign up for an account 
because of https://infra.apache.org/blog/jira-public-signup-disabled.html
 
His description for this improvement:
The Spark type system currently supports multiple data types with the same 
physical representation in memory. For example, {{DateType}} and 
{{YearMonthIntervalType}} are both implemented using {{IntegerType}}. 
Because of this, operations on data types often involve case matching where 
multiple data types map to the same effects. To simplify this case matching 
logic, we can introduce the notion of logical and physical data types, where 
multiple logical data types can be implemented with the same physical data 
type, and then perform case matching on physical data types (a sketch of the 
idea follows the list below). Some areas that can utilize this 
logical/physical type separation are:
 * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
 * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
 * {{getAccessor}} in {{InternalRow.scala}}
 * {{externalDataTypeFor}} in {{RowEncoder.scala}}
 * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
 * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
 * {{doValidate}} in {{literals.scala}}
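The refactoring itself targets Catalyst (Scala); as a language-neutral sketch 
of the idea, here is a small Python illustration using PySpark's type objects. 
The mapping and accessor names below are assumptions for illustration only; 
the real mapping lives in Catalyst and covers many more types.

```
from pyspark.sql.types import (
    DataType, DateType, DayTimeIntervalType, IntegerType,
    LongType, TimestampType, YearMonthIntervalType,
)

# Hypothetical logical -> physical mapping (illustration only).
PHYSICAL_TYPE = {
    DateType: IntegerType,               # days stored as int
    YearMonthIntervalType: IntegerType,  # months stored as int
    TimestampType: LongType,             # microseconds stored as long
    DayTimeIntervalType: LongType,       # microseconds stored as long
}

def physical_type(dt: DataType) -> type:
    """Collapse a logical type to its physical representation."""
    return PHYSICAL_TYPE.get(type(dt), type(dt))

def accessor_name(dt: DataType) -> str:
    # One branch per physical type replaces one branch per logical type.
    pt = physical_type(dt)
    if pt is IntegerType:
        return "getInt"
    if pt is LongType:
        return "getLong"
    raise NotImplementedError(type(dt).__name__)

print(accessor_name(DateType()))       # getInt
print(accessor_name(TimestampType()))  # getLong
```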





[jira] [Assigned] (SPARK-41225) Disable unsupported functions

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41225:


Assignee: (was: Apache Spark)

> Disable unsupported functions
> -
>
> Key: SPARK-41225
> URL: https://issues.apache.org/jira/browse/SPARK-41225
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
>
> Disable unsupported functions and throw a proper NotImplementedError in the 
> Python client.





[jira] [Assigned] (SPARK-41225) Disable unsupported functions

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41225:


Assignee: Apache Spark

> Disable unsupported functions
> -
>
> Key: SPARK-41225
> URL: https://issues.apache.org/jira/browse/SPARK-41225
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Apache Spark
>Priority: Major
>
> Disable unsupported functions and throw a proper NotImplementedError in the 
> Python client.





[jira] [Commented] (SPARK-41225) Disable unsupported functions

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637457#comment-17637457
 ] 

Apache Spark commented on SPARK-41225:
--

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/38762

> Disable unsupported functions
> -
>
> Key: SPARK-41225
> URL: https://issues.apache.org/jira/browse/SPARK-41225
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
>
> Disable unsupported functions and throw a proper NotImplementedError in the 
> Python client.





[jira] [Created] (SPARK-41225) Disable unsupported functions

2022-11-22 Thread Martin Grund (Jira)
Martin Grund created SPARK-41225:


 Summary: Disable unsupported functions
 Key: SPARK-41225
 URL: https://issues.apache.org/jira/browse/SPARK-41225
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Martin Grund


Disable unsupported functions and throw a proper NotImplementedError in the 
Python client.
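A minimal sketch of the pattern (illustrative only, not the actual Connect 
client code; the class and method names are assumptions):

```
def _unsupported(name: str):
    """Return a stub that fails loudly instead of silently misbehaving."""
    def stub(self, *args, **kwargs):
        raise NotImplementedError(f"{name}() is not implemented.")
    return stub

class DataFrame:
    # Hypothetical examples of APIs the client does not support yet.
    checkpoint = _unsupported("checkpoint")
    localCheckpoint = _unsupported("localCheckpoint")

DataFrame().checkpoint()  # raises NotImplementedError
```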





[jira] [Updated] (SPARK-39591) SPIP: Offset Management Improvements in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Description: 
Currently in Structured Streaming, at the beginning of every micro-batch the 
offset to process up to for the current batch is persisted to durable storage. 
At the end of every micro-batch, a marker indicating the completion of the 
current micro-batch is persisted to durable storage. For pipelines that, for 
example, read from Kafka and write to Kafka, where end-to-end exactly-once is 
not supported and latency is sensitive, we can allow users to configure offset 
commits to be written asynchronously; this commit operation then does not 
contribute to the batch duration, effectively lowering the overall latency of 
the pipeline.

 

SPIP Doc: 
https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing
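To make the targeted scenario concrete, here is a sketch of a latency-sensitive 
Kafka-to-Kafka pipeline (standard PySpark Kafka source/sink API; the 
async-commit option name below is purely hypothetical, and the SPIP doc defines 
the actual interface):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# At-least-once Kafka-to-Kafka pipeline where offset-commit time
# directly inflates the per-batch latency.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "in-topic")
    .load())

query = (events.selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "out-topic")
    .option("checkpointLocation", "/tmp/ckpt")
    # Hypothetical option for illustration: persist offsets
    # asynchronously so the commit does not block the micro-batch.
    .option("asyncOffsetCommitEnabled", "true")
    .start())
```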

  was:Currently in Structured Streaming, at the beginning of every micro-batch 
the offset to process up to for the current batch is persisted to durable 
storage. At the end of every micro-batch, a marker indicating the completion 
of the current micro-batch is persisted to durable storage. For pipelines 
that, for example, read from Kafka and write to Kafka, where end-to-end 
exactly-once is not supported and latency is sensitive, we can allow users to 
configure offset commits to be written asynchronously; this commit operation 
then does not contribute to the batch duration, effectively lowering the 
overall latency of the pipeline.


> SPIP: Offset Management Improvements in Structured Streaming
> 
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>  Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage. At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage. For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is sensitive, we can allow users 
> to configure offset commits to be written asynchronously; this commit 
> operation then does not contribute to the batch duration, effectively 
> lowering the overall latency of the pipeline.
>  
> SPIP Doc: 
>  
> https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing





[jira] [Updated] (SPARK-39591) SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Summary: SPIP: Asynchronous Offset Management in Structured Streaming  
(was: SPIP: Offset Management Improvements in Structured Streaming)

> SPIP: Asynchronous Offset Management in Structured Streaming
> 
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>  Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage. At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage. For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is sensitive, we can allow users 
> to configure offset commits to be written asynchronously; this commit 
> operation then does not contribute to the batch duration, effectively 
> lowering the overall latency of the pipeline.
>  
> SPIP Doc: 
>  
> https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing





[jira] [Updated] (SPARK-39591) SPIP: Offset Management Improvements in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Summary: SPIP: Offset Management Improvements in Structured Streaming  
(was: Offset Management Improvements in Structured Streaming)

> SPIP: Offset Management Improvements in Structured Streaming
> 
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage.  At the end of every micro-batch, a marker to indicate the 
> completion of this current micro-batch is persisted to durable storage. For 
> pipelines such as those that read from Kafka and write to Kafka, where 
> end-to-end exactly-once is not supported and latency is sensitive, we can 
> allow users to configure offset commits to be written asynchronously so that 
> this commit operation does not contribute to the batch duration, effectively 
> lowering the overall latency of the pipeline.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39591) SPIP: Offset Management Improvements in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Labels: SPIP  (was: )

> SPIP: Offset Management Improvements in Structured Streaming
> 
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>  Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage.  At the end of every micro-batch, a marker to indicate the 
> completion of this current micro-batch is persisted to durable storage. For 
> pipelines such as those that read from Kafka and write to Kafka, where 
> end-to-end exactly-once is not supported and latency is sensitive, we can 
> allow users to configure offset commits to be written asynchronously so that 
> this commit operation does not contribute to the batch duration, effectively 
> lowering the overall latency of the pipeline.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39591) Offset Management Improvements in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Labels: SPIP  (was: )

> Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>  Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage.  At the end of every micro-batch, a marker to indicate the 
> completion of this current micro-batch is persisted to durable storage. For 
> pipelines such as those that read from Kafka and write to Kafka, where 
> end-to-end exactly-once is not supported and latency is sensitive, we can 
> allow users to configure offset commits to be written asynchronously so that 
> this commit operation does not contribute to the batch duration, effectively 
> lowering the overall latency of the pipeline.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39591) Offset Management Improvements in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Labels:   (was: SPIP)

> Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage.  At the end of every micro-batch, a marker to indicate the 
> completion of this current micro-batch is persisted to durable storage. For 
> pipelines such as those that read from Kafka and write to Kafka, where 
> end-to-end exactly-once is not supported and latency is sensitive, we can 
> allow users to configure offset commits to be written asynchronously so that 
> this commit operation does not contribute to the batch duration, effectively 
> lowering the overall latency of the pipeline.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2022-11-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-41053:
---
Description: 
After SPARK-18085, the Spark history server (SHS) became more scalable for 
processing large applications by supporting a persistent 
KV-store (LevelDB/RocksDB) as the storage layer.

As for the live Spark UI, all the data is still stored in memory, which can 
put memory pressure on the Spark driver for large applications.

For better Spark UI scalability and Driver stability, I propose to
 * {*}Support storing all the UI data in a persistent KV store{*}. 
RocksDB/LevelDB provide low memory overhead, and their write/read performance 
is fast enough to serve the write/read workload for the live UI. SHS can 
leverage the persistent KV store to speed up its startup.

 * *Support a new Protobuf serializer for all the UI data.* The new serializer 
is supposed to be faster, according to benchmarks. It will be the default 
serializer for the persistent KV store of live UI. As for event logs, it is 
optional. The current serializer for UI data is JSON. When writing persistent 
KV-store, there is GZip compression. Since there is compression support in 
RocksDB/LevelDB, the new serializer won’t compress the output before writing to 
the persistent KV store. Here is a benchmark of writing/reading 100,000 
SQLExecutionUIData to/from RocksDB:

 
|*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total 
Size(MB)*|*Result total size in memory(MB)*|
|*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
|*Protobuf*|109.9|34.3|858|2105|

I am also proposing to support only RocksDB, instead of both LevelDB & 
RocksDB, in the live UI.

SPIP: 
[https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]

SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj

  was:
After SPARK-18085, the Spark history server(SHS) becomes more scalable for 
processing large applications by supporting a persistent 
KV-store(LevelDB/RocksDB) as the storage layer.

As for the live Spark UI, all the data is still stored in memory, which can 
bring memory pressures to the Spark driver for large applications.

For better Spark UI scalability and Driver stability, I propose to
 * {*}Support storing all the UI data in a persistent KV store{*}. 
RocksDB/LevelDB provides low memory overhead. Their write/read performance is 
fast enough to serve the write/read workload for live UI. SHS can leverage the 
persistent KV store to fasten its startup.

 * *Support a new Protobuf serializer for all the UI data.* The new serializer 
is supposed to be faster, according to benchmarks. It will be the default 
serializer for the persistent KV store of live UI. As for event logs, it is 
optional. The current serializer for UI data is JSON. When writing persistent 
KV-store, there is GZip compression. Since there is compression support in 
RocksDB/LevelDB, the new serializer won’t compress the output before writing to 
the persistent KV store. Here is a benchmark of writing/reading 100,000 
SQLExecutionUIData to/from RocksDB:

 
|*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total 
Size(MB)*|*Result total size in memory(MB)*|
|*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
|*Protobuf*|109.9|34.3|858|2105|

I am also proposing to support RocksDB instead of both LevelDB & RocksDB in the 
live UI.

SPIP: 
[https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]


> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server (SHS) became more scalable for 
> processing large applications by supporting a persistent 
> KV-store (LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> put memory pressure on the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * {*}Support storing all the UI data in a persistent KV store{*}. 
> RocksDB/LevelDB provide low memory overhead, and their write/read performance 
> is fast enough to serve the write/read workload for the live UI. SHS can 
> leverage the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to 

[jira] [Updated] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2022-11-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-41053:
---
Attachment: Better Spark UI scalability and Driver stability for large 
applications.pdf

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server (SHS) became more scalable for 
> processing large applications by supporting a persistent 
> KV-store (LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> put memory pressure on the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * {*}Support storing all the UI data in a persistent KV store{*}. 
> RocksDB/LevelDB provide low memory overhead, and their write/read performance 
> is fast enough to serve the write/read workload for the live UI. SHS can 
> leverage the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to benchmarks. It will be the 
> default serializer for the persistent KV store of live UI. As for event logs, 
> it is optional. The current serializer for UI data is JSON. When writing 
> persistent KV-store, there is GZip compression. Since there is compression 
> support in RocksDB/LevelDB, the new serializer won’t compress the output 
> before writing to the persistent KV store. Here is a benchmark of 
> writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>  
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total 
> Size(MB)*|*Result total size in memory(MB)*|
> |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
> I am also proposing to support only RocksDB, instead of both LevelDB & 
> RocksDB, in the live UI.
> SPIP: 
> [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
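
As a rough illustration of how the proposal could surface to users, a sketch 
assuming a config such as spark.ui.store.path for the live UI's disk-backed 
store; the config name is an assumption, not something confirmed by this 
ticket:

{code:scala}
// Hedged sketch: point the live UI at a disk-based KV store so UI data
// no longer has to be held entirely in driver memory.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("live-ui-kvstore-sketch")
  // assumed config: directory backing the live UI's persistent KV store
  .config("spark.ui.store.path", "/tmp/spark-ui-store")
  .getOrCreate()
{code}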



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41054) Support disk-based KVStore in live UI

2022-11-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41054.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38567
[https://github.com/apache/spark/pull/38567]

> Support disk-based KVStore in live UI
> -
>
> Key: SPARK-41054
> URL: https://issues.apache.org/jira/browse/SPARK-41054
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37313) Child stage using merged output or not should be based on the availability of merged output from parent stage

2022-11-22 Thread Mars (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37313 ]


Mars deleted comment on SPARK-37313:
--

was (Author: JIRAUSER290821):
as comment said 
[https://github.com/apache/spark/pull/34461#issuecomment-964557253]
I'm working on this Issue and trying to implement this functionality [~minyang] 
[~mridul] 

> Child stage using merged output or not should be based on the availability of 
> merged output from parent stage
> -
>
> Key: SPARK-37313
> URL: https://issues.apache.org/jira/browse/SPARK-37313
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.1
>Reporter: Minchu Yang
>Priority: Minor
>
> As discussed in the 
> [thread|https://github.com/apache/spark/pull/34461#pullrequestreview-799701494]
>  in SPARK-37023, during a stage retry, if the parent stage has already 
> generated merged output in the previous attempt, then with the current 
> behavior the child stage would not be able to fetch the merged output, as 
> this is controlled by 
> dependency.shuffleMergeEnabled (see current implementation 
> [here|https://github.com/apache/spark/blob/31b6f614d3173c8a5852243bf7d0b6200788432d/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala#L134-L136])
>  during the stage retry.
> Instead of using a single variable to control behavior at both the mapper 
> side (push side) and the reducer side (using merged output), whether the 
> child stage uses merged output or not must be based only on whether merged 
> output is available for it to use (as discussed 
> [here|https://github.com/apache/spark/pull/34461#issuecomment-964557253]).
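
A hedged sketch of the proposed decoupling, with illustrative names only: the 
reducer-side decision keys off merged-output availability rather than the 
single shuffleMergeEnabled flag.

{code:scala}
// Illustrative only: separate the mapper-side push flag from the
// reducer-side read decision during a stage retry.
case class ShuffleMergeState(
    shuffleMergeEnabled: Boolean, // mapper side: is push-based shuffle on?
    mergeFinalized: Boolean)      // reducer side: is merged output available?

// Read merged output whenever it exists, even if pushing was disabled
// for this retry attempt.
def useMergedOutput(state: ShuffleMergeState): Boolean = state.mergeFinalized
{code}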



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40988) Test case for insert partition should verify value

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637252#comment-17637252
 ] 

Apache Spark commented on SPARK-40988:
--

User 'rangareddy' has created a pull request for this issue:
https://github.com/apache/spark/pull/38761

> Test case for insert partition should verify value 
> ---
>
> Key: SPARK-40988
> URL: https://issues.apache.org/jira/browse/SPARK-40988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Ranga Reddy
>Priority: Minor
>
> Spark 3 does not validate the partition column type while inserting data, 
> but on the Hive side an exception is thrown when inserting values of a 
> different type.
> *Spark Code:*
>  
> {code:java}
> scala> val tableName="test_partition_table"
> tableName: String = test_partition_table
> scala> spark.sql(s"DROP TABLE IF EXISTS $tableName")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) 
> PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("SHOW tables").show(truncate=false)
> +---------+---------------------+-----------+
> |namespace|tableName            |isTemporary|
> +---------+---------------------+-----------+
> |default  |test_partition_table |false      |
> +---------+---------------------+-----------+
> scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, 
> false)
> +------------------------------------------+-----+
> |key                                       |value|
> +------------------------------------------+-----+
> |spark.sql.sources.validatePartitionColumns|true |
> +------------------------------------------+-----+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 
> 'Ranga')""")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"show partitions $tableName").show(50, false)
> +---------+
> |partition|
> +---------+
> |age=25   |
> +---------+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+-----+---+
> |id |name |age|
> +---+-----+---+
> |1  |Ranga|25 |
> +---+-----+---+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") 
> VALUES (2, 'Nishanth')""")
> res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"show partitions $tableName").show(50, false)
> +------------+
> |partition   |
> +------------+
> |age=25      |
> |age=test_age|
> +------------+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+--------+----+
> |id |name    |age |
> +---+--------+----+
> |1  |Ranga   |25  |
> |2  |Nishanth|null|
> +---+--------+----+ {code}
> *Hive Code:*
>  
>  
> {code:java}
> > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, 
> > 'Nishanth');
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10248]: Cannot add partition column age of type string as it cannot be 
> converted to type int (state=42000,code=10248){code}
>  
> *Expected Result:*
> When *spark.sql.sources.validatePartitionColumns=true*, the data type of the 
> partition value needs to be validated, and an exception needs to be thrown if 
> a value of the wrong data type is provided.
> *Reference:*
> [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources]
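
A hedged sketch of the assertion the requested test case could make, assuming 
the desired behavior (an analysis error for a mistyped static partition 
value). ScalaTest's intercept is used for brevity, and the table name comes 
from the repro above:

{code:scala}
// Sketch of a possible test body: with validatePartitionColumns=true,
// inserting a non-integer value into an INT partition column should fail.
import org.scalatest.Assertions._

spark.conf.set("spark.sql.sources.validatePartitionColumns", "true")
val e = intercept[org.apache.spark.sql.AnalysisException] {
  spark.sql(
    """INSERT INTO test_partition_table PARTITION (age="test_age")
      |VALUES (2, 'Nishanth')""".stripMargin)
}
// The error should name the offending partition column.
assert(e.getMessage.contains("age"))
{code}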



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40988) Test case for insert partition should verify value

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40988:


Assignee: Apache Spark

> Test case for insert partition should verify value 
> ---
>
> Key: SPARK-40988
> URL: https://issues.apache.org/jira/browse/SPARK-40988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Ranga Reddy
>Assignee: Apache Spark
>Priority: Minor
>
> Spark 3 does not validate the partition column type while inserting data, 
> but on the Hive side an exception is thrown when inserting values of a 
> different type.
> *Spark Code:*
>  
> {code:java}
> scala> val tableName="test_partition_table"
> tableName: String = test_partition_table
> scala> spark.sql(s"DROP TABLE IF EXISTS $tableName")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) 
> PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("SHOW tables").show(truncate=false)
> +---------+---------------------+-----------+
> |namespace|tableName            |isTemporary|
> +---------+---------------------+-----------+
> |default  |test_partition_table |false      |
> +---------+---------------------+-----------+
> scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, 
> false)
> +------------------------------------------+-----+
> |key                                       |value|
> +------------------------------------------+-----+
> |spark.sql.sources.validatePartitionColumns|true |
> +------------------------------------------+-----+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 
> 'Ranga')""")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"show partitions $tableName").show(50, false)
> +---------+
> |partition|
> +---------+
> |age=25   |
> +---------+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+-----+---+
> |id |name |age|
> +---+-----+---+
> |1  |Ranga|25 |
> +---+-----+---+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") 
> VALUES (2, 'Nishanth')""")
> res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"show partitions $tableName").show(50, false)
> +------------+
> |partition   |
> +------------+
> |age=25      |
> |age=test_age|
> +------------+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+--------+----+
> |id |name    |age |
> +---+--------+----+
> |1  |Ranga   |25  |
> |2  |Nishanth|null|
> +---+--------+----+ {code}
> *Hive Code:*
>  
>  
> {code:java}
> > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, 
> > 'Nishanth');
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10248]: Cannot add partition column age of type string as it cannot be 
> converted to type int (state=42000,code=10248){code}
>  
> *Expected Result:*
> When *spark.sql.sources.validatePartitionColumns=true*, the data type of the 
> partition value needs to be validated, and an exception needs to be thrown if 
> a value of the wrong data type is provided.
> *Reference:*
> [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40988) Test case for insert partition should verify value

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40988:


Assignee: (was: Apache Spark)

> Test case for insert partition should verify value 
> ---
>
> Key: SPARK-40988
> URL: https://issues.apache.org/jira/browse/SPARK-40988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Ranga Reddy
>Priority: Minor
>
> Spark 3 does not validate the partition column type while inserting data, 
> but on the Hive side an exception is thrown when inserting values of a 
> different type.
> *Spark Code:*
>  
> {code:java}
> scala> val tableName="test_partition_table"
> tableName: String = test_partition_table
> scala> spark.sql(s"DROP TABLE IF EXISTS $tableName")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) 
> PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("SHOW tables").show(truncate=false)
> +---------+---------------------+-----------+
> |namespace|tableName            |isTemporary|
> +---------+---------------------+-----------+
> |default  |test_partition_table |false      |
> +---------+---------------------+-----------+
> scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, 
> false)
> +------------------------------------------+-----+
> |key                                       |value|
> +------------------------------------------+-----+
> |spark.sql.sources.validatePartitionColumns|true |
> +------------------------------------------+-----+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 
> 'Ranga')""")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"show partitions $tableName").show(50, false)
> +---------+
> |partition|
> +---------+
> |age=25   |
> +---------+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+-----+---+
> |id |name |age|
> +---+-----+---+
> |1  |Ranga|25 |
> +---+-----+---+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") 
> VALUES (2, 'Nishanth')""")
> res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"show partitions $tableName").show(50, false)
> +------------+
> |partition   |
> +------------+
> |age=25      |
> |age=test_age|
> +------------+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+--------+----+
> |id |name    |age |
> +---+--------+----+
> |1  |Ranga   |25  |
> |2  |Nishanth|null|
> +---+--------+----+ {code}
> *Hive Code:*
>  
>  
> {code:java}
> > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, 
> > 'Nishanth');
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10248]: Cannot add partition column age of type string as it cannot be 
> converted to type int (state=42000,code=10248){code}
>  
> *Expected Result:*
> When *spark.sql.sources.validatePartitionColumns=true*, the data type of the 
> partition value needs to be validated, and an exception needs to be thrown if 
> a value of the wrong data type is provided.
> *Reference:*
> [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40988) Test case for insert partition should verify value

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637251#comment-17637251
 ] 

Apache Spark commented on SPARK-40988:
--

User 'rangareddy' has created a pull request for this issue:
https://github.com/apache/spark/pull/38761

> Test case for insert partition should verify value 
> ---
>
> Key: SPARK-40988
> URL: https://issues.apache.org/jira/browse/SPARK-40988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Ranga Reddy
>Priority: Minor
>
> Spark 3 does not validate the partition column type while inserting data, 
> but on the Hive side an exception is thrown when inserting values of a 
> different type.
> *Spark Code:*
>  
> {code:java}
> scala> val tableName="test_partition_table"
> tableName: String = test_partition_table
> scala> spark.sql(s"DROP TABLE IF EXISTS $tableName")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) 
> PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("SHOW tables").show(truncate=false)
> +---------+---------------------+-----------+
> |namespace|tableName            |isTemporary|
> +---------+---------------------+-----------+
> |default  |test_partition_table |false      |
> +---------+---------------------+-----------+
> scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, 
> false)
> +------------------------------------------+-----+
> |key                                       |value|
> +------------------------------------------+-----+
> |spark.sql.sources.validatePartitionColumns|true |
> +------------------------------------------+-----+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 
> 'Ranga')""")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"show partitions $tableName").show(50, false)
> +---------+
> |partition|
> +---------+
> |age=25   |
> +---------+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+-----+---+
> |id |name |age|
> +---+-----+---+
> |1  |Ranga|25 |
> +---+-----+---+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") 
> VALUES (2, 'Nishanth')""")
> res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"show partitions $tableName").show(50, false)
> +------------+
> |partition   |
> +------------+
> |age=25      |
> |age=test_age|
> +------------+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+--------+----+
> |id |name    |age |
> +---+--------+----+
> |1  |Ranga   |25  |
> |2  |Nishanth|null|
> +---+--------+----+ {code}
> *Hive Code:*
>  
>  
> {code:java}
> > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, 
> > 'Nishanth');
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10248]: Cannot add partition column age of type string as it cannot be 
> converted to type int (state=42000,code=10248){code}
>  
> *Expected Result:*
> When *spark.sql.sources.validatePartitionColumns=true*, the data type of the 
> partition value needs to be validated, and an exception needs to be thrown if 
> a value of the wrong data type is provided.
> *Reference:*
> [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637239#comment-17637239
 ] 

Apache Spark commented on SPARK-41219:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/38760

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4's IntegralDivide.
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41219:


Assignee: (was: Apache Spark)

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4's IntegralDivide.
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41219:


Assignee: Apache Spark

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Assignee: Apache Spark
>Priority: Major
>
> There seems to be a regression in Spark 3.4's IntegralDivide.
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41184) Fill NA tests are flaky

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637223#comment-17637223
 ] 

Apache Spark commented on SPARK-41184:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/38759

> Fill NA tests are flaky
> ---
>
> Key: SPARK-41184
> URL: https://issues.apache.org/jira/browse/SPARK-41184
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.4.0
>
>
> Connect's fill.na tests for Python are flaky. We need to disable them, and 
> investigate what is going on with the typing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41184) Fill NA tests are flaky

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637222#comment-17637222
 ] 

Apache Spark commented on SPARK-41184:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/38759

> Fill NA tests are flaky
> ---
>
> Key: SPARK-41184
> URL: https://issues.apache.org/jira/browse/SPARK-41184
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.4.0
>
>
> Connect's fill.na tests for Python are flaky. We need to disable them, and 
> investigate what is going on with the typing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41184) Fill NA tests are flaky

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637221#comment-17637221
 ] 

Apache Spark commented on SPARK-41184:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/38759

> Fill NA tests are flaky
> ---
>
> Key: SPARK-41184
> URL: https://issues.apache.org/jira/browse/SPARK-41184
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.4.0
>
>
> Connect's fill.na tests for Python are flaky. We need to disable them, and 
> investigate what is going on with the typing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41165) Arrow collect should factor in failures

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637219#comment-17637219
 ] 

Apache Spark commented on SPARK-41165:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/38759

> Arrow collect should factor in failures
> ---
>
> Key: SPARK-41165
> URL: https://issues.apache.org/jira/browse/SPARK-41165
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.4.0
>
>
> Connect's Arrow collect path does not factor in failures. If a failure 
> occurs, the collect code path will hang.
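
A hedged sketch of the failure-propagation idea, with illustrative names (not 
Connect's actual internals): enqueue the error so the consumer fails fast 
instead of blocking on a batch that never arrives.

{code:scala}
// Illustrative: a collector queue that surfaces failures to the reader.
import java.util.concurrent.LinkedBlockingQueue

sealed trait Result
case class Batch(bytes: Array[Byte]) extends Result
case class Failed(error: Throwable) extends Result
case object Done extends Result

val queue = new LinkedBlockingQueue[Result]()

// Producer side: on job failure, enqueue the error instead of nothing.
def onJobFailure(e: Throwable): Unit = queue.put(Failed(e))

// Consumer side: rethrow rather than hanging forever.
def nextBatch(): Option[Array[Byte]] = queue.take() match {
  case Batch(b)  => Some(b)
  case Done      => None
  case Failed(e) => throw e
}
{code}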



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41165) Arrow collect should factor in failures

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637220#comment-17637220
 ] 

Apache Spark commented on SPARK-41165:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/38759

> Arrow collect should factor in failures
> ---
>
> Key: SPARK-41165
> URL: https://issues.apache.org/jira/browse/SPARK-41165
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.4.0
>
>
> Connect's Arrow collect path does not factor in failures. If a failure 
> occurs, the collect code path will hang.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41224) Optimize Arrow collect to stream the result from server to client

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637218#comment-17637218
 ] 

Apache Spark commented on SPARK-41224:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/38759

> Optimize Arrow collect to stream the result from server to client
> -
>
> Key: SPARK-41224
> URL: https://issues.apache.org/jira/browse/SPARK-41224
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/pull/38468 implemented Arrow-based collect, 
> but it cannot stream the result from the server to the client. We can stream 
> the results if the first partition is collected first.
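
A hedged sketch of the streaming idea, with illustrative names: buffer 
out-of-order partition results, but flush to the client as soon as the next 
in-order partition arrives.

{code:scala}
// Illustrative: emit Arrow batches to the client in partition order,
// streaming each ready prefix instead of waiting for the whole result.
import scala.collection.mutable

def inOrderEmitter(onBatch: Array[Byte] => Unit): (Int, Array[Byte]) => Unit = {
  val pending = mutable.Map.empty[Int, Array[Byte]]
  var next = 0
  (partitionId: Int, batch: Array[Byte]) => {
    pending(partitionId) = batch
    while (pending.contains(next)) { // flush every ready prefix
      onBatch(pending.remove(next).get)
      next += 1
    }
  }
}
{code}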



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41224) Optimize Arrow collect to stream the result from server to client

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41224:


Assignee: Apache Spark

> Optimize Arrow collect to stream the result from server to client
> -
>
> Key: SPARK-41224
> URL: https://issues.apache.org/jira/browse/SPARK-41224
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> https://github.com/apache/spark/pull/38468 implemented Arrow-based collect, 
> but it cannot stream the result from the server to the client. We can stream 
> the results if the first partition is collected first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41224) Optimize Arrow collect to stream the result from server to client

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41224:


Assignee: (was: Apache Spark)

> Optimize Arrow collect to stream the result from server to client
> -
>
> Key: SPARK-41224
> URL: https://issues.apache.org/jira/browse/SPARK-41224
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/pull/38468 implemented Arrow-based collect, 
> but it cannot stream the result from the server to the client. We can stream 
> the results if the first partition is collected first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637216#comment-17637216
 ] 

XiDuo You edited comment on SPARK-41219 at 11/22/22 11:59 AM:
--

it seems the root cause is that decimal.toPrecision breaks when changing to 
decimal(0, 0)
{code:java}
val df = Seq(0).toDF("a")
// return 0
df.selectExpr("cast(0 as decimal(0,0))").show
// return 0
df.select(lit(BigDecimal(0)) as "c").show
// return null
df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show
{code}


was (Author: ulysses):
it seems the root reason is decimal.toPrecision will break when change to 
decimal(0, 0)

{code:java}
val df = Seq(0).toDF("a")
// return 0
df.selectExpr("cast(0 as decimal(0,0))").show
// return null
df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show
{code}

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4's IntegralDivide.
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637216#comment-17637216
 ] 

XiDuo You commented on SPARK-41219:
---

it seems the root cause is that decimal.toPrecision breaks when changing to 
decimal(0, 0)

{code:java}
val df = Seq(0).toDF("a")
// return 0
df.selectExpr("cast(0 as decimal(0,0))").show
// return null
df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show
{code}
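
For completeness, a hedged sketch probing the same suspicion with Spark's 
internal Decimal API (changePrecision is the Boolean-returning sibling of 
toPrecision); whether precision 0 is even accepted is exactly what is in 
question here:

{code:scala}
// Sketch: if changing a zero Decimal to precision/scale (0, 0) fails,
// the surrounding cast produces null, which would explain the div result.
import org.apache.spark.sql.types.Decimal

val d = Decimal(BigDecimal(0))
val ok = d.changePrecision(0, 0) // returns false when the change fails
println(s"changePrecision(0,0) succeeded: $ok, value: $d")
{code}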

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4's IntegralDivide.
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41224) Optimize Arrow collect to stream the result from server to client

2022-11-22 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-41224:


 Summary: Optimize Arrow collect to stream the result from server 
to client
 Key: SPARK-41224
 URL: https://issues.apache.org/jira/browse/SPARK-41224
 Project: Spark
  Issue Type: Task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


https://github.com/apache/spark/pull/38468 implemented Arrow-based collect, but 
it cannot stream the result from the server to the client. We can stream the 
results if the first partition is collected first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41180) Assign an error class to "Cannot parse the data type"

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41180:


Assignee: (was: Apache Spark)

> Assign an error class to "Cannot parse the data type"
> -
>
> Key: SPARK-41180
> URL: https://issues.apache.org/jira/browse/SPARK-41180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> The code below shows the issue:
> {code}
> > select from_csv('1', 'a InvalidType');
> org.apache.spark.sql.AnalysisException
> {
>   "errorClass" : "LEGACY",
>   "messageParameters" : {
> "message" : "Cannot parse the data type: \n[PARSE_SYNTAX_ERROR] Syntax 
> error at or near 'InvalidType': extra input 'InvalidType'(line 1, pos 
> 2)\n\n== SQL ==\na InvalidType\n--^^^\n\nFailed fallback parsing: \nDataType 
> invalidtype is not supported.(line 1, pos 2)\n\n== SQL ==\na 
> InvalidType\n--^^^\n; line 1 pos 7"
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41180) Assign an error class to "Cannot parse the data type"

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41180:


Assignee: Apache Spark

> Assign an error class to "Cannot parse the data type"
> -
>
> Key: SPARK-41180
> URL: https://issues.apache.org/jira/browse/SPARK-41180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The code below shows the issue:
> {code}
> > select from_csv('1', 'a InvalidType');
> org.apache.spark.sql.AnalysisException
> {
>   "errorClass" : "LEGACY",
>   "messageParameters" : {
> "message" : "Cannot parse the data type: \n[PARSE_SYNTAX_ERROR] Syntax 
> error at or near 'InvalidType': extra input 'InvalidType'(line 1, pos 
> 2)\n\n== SQL ==\na InvalidType\n--^^^\n\nFailed fallback parsing: \nDataType 
> invalidtype is not supported.(line 1, pos 2)\n\n== SQL ==\na 
> InvalidType\n--^^^\n; line 1 pos 7"
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41223) Upgrade slf4j to 2.0.4

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637207#comment-17637207
 ] 

Apache Spark commented on SPARK-41223:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38758

> Upgrade slf4j to 2.0.4
> --
>
> Key: SPARK-41223
> URL: https://issues.apache.org/jira/browse/SPARK-41223
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://www.slf4j.org/news.html#2.0.4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41223) Upgrade slf4j to 2.0.4

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637205#comment-17637205
 ] 

Apache Spark commented on SPARK-41223:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38758

> Upgrade slf4j to 2.0.4
> --
>
> Key: SPARK-41223
> URL: https://issues.apache.org/jira/browse/SPARK-41223
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://www.slf4j.org/news.html#2.0.4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41223) Upgrade slf4j to 2.0.4

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41223:


Assignee: Apache Spark

> Upgrade slf4j to 2.0.4
> --
>
> Key: SPARK-41223
> URL: https://issues.apache.org/jira/browse/SPARK-41223
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> https://www.slf4j.org/news.html#2.0.4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41223) Upgrade slf4j to 2.0.4

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41223:


Assignee: (was: Apache Spark)

> Upgrade slf4j to 2.0.4
> --
>
> Key: SPARK-41223
> URL: https://issues.apache.org/jira/browse/SPARK-41223
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://www.slf4j.org/news.html#2.0.4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637201#comment-17637201
 ] 

XiDuo You commented on SPARK-41219:
---

I'm looking at this

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4's IntegralDivide.
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41223) Upgrade slf4j to 2.0.4

2022-11-22 Thread Yang Jie (Jira)
Yang Jie created SPARK-41223:


 Summary: Upgrade slf4j to 2.0.4
 Key: SPARK-41223
 URL: https://issues.apache.org/jira/browse/SPARK-41223
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Yang Jie


https://www.slf4j.org/news.html#2.0.4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41222) Unify the typing definitions

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637199#comment-17637199
 ] 

Apache Spark commented on SPARK-41222:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38757

> Unify the typing definitions
> 
>
> Key: SPARK-41222
> URL: https://issues.apache.org/jira/browse/SPARK-41222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41222) Unify the typing definitions

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41222:


Assignee: (was: Apache Spark)

> Unify the typing definitions
> 
>
> Key: SPARK-41222
> URL: https://issues.apache.org/jira/browse/SPARK-41222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41222) Unify the typing definitions

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41222:


Assignee: Apache Spark

> Unify the typing definitions
> 
>
> Key: SPARK-41222
> URL: https://issues.apache.org/jira/browse/SPARK-41222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41222) Unify the typing definitions

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637195#comment-17637195
 ] 

Apache Spark commented on SPARK-41222:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38757

> Unify the typing definitions
> 
>
> Key: SPARK-41222
> URL: https://issues.apache.org/jira/browse/SPARK-41222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41222) Unify the typing definitions

2022-11-22 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41222:
-

 Summary: Unify the typing definitions
 Key: SPARK-41222
 URL: https://issues.apache.org/jira/browse/SPARK-41222
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41220) Range partitioner sample supports column pruning

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637149#comment-17637149
 ] 

Apache Spark commented on SPARK-41220:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/38756

> Range partitioner sample supports column pruning
> 
>
> Key: SPARK-41220
> URL: https://issues.apache.org/jira/browse/SPARK-41220
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> When we do a global sort, we first sample the data to compute range bounds, 
> then use the range partitioner to do the shuffle exchange.
> The issue is that the sample plan is coupled with the shuffle plan, so we 
> cannot optimize the sample plan independently. The sample only needs the 
> columns in the sort order, but the shuffle plan carries all data columns. 
> So, at a minimum, we can do column pruning on the sample plan so that it 
> fetches only the ordering columns.
> A common example is: `OPTIMIZE table ZORDER BY columns`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41220) Range partitioner sample supports column pruning

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41220:


Assignee: (was: Apache Spark)

> Range partitioner sample supports column pruning
> 
>
> Key: SPARK-41220
> URL: https://issues.apache.org/jira/browse/SPARK-41220
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> When we do a global sort, we first sample the data to compute range bounds, 
> then use the range partitioner to do the shuffle exchange.
> The issue is that the sample plan is coupled with the shuffle plan, so we 
> cannot optimize the sample plan independently. The sample only needs the 
> columns in the sort order, but the shuffle plan carries all data columns. 
> So, at a minimum, we can do column pruning on the sample plan so that it 
> fetches only the ordering columns.
> A common example is: `OPTIMIZE table ZORDER BY columns`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41220) Range partitioner sample supports column pruning

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41220:


Assignee: Apache Spark

> Range partitioner sample supports column pruning
> 
>
> Key: SPARK-41220
> URL: https://issues.apache.org/jira/browse/SPARK-41220
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> When we do a global sort, we first sample the data to compute range bounds, 
> then use the range partitioner to do the shuffle exchange.
> The issue is that the sample plan is coupled with the shuffle plan, so we 
> cannot optimize the sample plan independently. The sample only needs the 
> columns in the sort order, but the shuffle plan carries all data columns. 
> So, at a minimum, we can do column pruning on the sample plan so that it 
> fetches only the ordering columns.
> A common example is: `OPTIMIZE table ZORDER BY columns`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41220) Range partitioner sample supports column pruning

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637148#comment-17637148
 ] 

Apache Spark commented on SPARK-41220:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/38756

> Range partitioner sample supports column pruning
> 
>
> Key: SPARK-41220
> URL: https://issues.apache.org/jira/browse/SPARK-41220
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> When we do a global sort, we first sample the data to compute range bounds, 
> then use the range partitioner to do the shuffle exchange.
> The issue is that the sample plan is coupled with the shuffle plan, so we 
> cannot optimize the sample plan independently. The sample only needs the 
> columns in the sort order, but the shuffle plan carries all data columns. 
> So, at a minimum, we can do column pruning on the sample plan so that it 
> fetches only the ordering columns.
> A common example is: `OPTIMIZE table ZORDER BY columns`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41135) Rename UNSUPPORTED_EMPTY_LOCATION to INVALID_EMPTY_LOCATION

2022-11-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41135:


Assignee: Haejoon Lee

> Rename UNSUPPORTED_EMPTY_LOCATION to INVALID_EMPTY_LOCATION
> ---
>
> Key: SPARK-41135
> URL: https://issues.apache.org/jira/browse/SPARK-41135
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> The name of `UNSUPPORTED_EMPTY_LOCATION` can be improved, along with its 
> error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41135) Rename UNSUPPORTED_EMPTY_LOCATION to INVALID_EMPTY_LOCATION

2022-11-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41135.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38650
[https://github.com/apache/spark/pull/38650]

> Rename UNSUPPORTED_EMPTY_LOCATION to INVALID_EMPTY_LOCATION
> ---
>
> Key: SPARK-41135
> URL: https://issues.apache.org/jira/browse/SPARK-41135
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> The name of `UNSUPPORTED_EMPTY_LOCATION` can be improved, along with its 
> error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41221) Add the error class INVALID_FORMAT

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637129#comment-17637129
 ] 

Apache Spark commented on SPARK-41221:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38755

> Add the error class INVALID_FORMAT
> --
>
> Key: SPARK-41221
> URL: https://issues.apache.org/jira/browse/SPARK-41221
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Introduce a new error class for errors related to an invalid format or pattern.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41221) Add the error class INVALID_FORMAT

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41221:


Assignee: Max Gekk  (was: Apache Spark)

> Add the error class INVALID_FORMAT
> --
>
> Key: SPARK-41221
> URL: https://issues.apache.org/jira/browse/SPARK-41221
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Introduce a new error class for errors related to an invalid format or pattern.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41221) Add the error class INVALID_FORMAT

2022-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637128#comment-17637128
 ] 

Apache Spark commented on SPARK-41221:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38755

> Add the error class INVALID_FORMAT
> --
>
> Key: SPARK-41221
> URL: https://issues.apache.org/jira/browse/SPARK-41221
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Introduce a new error class for errors related to an invalid format or pattern.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41221) Add the error class INVALID_FORMAT

2022-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41221:


Assignee: Apache Spark  (was: Max Gekk)

> Add the error class INVALID_FORMAT
> --
>
> Key: SPARK-41221
> URL: https://issues.apache.org/jira/browse/SPARK-41221
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Introduce a new error class for errors related to an invalid format or pattern.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41221) Add the error class INVALID_FORMAT

2022-11-22 Thread Max Gekk (Jira)
Max Gekk created SPARK-41221:


 Summary: Add the error class INVALID_FORMAT
 Key: SPARK-41221
 URL: https://issues.apache.org/jira/browse/SPARK-41221
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk


Introduce a new error class for errors related to an invalid format or pattern.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41212) Implement `DataFrame.isEmpty`

2022-11-22 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41212.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38734
[https://github.com/apache/spark/pull/38734]

> Implement `DataFrame.isEmpty`
> -
>
> Key: SPARK-41212
> URL: https://issues.apache.org/jira/browse/SPARK-41212
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
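
For context, a brief usage sketch (the Scala Dataset API has offered an 
equivalent `isEmpty` for several releases; this ticket tracks exposing the 
same method through the Spark Connect Python client):

{code:java}
import spark.implicits._

// isEmpty short-circuits on the first row instead of counting everything.
val df = spark.range(100).filter($"id" > 1000)
assert(df.isEmpty)  // no rows match the filter

// The same check expressed by hand:
val isEmptyByHand = df.limit(1).count() == 0
{code}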




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41212) Implement `DataFrame.isEmpty`

2022-11-22 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41212:
-

Assignee: Ruifeng Zheng

> Implement `DataFrame.isEmpty`
> -
>
> Key: SPARK-41212
> URL: https://issues.apache.org/jira/browse/SPARK-41212
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637102#comment-17637102
 ] 

Yuming Wang commented on SPARK-41219:
-

cc [~ulysses]

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4's IntegralDivide.
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
>  
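
For reference, a quick sanity check of the expected semantics in plain Scala 
(no Spark needed): integral division truncates toward zero, so both values 
should divide to 0, which matches the Spark 3.3.0 output above.

{code:java}
// Expected `div` semantics, checked with plain BigDecimal arithmetic:
// 0.5944910 / 100 = 0.005944910, which truncates toward zero to 0.
val inputs = Seq(BigDecimal("0.5944910"), BigDecimal("0.3314242"))
val expected = inputs.map(a => (a / 100).toBigInt)  // Seq(0, 0), never null
{code}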



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2022-11-22 Thread Gabor Roczei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637096#comment-17637096
 ] 

Gabor Roczei commented on SPARK-38230:
--

Hi [~coalchan],

[Your pull request|https://github.com/apache/spark/pull/35549] has been 
automatically closed by the GitHub action. I would like to create a new pull 
request based on yours and continue working on this, if you agree.
 

> InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions 
> in most cases
> ---
>
> Key: SPARK-38230
> URL: https://issues.apache.org/jira/browse/SPARK-38230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Coal Chan
>Priority: Major
>
> In 
> `org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand`,
>  `sparkSession.sessionState.catalog.listPartitions` calls the Hive metastore 
> client method `org.apache.hadoop.hive.metastore.listPartitionsPsWithAuth`, 
> which issues multiple queries per partition against the Hive metastore 
> database. So when you insert into a table with very many partitions 
> (e.g. 10k), it produces an enormous number of metastore queries (n per 
> partition, i.e. n * 10k), which puts a lot of strain on the database.
> In fact, `listPartitions` is only called to obtain the partition locations 
> and compute `customPartitionLocations`. In most cases there are no custom 
> partition locations, so the partition names alone are enough, and we can 
> call `listPartitionNames` instead; a sketch follows below.
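
To make the proposal concrete, here is a hedged sketch of the suggested 
direction (written as if inside the command, using the internal SessionCatalog 
API; the table identifier is hypothetical and the exact integration point may 
differ):

{code:java}
import org.apache.spark.sql.catalyst.TableIdentifier

val identifier = TableIdentifier("events", Some("db1"))  // hypothetical table

// Cheap call: one metastore round trip returning partition names such as
// "dt=2022-11-22/hr=00".
val partitionNames =
  sparkSession.sessionState.catalog.listPartitionNames(identifier)

// Expensive path the ticket wants to skip when there are no custom partition
// locations: materializes full CatalogTablePartition objects and triggers
// per-partition queries on the metastore database.
// val partitions =
//   sparkSession.sessionState.catalog.listPartitions(identifier)
{code}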



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41220) Range partitioner sample supports column pruning

2022-11-22 Thread XiDuo You (Jira)
XiDuo You created SPARK-41220:
-

 Summary: Range partitioner sample supports column pruning
 Key: SPARK-41220
 URL: https://issues.apache.org/jira/browse/SPARK-41220
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


When we do a global sort, we first sample the data to compute range bounds, 
then use the range partitioner to do the shuffle exchange.
The issue is that the sample plan is coupled with the shuffle plan, so we 
cannot optimize the sample plan independently. The sample only needs the 
columns in the sort order, but the shuffle plan carries all data columns. So, 
at a minimum, we can do column pruning on the sample plan so that it fetches 
only the ordering columns (see the sketch after the example below).

A common example is: `OPTIMIZE table ZORDER BY columns`
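
As a rough illustration of the intended win (a minimal sketch in plain 
DataFrame terms, not the actual planner change; the column names and sampling 
fraction are made up), range bounds depend only on the sort key, so the sample 
can project away every other column before any data moves:

{code:java}
// Minimal sketch: a wide table to be globally sorted by a single key column.
val df = spark.range(0L, 1000000L)
  .selectExpr("id AS key", "repeat('x', 1000) AS payload")  // wide payload

// Pruned sample: project the ordering column first, then sample it to
// estimate range bounds, instead of sampling whole rows with `payload`.
val sampledKeys = df.select("key").sample(0.001)
  .collect().map(_.getLong(0)).sorted

// Pick numPartitions - 1 evenly spaced keys as the range bounds.
val numPartitions = 8
val bounds = (1 until numPartitions)
  .map(i => sampledKeys(i * sampledKeys.length / numPartitions))
{code}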





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org