[jira] [Commented] (SPARK-40852) Implement `DataFrame.summary`

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644199#comment-17644199
 ] 

Apache Spark commented on SPARK-40852:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/38962

> Implement `DataFrame.summary`
> -
>
> Key: SPARK-40852
> URL: https://issues.apache.org/jira/browse/SPARK-40852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
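For context, `DataFrame.summary` already exists in the Scala Dataset API; a 
minimal sketch of the behavior being ported to Spark Connect (values are 
illustrative):
{noformat}
val df = spark.range(100).toDF("x")  // assumes a running SparkSession `spark`

// With no arguments, summary() computes count, mean, stddev, min,
// the 25%/50%/75% approximate percentiles, and max for each column.
df.summary().show()

// Specific statistics can also be requested by name:
df.summary("count", "min", "50%", "max").show()
{noformat}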




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40852) Implement `DataFrame.summary`

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644200#comment-17644200
 ] 

Apache Spark commented on SPARK-40852:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/38962

> Implement `DataFrame.summary`
> -
>
> Key: SPARK-40852
> URL: https://issues.apache.org/jira/browse/SPARK-40852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41381) Implement count_distinct and sum_distinct functions

2022-12-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41381:
-

Assignee: Ruifeng Zheng

> Implement count_distinct and sum_distinct functions
> ---
>
> Key: SPARK-41381
> URL: https://issues.apache.org/jira/browse/SPARK-41381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>
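For context, these functions already exist in the Scala functions API; a 
minimal sketch of the semantics being ported to Spark Connect (data is 
illustrative):
{noformat}
import org.apache.spark.sql.functions.{count_distinct, sum_distinct}
import spark.implicits._  // assumes a running SparkSession `spark`

val df = Seq(1, 1, 2, 3, 3).toDF("v")

// count_distinct counts distinct values (3 here); sum_distinct sums
// each distinct value once: 1 + 2 + 3 = 6, not 10.
df.select(count_distinct($"v"), sum_distinct($"v")).show()
{noformat}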




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41381) Implement count_distinct and sum_distinct functions

2022-12-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41381.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38914
[https://github.com/apache/spark/pull/38914]

> Implement count_distinct and sum_distinct functions
> ---
>
> Key: SPARK-41381
> URL: https://issues.apache.org/jira/browse/SPARK-41381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41437) Do not optimize the input query twice for v1 write fallback

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644185#comment-17644185
 ] 

Apache Spark commented on SPARK-41437:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/38942

> Do not optimize the input query twice for v1 write fallback
> ---
>
> Key: SPARK-41437
> URL: https://issues.apache.org/jira/browse/SPARK-41437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41437) Do not optimize the input query twice for v1 write fallback

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41437:


Assignee: (was: Apache Spark)

> Do not optimize the input query twice for v1 write fallback
> ---
>
> Key: SPARK-41437
> URL: https://issues.apache.org/jira/browse/SPARK-41437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41437) Do not optimize the input query twice for v1 write fallback

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41437:


Assignee: Apache Spark

> Do not optimize the input query twice for v1 write fallback
> ---
>
> Key: SPARK-41437
> URL: https://issues.apache.org/jira/browse/SPARK-41437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41437) Do not optimize the input query twice for v1 write fallback

2022-12-06 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-41437:
---

 Summary: Do not optimize the input query twice for v1 write 
fallback
 Key: SPARK-41437
 URL: https://issues.apache.org/jira/browse/SPARK-41437
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41283) Feature parity: Functions API in Spark Connect

2022-12-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-41283:
--
Summary: Feature parity: Functions API in Spark Connect  (was: Feature 
parity: functions API in Spark Connect)

> Feature parity: Functions API in Spark Connect
> --
>
> Key: SPARK-41283
> URL: https://issues.apache.org/jira/browse/SPARK-41283
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Xinrong Meng
>Priority: Critical
>
> Implement functions API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644175#comment-17644175
 ] 

Zhe Dong commented on SPARK-41386:
--

Hi [~podongfeng],

That was my mistake; I have removed it. Sorry about that.

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expect that file sizes should be at least 20m*0.5 = 10m. But in fact, 
> we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small 
> files in another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41436) Implement `collection` functions: A~C

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41436:


Assignee: (was: Apache Spark)

> Implement `collection` functions: A~C
> -
>
> Key: SPARK-41436
> URL: https://issues.apache.org/jira/browse/SPARK-41436
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41436) Implement `collection` functions: A~C

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41436:


Assignee: Apache Spark

> Implement `collection` functions: A~C
> -
>
> Key: SPARK-41436
> URL: https://issues.apache.org/jira/browse/SPARK-41436
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41436) Implement `collection` functions: A~C

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644174#comment-17644174
 ] 

Apache Spark commented on SPARK-41436:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38961

> Implement `collection` functions: A~C
> -
>
> Key: SPARK-41436
> URL: https://issues.apache.org/jira/browse/SPARK-41436
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Dong updated SPARK-41386:
-
Epic Link:   (was: SPARK-39375)

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expect that file sizes should be at least 20m*0.5 = 10m. But in fact, 
> we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small 
> files in another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41436) Implement `collection` functions: A~C

2022-12-06 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41436:
-

 Summary: Implement `collection` functions: A~C
 Key: SPARK-41436
 URL: https://issues.apache.org/jira/browse/SPARK-41436
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41435) Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644173#comment-17644173
 ] 

Apache Spark commented on SPARK-41435:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38960

> Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` 
> when args is not null
> -
>
> Key: SPARK-41435
> URL: https://issues.apache.org/jira/browse/SPARK-41435
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
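For context, the change concerns the error raised when `curdate` is called 
with arguments; a minimal sketch of the intended behavior (the exact message 
text is an assumption):
{noformat}
spark.sql("SELECT curdate()")   // OK: returns the current date

// curdate takes no arguments, so this should now fail with the
// user-facing WRONG_NUM_ARGS error class instead of the internal
// _LEGACY_ERROR_TEMP_1043 one:
spark.sql("SELECT curdate(1)")
{noformat}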




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41435) Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41435:


Assignee: (was: Apache Spark)

> Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` 
> when args is not null
> -
>
> Key: SPARK-41435
> URL: https://issues.apache.org/jira/browse/SPARK-41435
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41435) Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41435:


Assignee: Apache Spark

> Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` 
> when args is not null
> -
>
> Key: SPARK-41435
> URL: https://issues.apache.org/jira/browse/SPARK-41435
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41435) Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644172#comment-17644172
 ] 

Apache Spark commented on SPARK-41435:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38960

> Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` 
> when args is not null
> -
>
> Key: SPARK-41435
> URL: https://issues.apache.org/jira/browse/SPARK-41435
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41435) Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null

2022-12-06 Thread Yang Jie (Jira)
Yang Jie created SPARK-41435:


 Summary: Make `curdate()` throw `WRONG_NUM_ARGS` instead of 
`_LEGACY_ERROR_TEMP_1043` when args is not null
 Key: SPARK-41435
 URL: https://issues.apache.org/jira/browse/SPARK-41435
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41298) Getting Count on data frame is giving the performance issue

2022-12-06 Thread Ramakrishna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644164#comment-17644164
 ] 

Ramakrishna commented on SPARK-41298:
-

Can someone please check this behavior and get back to me as soon as possible.

> Getting Count on data frame is giving the performance issue
> ---
>
> Key: SPARK-41298
> URL: https://issues.apache.org/jira/browse/SPARK-41298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Ramakrishna
>Priority: Major
>
> We are invoking the below query on Teradata:
> 1) Dataset<Row> df = spark.read.format("jdbc"). . . load();
> 2) long count = df.count();
> When we execute df.count(), Spark internally issues the below query on 
> Teradata, which wastes a lot of CPU on Teradata, and the DBAs are 
> complaining about this query.
>  
> Query: SELECT 1 FROM () SPARK_SUB_TAB
> Response:
> 1
> 1
> 1
> 1
> 1
> ..
> 1
>  
> Is this expected behavior from Spark, or is it a bug?
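For what it's worth, "SELECT 1 FROM (...) SPARK_SUB_TAB" is how Spark's JDBC 
source counts rows without fetching any columns. If only the row count is 
needed, a hedged workaround is to push the COUNT down to Teradata through the 
dbtable subquery; the URL, table name, and alias below are illustrative:
{noformat}
// Hypothetical sketch: let Teradata compute the count instead of
// streaming one "1" per row back to Spark.
val countDf = spark.read.format("jdbc")
  .option("url", jdbcUrl)  // placeholder connection URL
  .option("dbtable", "(SELECT COUNT(*) AS cnt FROM my_table) t")
  .load()

// One row, one column; the JDBC type may be a decimal, so convert defensively.
val count = countDf.first().get(0).toString.toLong
{noformat}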



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41415) SASL Request Retries

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41415:


Assignee: Apache Spark

> SASL Request Retries
> 
>
> Key: SPARK-41415
> URL: https://issues.apache.org/jira/browse/SPARK-41415
> Project: Spark
>  Issue Type: Task
>  Components: Shuffle
>Affects Versions: 3.2.4
>Reporter: Aravind Patnam
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41415) SASL Request Retries

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41415:


Assignee: (was: Apache Spark)

> SASL Request Retries
> 
>
> Key: SPARK-41415
> URL: https://issues.apache.org/jira/browse/SPARK-41415
> Project: Spark
>  Issue Type: Task
>  Components: Shuffle
>Affects Versions: 3.2.4
>Reporter: Aravind Patnam
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41415) SASL Request Retries

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644152#comment-17644152
 ] 

Apache Spark commented on SPARK-41415:
--

User 'akpatnam25' has created a pull request for this issue:
https://github.com/apache/spark/pull/38959

> SASL Request Retries
> 
>
> Key: SPARK-41415
> URL: https://issues.apache.org/jira/browse/SPARK-41415
> Project: Spark
>  Issue Type: Task
>  Components: Shuffle
>Affects Versions: 3.2.4
>Reporter: Aravind Patnam
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Dong updated SPARK-41386:
-
Affects Version/s: 3.3.1
   (was: 3.4.0)

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expect that file sizes should be at least 20m*0.5 = 10m. But in fact, 
> we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small 
> files in another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41434) Support LambdaFunction expression

2022-12-06 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41434:
-

 Summary: Support LambdaFunction expression
 Key: SPARK-41434
 URL: https://issues.apache.org/jira/browse/SPARK-41434
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng
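For context, a LambdaFunction is the expression behind higher-order functions 
such as transform/filter/aggregate; a minimal Scala sketch of what it powers 
(data is illustrative):
{noformat}
import org.apache.spark.sql.functions.{col, transform}
import spark.implicits._  // assumes a running SparkSession `spark`

val df = Seq(Seq(1, 2, 3)).toDF("xs")

// The lambda `x => x + 1` is resolved into a LambdaFunction expression.
df.select(transform(col("xs"), x => x + 1)).show()
{noformat}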






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41433) Make Max Arrow BatchSize configurable

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41433:


Assignee: (was: Apache Spark)

> Make Max Arrow BatchSize configurable
> -
>
> Key: SPARK-41433
> URL: https://issues.apache.org/jira/browse/SPARK-41433
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41433) Make Max Arrow BatchSize configurable

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644145#comment-17644145
 ] 

Apache Spark commented on SPARK-41433:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38958

> Make Max Arrow BatchSize configurable
> -
>
> Key: SPARK-41433
> URL: https://issues.apache.org/jira/browse/SPARK-41433
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41433) Make Max Arrow BatchSize configurable

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644146#comment-17644146
 ] 

Apache Spark commented on SPARK-41433:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38958

> Make Max Arrow BatchSize configurable
> -
>
> Key: SPARK-41433
> URL: https://issues.apache.org/jira/browse/SPARK-41433
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41433) Make Max Arrow BatchSize configurable

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41433:


Assignee: Apache Spark

> Make Max Arrow BatchSize configurable
> -
>
> Key: SPARK-41433
> URL: https://issues.apache.org/jira/browse/SPARK-41433
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41433) Make Max Arrow BatchSize configurable

2022-12-06 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41433:
-

 Summary: Make Max Arrow BatchSize configurable
 Key: SPARK-41433
 URL: https://issues.apache.org/jira/browse/SPARK-41433
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644144#comment-17644144
 ] 

Ruifeng Zheng commented on SPARK-41386:
---

[~dongz] I think this ticket is unrelated to Spark Connect?

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expect that file sizes should be at least 20m*0.5 = 10m. But in fact, 
> we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small 
> files in another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2022-12-06 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644143#comment-17644143
 ] 

Gengliang Wang commented on SPARK-41053:


[~beliefer]  [~yangjie01] [~panbingkun] If you are interested in this project, 
feel free to take some tasks from the list.

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server (SHS) became more scalable for 
> processing large applications by supporting a persistent 
> KV-store (LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> put memory pressure on the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * *Support storing all the UI data in a persistent KV store.* 
> RocksDB/LevelDB provide low memory overhead, and their write/read performance 
> is fast enough to serve the write/read workload of the live UI. The SHS can 
> also leverage the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to benchmarks. It will be the 
> default serializer for the persistent KV store of the live UI; for event 
> logs, it is optional. The current serializer for UI data is JSON, with GZip 
> compression applied when writing to the persistent KV-store. Since 
> RocksDB/LevelDB already support compression, the new serializer won't 
> compress the output before writing to the persistent KV store. Here is a 
> benchmark of writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>  
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total Size(MB)*|*Result total size in memory(MB)*|
> |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
> I am also proposing to support only RocksDB, rather than both LevelDB and 
> RocksDB, in the live UI.
> SPIP: 
> [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
> SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj
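As a rough illustration of the serializer proposal, here is a hedged sketch of 
a Protobuf-backed serializer on top of the existing kvstore extension point; 
`toProto`/`fromProto` stand in for generated Protobuf conversions and are 
assumptions, not the actual implementation:
{noformat}
import org.apache.spark.util.kvstore.KVStoreSerializer

// Hypothetical sketch: replace the default JSON+gzip round-trip with raw
// Protobuf bytes, leaving compression to RocksDB/LevelDB itself.
class UIDataProtobufSerializer extends KVStoreSerializer {
  override def serialize(o: AnyRef): Array[Byte] =
    toProto(o).toByteArray            // assumed helper: wrapper -> message

  override def deserialize[T](data: Array[Byte], klass: Class[T]): T =
    fromProto(data, klass)            // assumed helper: bytes -> wrapper
}
{noformat}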



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41432) Protobuf serializer for SparkPlanGraphWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41432:
--

 Summary: Protobuf serializer for SparkPlanGraphWrapper
 Key: SPARK-41432
 URL: https://issues.apache.org/jira/browse/SPARK-41432
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41431) Protobuf serializer for SQLExecutionUIData

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41431:
--

 Summary: Protobuf serializer for SQLExecutionUIData
 Key: SPARK-41431
 URL: https://issues.apache.org/jira/browse/SPARK-41431
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41430) Protobuf serializer for ProcessSummaryWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41430:
--

 Summary: Protobuf serializer for ProcessSummaryWrapper
 Key: SPARK-41430
 URL: https://issues.apache.org/jira/browse/SPARK-41430
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41429) Protobuf serializer for RDDOperationGraphWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41429:
--

 Summary: Protobuf serializer for RDDOperationGraphWrapper
 Key: SPARK-41429
 URL: https://issues.apache.org/jira/browse/SPARK-41429
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644141#comment-17644141
 ] 

Zhe Dong commented on SPARK-41386:
--

 
{noformat}
// Bail out unless some partition falls outside the acceptable size range.
if (mapStats.isEmpty ||
    mapStats.get.bytesByPartitionId.forall(size =>
      size <= advisorySize && size >= advisorySize * smallPartitionFactor)) {
  return shuffle
}

// Then, per reduce partition:
if (bytes > targetSize) {
  ...
} else if (bytes < targetSize * smallPartitionFactor) {
  CoalescedPartitionSpec(reduceIndex, reduceIndex + 1, bytes) :: Nil
} else {
  return shuffle // dummy
}
{noformat}
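To make the expectation concrete: with the settings quoted below, the smallest 
acceptable output partition is 20 MB * 0.5 = 10 MB. A minimal sketch of the 
setup that reproduces the report (table, column, and path are illustrative):
{noformat}
spark.conf.set("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
spark.conf.set("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5")

// REBALANCE(column) asks AQE to even out output partitions around the
// 20m advisory size; files under 10m are the ones this report flags.
spark.sql("SELECT /*+ REBALANCE(col1) */ * FROM src")
  .write.parquet("/tmp/out")
{noformat}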
 

 

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expect that file sizes should be at least 20m*0.5 = 10m. But in fact, 
> we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small 
> files in another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41427) Protobuf serializer for ExecutorStageSummaryWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41427:
--

 Summary: Protobuf serializer for ExecutorStageSummaryWrapper
 Key: SPARK-41427
 URL: https://issues.apache.org/jira/browse/SPARK-41427
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41428) Protobuf serializer for SpeculationStageSummaryWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41428:
--

 Summary: Protobuf serializer for SpeculationStageSummaryWrapper
 Key: SPARK-41428
 URL: https://issues.apache.org/jira/browse/SPARK-41428
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41426:
--

 Summary: Protobuf serializer for ResourceProfileWrapper
 Key: SPARK-41426
 URL: https://issues.apache.org/jira/browse/SPARK-41426
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41422:
--

 Summary: Protobuf serializer for ExecutorSummaryWrapper
 Key: SPARK-41422
 URL: https://issues.apache.org/jira/browse/SPARK-41422
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41425) Protobuf serializer for RDDStorageInfoWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41425:
--

 Summary: Protobuf serializer for RDDStorageInfoWrapper
 Key: SPARK-41425
 URL: https://issues.apache.org/jira/browse/SPARK-41425
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41424) Protobuf serializer for TaskDataWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41424:
--

 Summary: Protobuf serializer for TaskDataWrapper
 Key: SPARK-41424
 URL: https://issues.apache.org/jira/browse/SPARK-41424
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41423) Protobuf serializer for StageDataWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41423:
--

 Summary: Protobuf serializer for StageDataWrapper
 Key: SPARK-41423
 URL: https://issues.apache.org/jira/browse/SPARK-41423
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41420) Protobuf serializer for ApplicationInfoWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41420:
--

 Summary: Protobuf serializer for ApplicationInfoWrapper
 Key: SPARK-41420
 URL: https://issues.apache.org/jira/browse/SPARK-41420
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper

2022-12-06 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41421:
--

 Summary: Protobuf serializer for ApplicationEnvironmentInfoWrapper
 Key: SPARK-41421
 URL: https://issues.apache.org/jira/browse/SPARK-41421
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Dong updated SPARK-41386:
-
Description: 
*Problem (REBALANCE(column)):*

 SparkSession config:
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}
So we expect that file sizes should be at least 20m*0.5 = 10m.

But in fact, we got some small files like the following:
{noformat}
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
.../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
.../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small files 
in another way.

  was:
*Problem (REBALANCE(column)):*

 SparkSession config:
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}
So we expected file sizes to be at least 20m*0.5 = 10m.

But in fact, we got some small files like the following:
{noformat}
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
.../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
.../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small files 
in another way.


> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expect that file sizes should be at least 20m*0.5 = 10m. But in fact, 
> we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small 
> files in another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41369) Refactor connect directory structure

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644135#comment-17644135
 ] 

Apache Spark commented on SPARK-41369:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38957

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server" service and the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644131#comment-17644131
 ] 

Zhe Dong commented on SPARK-41386:
--

We may change this part to avoid producing files smaller than the threshold 
implied by "spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor":

[https://github.com/apache/spark/blob/d9c7908f348fa7771182dca49fa032f6d1b689be/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewInRebalancePartitions.scala#L75]
 

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expected file sizes to be at least 20m*0.5 = 10m. But in fact, we got 
> some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small 
> files in another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Dong updated SPARK-41386:
-
Description: 
*Problem (REBALANCE(column)):*

 SparkSession config:
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}
So we expected file sizes to be at least 20m*0.5 = 10m.

But in fact, we got some small files like the following:
{noformat}
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
.../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
.../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small files 
in another way.

  was:
*Problem (REBALANCE(column)):*

 SparkSession config:
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}
So we expected file sizes to be at least 20m*0.5 = 10m.

But in fact, we got some small files like the following:
{noformat}
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
.../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
.../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small files 
in another way.


> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expected file sizes to be at least 20m*0.5 = 10m. But in fact, we got 
> some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in 
> another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Dong updated SPARK-41386:
-
Description: 
*Problem (REBALANCE(column))*:

 SparkSession config:
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}
So we expected file sizes to be at least 20m * 0.5 = 10m.

But in fact, we got some small files like the following:
{noformat}
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
.../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
.../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
 

9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in 
another way.

  was:
*Problem (REBALANCE(column))*:

 SparkSession config:

 
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}
So we expected file sizes to be at least 20m * 0.5 = 10m.

But in fact, we got some small files like the following:

 
{noformat}
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
.../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
.../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
 

9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in 
another way.


> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column))*:
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expected file sizes to be at least 20m * 0.5 = 10m.
> But in fact, we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
>  
> 9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in 
> another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Dong updated SPARK-41386:
-
Description: 
*Problem (REBALANCE(column))*:

 SparkSession config:

 
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}
So we expected file sizes to be at least 20m * 0.5 = 10m.

But in fact, we got some small files like the following:

 
{noformat}
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
.../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
.../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
 

9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in 
another way.

  was:
*Problem (/*+ REBALANCE(bot_mid) */)*:

 SparkSession config:
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}


> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column))*:
>  SparkSession config:
>  
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expected file sizes to be at least 20m * 0.5 = 10m.
> But in fact, we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
>  
> 9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in 
> another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Dong updated SPARK-41386:
-
Description: 
*Problem (REBALANCE(column))*:

 SparkSession config:
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}
So we expected file sizes to be at least 20m * 0.5 = 10m.

But in fact, we got some small files like the following:
{noformat}
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
.../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
.../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in 
another way.

  was:
*Problem (REBALANCE(column))*:

 SparkSession config:
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}
So we expected file sizes to be at least 20m * 0.5 = 10m.

But in fact, we got some small files like the following:
{noformat}
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
.../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
.../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
.../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
 

9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in 
another way.


> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column))*:
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So we expected file sizes to be at least 20m * 0.5 = 10m.
> But in fact, we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in 
> another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Dong updated SPARK-41386:
-
Description: 
*Problem (/*+ REBALANCE(bot_mid) */)*:

 SparkSession config:
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
"0.5"){noformat}

  was:TODO:


> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (/*+ REBALANCE(bot_mid) */)*:
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true")
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41397) Implement part of string/binary functions

2022-12-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41397:
-

Assignee: Xinrong Meng

> Implement part of string/binary functions
> -
>
> Key: SPARK-41397
> URL: https://issues.apache.org/jira/browse/SPARK-41397
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41397) Implement part of string/binary functions

2022-12-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41397.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38921
[https://github.com/apache/spark/pull/38921]

> Implement part of string/binary functions
> -
>
> Key: SPARK-41397
> URL: https://issues.apache.org/jira/browse/SPARK-41397
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect

2022-12-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-41391:
--
Description: 

scala> val df = spark.range(1, 10).withColumn("value", lit(1))
df: org.apache.spark.sql.DataFrame = [id: bigint, value: int]

scala> df.createOrReplaceTempView("table")

scala> df.groupBy("id").agg(count_distinct($"value"))
res1: org.apache.spark.sql.DataFrame = [id: bigint, count(value): bigint]

scala> spark.sql(" SELECT id, COUNT(DISTINCT value) FROM table GROUP BY id ")
res2: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): 
bigint]

scala> df.groupBy("id").agg(count_distinct($"*"))
res3: org.apache.spark.sql.DataFrame = [id: bigint, count(unresolvedstar()): 
bigint]

scala> spark.sql(" SELECT id, COUNT(DISTINCT *) FROM table GROUP BY id ")
res4: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT id, value): 
bigint]
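
A workaround (not part of the report): an explicit alias gives a stable column 
name while the generated one is wrong, e.g.

scala> df.groupBy("id").agg(count_distinct($"value").alias("cnt"))
res5: org.apache.spark.sql.DataFrame = [id: bigint, cnt: bigint]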

> The output column name of `groupBy.agg(count_distinct)` is incorrect
> 
>
> Key: SPARK-41391
> URL: https://issues.apache.org/jira/browse/SPARK-41391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> scala> val df = spark.range(1, 10).withColumn("value", lit(1))
> df: org.apache.spark.sql.DataFrame = [id: bigint, value: int]
> scala> df.createOrReplaceTempView("table")
> scala> df.groupBy("id").agg(count_distinct($"value"))
> res1: org.apache.spark.sql.DataFrame = [id: bigint, count(value): bigint]
> scala> spark.sql(" SELECT id, COUNT(DISTINCT value) FROM table GROUP BY id ")
> res2: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): 
> bigint]
> scala> df.groupBy("id").agg(count_distinct($"*"))
> res3: org.apache.spark.sql.DataFrame = [id: bigint, count(unresolvedstar()): 
> bigint]
> scala> spark.sql(" SELECT id, COUNT(DISTINCT *) FROM table GROUP BY id ")
> res4: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT id, 
> value): bigint]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41382) implement `product` function

2022-12-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41382.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38915
[https://github.com/apache/spark/pull/38915]

> implement `product` function
> 
>
> Key: SPARK-41382
> URL: https://issues.apache.org/jira/browse/SPARK-41382
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41382) implement `product` function

2022-12-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41382:
-

Assignee: Ruifeng Zheng

> implement `product` function
> 
>
> Key: SPARK-41382
> URL: https://issues.apache.org/jira/browse/SPARK-41382
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41419) [K8S] Decrement PVC_COUNTER when the pod deletion happens

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41419:


Assignee: Apache Spark

> [K8S] Decrement PVC_COUNTER when the pod deletion happens 
> --
>
> Key: SPARK-41419
> URL: https://issues.apache.org/jira/browse/SPARK-41419
> Project: Spark
>  Issue Type: Task
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Major
>
> Commit cc55de3 introduced PVC_COUNTER to track the outstanding number of PVCs.
> PVC_COUNTER should only be decremented when the pod deletion happens (in 
> response to an error).
> If the PVC isn't created successfully (so PVC_COUNTER isn't incremented, 
> possibly because execution never reaches the resource(pvc).create() call), we 
> shouldn't decrement the counter.
> The variable `success` tracks the progress of PVC creation:
> value 0 means the PVC is not created.
> value 1 means the PVC has been created.
> value 2 means the PVC has been created but, due to a subsequent error, the 
> pod is deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41419) [K8S] Decrement PVC_COUNTER when the pod deletion happens

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41419:


Assignee: (was: Apache Spark)

> [K8S] Decrement PVC_COUNTER when the pod deletion happens 
> --
>
> Key: SPARK-41419
> URL: https://issues.apache.org/jira/browse/SPARK-41419
> Project: Spark
>  Issue Type: Task
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Ted Yu
>Priority: Major
>
> Commit cc55de3 introduced PVC_COUNTER to track the outstanding number of PVCs.
> PVC_COUNTER should only be decremented when the pod deletion happens (in 
> response to an error).
> If the PVC isn't created successfully (so PVC_COUNTER isn't incremented, 
> possibly because execution never reaches the resource(pvc).create() call), we 
> shouldn't decrement the counter.
> The variable `success` tracks the progress of PVC creation:
> value 0 means the PVC is not created.
> value 1 means the PVC has been created.
> value 2 means the PVC has been created but, due to a subsequent error, the 
> pod is deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41419) [K8S] Decrement PVC_COUNTER when the pod deletion happens

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644104#comment-17644104
 ] 

Apache Spark commented on SPARK-41419:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/38948

> [K8S] Decrement PVC_COUNTER when the pod deletion happens 
> --
>
> Key: SPARK-41419
> URL: https://issues.apache.org/jira/browse/SPARK-41419
> Project: Spark
>  Issue Type: Task
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Ted Yu
>Priority: Major
>
> Commit cc55de3 introduced PVC_COUNTER to track the outstanding number of PVCs.
> PVC_COUNTER should only be decremented when the pod deletion happens (in 
> response to an error).
> If the PVC isn't created successfully (so PVC_COUNTER isn't incremented, 
> possibly because execution never reaches the resource(pvc).create() call), we 
> shouldn't decrement the counter.
> The variable `success` tracks the progress of PVC creation:
> value 0 means the PVC is not created.
> value 1 means the PVC has been created.
> value 2 means the PVC has been created but, due to a subsequent error, the 
> pod is deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41419) [K8S] Decrement PVC_COUNTER when the pod deletion happens

2022-12-06 Thread Ted Yu (Jira)
Ted Yu created SPARK-41419:
--

 Summary: [K8S] Decrement PVC_COUNTER when the pod deletion happens 
 Key: SPARK-41419
 URL: https://issues.apache.org/jira/browse/SPARK-41419
 Project: Spark
  Issue Type: Task
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Ted Yu


Commit cc55de3 introduced PVC_COUNTER to track the outstanding number of PVCs.

PVC_COUNTER should only be decremented when the pod deletion happens (in 
response to an error).

If the PVC isn't created successfully (so PVC_COUNTER isn't incremented, 
possibly because execution never reaches the resource(pvc).create() call), we 
shouldn't decrement the counter.
The variable `success` tracks the progress of PVC creation:

value 0 means the PVC is not created.
value 1 means the PVC has been created.
value 2 means the PVC has been created but, due to a subsequent error, the 
pod is deleted.
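
To make the intended bookkeeping concrete, a minimal sketch of the three-state 
logic above (the helper names are hypothetical; this is not the actual Spark 
code):
{code:java}
// Sketch only: simplified three-state PVC bookkeeping, hypothetical helpers.
import java.util.concurrent.atomic.AtomicInteger

val PVC_COUNTER = new AtomicInteger(0)

def allocate(createPvc: () => Unit, createPod: () => Unit, deletePod: () => Unit): Unit = {
  var success = 0                      // 0: PVC not created
  try {
    createPvc()                        // i.e. the resource(pvc).create() call
    PVC_COUNTER.incrementAndGet()
    success = 1                        // 1: PVC created
    createPod()
  } catch {
    case e: Throwable =>
      if (success == 1) {
        deletePod()                    // pod deletion in response to the error
        success = 2                    // 2: PVC created, pod deleted
        PVC_COUNTER.decrementAndGet()  // decrement only where we incremented
      }
      throw e
  }
}
{code}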





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41418) Upgrade scala-maven-plugin from 4.7.2 to 4.8.0

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41418:


Assignee: Apache Spark

> Upgrade scala-maven-plugin from 4.7.2 to 4.8.0
> --
>
> Key: SPARK-41418
> URL: https://issues.apache.org/jira/browse/SPARK-41418
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41418) Upgrade scala-maven-plugin from 4.7.2 to 4.8.0

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41418:


Assignee: (was: Apache Spark)

> Upgrade scala-maven-plugin from 4.7.2 to 4.8.0
> --
>
> Key: SPARK-41418
> URL: https://issues.apache.org/jira/browse/SPARK-41418
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41418) Upgrade scala-maven-plugin from 4.7.2 to 4.8.0

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644102#comment-17644102
 ] 

Apache Spark commented on SPARK-41418:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38955

> Upgrade scala-maven-plugin from 4.7.2 to 4.8.0
> --
>
> Key: SPARK-41418
> URL: https://issues.apache.org/jira/browse/SPARK-41418
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41418) Upgrade scala-maven-plugin from 4.7.2 to 4.8.0

2022-12-06 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-41418:
---

 Summary: Upgrade scala-maven-plugin from 4.7.2 to 4.8.0
 Key: SPARK-41418
 URL: https://issues.apache.org/jira/browse/SPARK-41418
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41417) Assign a name to the error class _LEGACY_ERROR_TEMP_0019

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644101#comment-17644101
 ] 

Apache Spark commented on SPARK-41417:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38954

> Assign a name to the error class _LEGACY_ERROR_TEMP_0019
> 
>
> Key: SPARK-41417
> URL: https://issues.apache.org/jira/browse/SPARK-41417
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41417) Assign a name to the error class _LEGACY_ERROR_TEMP_0019

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41417:


Assignee: Apache Spark

> Assign a name to the error class _LEGACY_ERROR_TEMP_0019
> 
>
> Key: SPARK-41417
> URL: https://issues.apache.org/jira/browse/SPARK-41417
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41417) Assign a name to the error class _LEGACY_ERROR_TEMP_0019

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41417:


Assignee: (was: Apache Spark)

> Assign a name to the error class _LEGACY_ERROR_TEMP_0019
> 
>
> Key: SPARK-41417
> URL: https://issues.apache.org/jira/browse/SPARK-41417
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41369) Refactor connect directory structure

2022-12-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41369.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38953
[https://github.com/apache/spark/pull/38953]

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server" service and the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41369) Refactor connect directory structure

2022-12-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41369:


Assignee: Hyukjin Kwon

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server" service and the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41417) Assign a name to the error class _LEGACY_ERROR_TEMP_0019

2022-12-06 Thread Yang Jie (Jira)
Yang Jie created SPARK-41417:


 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_0019
 Key: SPARK-41417
 URL: https://issues.apache.org/jira/browse/SPARK-41417
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Affects Versions: 3.4.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41369) Refactor connect directory structure

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644094#comment-17644094
 ] 

Apache Spark commented on SPARK-41369:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/38953

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server" service and the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39865) Show proper error messages on the overflow errors of table insert

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644092#comment-17644092
 ] 

Apache Spark commented on SPARK-39865:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38952

> Show proper error messages on the overflow errors of table insert
> -
>
> Key: SPARK-39865
> URL: https://issues.apache.org/jira/browse/SPARK-39865
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.1
>
>
> In Spark 3.3, the error message of ANSI CAST was improved. However, table 
> insertion uses the same CAST expression:
> {code:java}
> > create table tiny(i tinyint);
> > insert into tiny values (1000);
> org.apache.spark.SparkArithmeticException[CAST_OVERFLOW]: The value 1000 of 
> the type "INT" cannot be cast to "TINYINT" due to an overflow. Use `try_cast` 
> to tolerate overflow and return NULL instead. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error.
> {code}
>  
> Showing the hint `If necessary set "spark.sql.ansi.enabled" to "false" to 
> bypass this error` doesn't help at all. This PR fixes the error message. 
> After the change, the error message for this example becomes:
> {code:java}
> org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] 
> Fail to insert a value of "INT" type into the "TINYINT" type column `i` due 
> to an overflow. Use `try_cast` on the input value to tolerate overflow and 
> return NULL instead.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41416) Rewrite self join in in predicate to aggregate

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41416:


Assignee: (was: Apache Spark)

> Rewrite self join in in predicate to aggregate
> --
>
> Key: SPARK-41416
> URL: https://issues.apache.org/jira/browse/SPARK-41416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Major
>
> Transform a self join that produces duplicate rows, and is used only for an 
> IN predicate, into an aggregation.
> For an IN predicate, duplicate rows do not add any value; they are pure 
> overhead.
> Ex: in TPC-DS Q95, the following CTE is used only in IN predicates, with 
> only one column compared ({@code ws_order_number}).
> This results in an exponential increase in joined rows, with many duplicates.
> {code:java}
> WITH ws_wh AS
> (
>SELECT ws1.ws_order_number,
>   ws1.ws_warehouse_sk wh1,
>   ws2.ws_warehouse_sk wh2
>FROM   web_sales ws1,
>   web_sales ws2
>WHERE  ws1.ws_order_number = ws2.ws_order_number
>ANDws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
> {code}
> Could be optimized as below:
> {code:java}
> WITH ws_wh AS
> (SELECT ws_order_number
>   FROM  web_sales
>   GROUP BY ws_order_number
>   HAVING COUNT(DISTINCT ws_warehouse_sk) > 1)
> {code}
> The optimized CTE scans the table only once and produces unique rows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41416) Rewrite self join in in predicate to aggregate

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41416:


Assignee: Apache Spark

> Rewrite self join in in predicate to aggregate
> --
>
> Key: SPARK-41416
> URL: https://issues.apache.org/jira/browse/SPARK-41416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Assignee: Apache Spark
>Priority: Major
>
> Transform a self join that produces duplicate rows, and is used only for an 
> IN predicate, into an aggregation.
> For an IN predicate, duplicate rows do not add any value; they are pure 
> overhead.
> Ex: in TPC-DS Q95, the following CTE is used only in IN predicates, with 
> only one column compared ({@code ws_order_number}).
> This results in an exponential increase in joined rows, with many duplicates.
> {code:java}
> WITH ws_wh AS
> (
>SELECT ws1.ws_order_number,
>   ws1.ws_warehouse_sk wh1,
>   ws2.ws_warehouse_sk wh2
>FROM   web_sales ws1,
>   web_sales ws2
>WHERE  ws1.ws_order_number = ws2.ws_order_number
>ANDws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
> {code}
> Could be optimized as below:
> {code:java}
> WITH ws_wh AS
> (SELECT ws_order_number
>   FROM  web_sales
>   GROUP BY ws_order_number
>   HAVING COUNT(DISTINCT ws_warehouse_sk) > 1)
> {code}
> The optimized CTE scans the table only once and produces unique rows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41416) Rewrite self join in in predicate to aggregate

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644091#comment-17644091
 ] 

Apache Spark commented on SPARK-41416:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/38951

> Rewrite self join in in predicate to aggregate
> --
>
> Key: SPARK-41416
> URL: https://issues.apache.org/jira/browse/SPARK-41416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Major
>
> Transform a self join that produces duplicate rows, and is used only for an 
> IN predicate, into an aggregation.
> For an IN predicate, duplicate rows do not add any value; they are pure 
> overhead.
> Ex: in TPC-DS Q95, the following CTE is used only in IN predicates, with 
> only one column compared ({@code ws_order_number}).
> This results in an exponential increase in joined rows, with many duplicates.
> {code:java}
> WITH ws_wh AS
> (
>SELECT ws1.ws_order_number,
>   ws1.ws_warehouse_sk wh1,
>   ws2.ws_warehouse_sk wh2
>FROM   web_sales ws1,
>   web_sales ws2
>WHERE  ws1.ws_order_number = ws2.ws_order_number
>ANDws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
> {code}
> Could be optimized as below:
> {code:java}
> WITH ws_wh AS
> (SELECT ws_order_number
>   FROM  web_sales
>   GROUP BY ws_order_number
>   HAVING COUNT(DISTINCT ws_warehouse_sk) > 1)
> {code}
> The optimized CTE scans the table only once and produces unique rows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41416) Rewrite self join in in predicate to aggregate

2022-12-06 Thread Wan Kun (Jira)
Wan Kun created SPARK-41416:
---

 Summary: Rewrite self join in in predicate to aggregate
 Key: SPARK-41416
 URL: https://issues.apache.org/jira/browse/SPARK-41416
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wan Kun


Transform a self join that produces duplicate rows, and is used only for an 
IN predicate, into an aggregation.
For an IN predicate, duplicate rows do not add any value; they are pure 
overhead.

Ex: in TPC-DS Q95, the following CTE is used only in IN predicates, with only 
one column compared ({@code ws_order_number}).
This results in an exponential increase in joined rows, with many duplicates.


{code:java}
WITH ws_wh AS
(
   SELECT ws1.ws_order_number,
  ws1.ws_warehouse_sk wh1,
  ws2.ws_warehouse_sk wh2
   FROM   web_sales ws1,
  web_sales ws2
   WHERE  ws1.ws_order_number = ws2.ws_order_number
   ANDws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
{code}



Could be optimized as below:


{code:java}
WITH ws_wh AS
(SELECT ws_order_number
  FROM  web_sales
  GROUP BY ws_order_number
  HAVING COUNT(DISTINCT ws_warehouse_sk) > 1)
{code}


The optimized CTE scans the table only once and produces unique rows: the 
predicate COUNT(DISTINCT ws_warehouse_sk) > 1 keeps exactly the order numbers 
shipped from more than one warehouse, which is the same set the original self 
join identifies (with duplicates).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41366) DF.groupby.agg() API should be compatible

2022-12-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-41366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-41366.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

> DF.groupby.agg() API should be compatible
> -
>
> Key: SPARK-41366
> URL: https://issues.apache.org/jira/browse/SPARK-41366
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41413) Storage-Partitioned Join should avoid shuffle when partition keys mismatch, but join expressions are compatible

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41413:


Assignee: (was: Apache Spark)

> Storage-Partitioned Join should avoid shuffle when partition keys mismatch, 
> but join expressions are compatible
> ---
>
> Key: SPARK-41413
> URL: https://issues.apache.org/jira/browse/SPARK-41413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Priority: Major
>
> Currently, when checking whether two sides of a Storage-Partitioned Join are 
> compatible, we require both the partition expressions and the partition keys 
> to be compatible. However, this condition could be relaxed so that we only 
> require the former. When the latter is not compatible, we can calculate a 
> common superset of keys, push that information down to both sides of the 
> join, and use empty partitions for the missing keys.
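
As a toy illustration of the "common superset of keys" idea (pure sketch, not 
Spark's actual partitioning machinery):
{code:java}
// Toy sketch: pad each side's partitions to the union of partition keys,
// using empty partitions for keys a side lacks, so both sides end up with
// identical partitioning and can be joined without a shuffle.
val leftParts  = Map(1 -> Seq("a", "b"), 2 -> Seq("c"))
val rightParts = Map(2 -> Seq("x"), 3 -> Seq("y"))

val allKeys = leftParts.keySet union rightParts.keySet  // superset: 1, 2, 3

val paddedLeft  = allKeys.toSeq.sorted.map(k => k -> leftParts.getOrElse(k, Seq.empty))
val paddedRight = allKeys.toSeq.sorted.map(k => k -> rightParts.getOrElse(k, Seq.empty))
{code}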



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41413) Storage-Partitioned Join should avoid shuffle when partition keys mismatch, but join expressions are compatible

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644088#comment-17644088
 ] 

Apache Spark commented on SPARK-41413:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/38950

> Storage-Partitioned Join should avoid shuffle when partition keys mismatch, 
> but join expressions are compatible
> ---
>
> Key: SPARK-41413
> URL: https://issues.apache.org/jira/browse/SPARK-41413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Priority: Major
>
> Currently, when checking whether two sides of a Storage-Partitioned Join are 
> compatible, we require both the partition expressions and the partition keys 
> to be compatible. However, this condition could be relaxed so that we only 
> require the former. When the latter is not compatible, we can calculate a 
> common superset of keys, push that information down to both sides of the 
> join, and use empty partitions for the missing keys.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41413) Storage-Partitioned Join should avoid shuffle when partition keys mismatch, but join expressions are compatible

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41413:


Assignee: Apache Spark

> Storage-Partitioned Join should avoid shuffle when partition keys mismatch, 
> but join expressions are compatible
> ---
>
> Key: SPARK-41413
> URL: https://issues.apache.org/jira/browse/SPARK-41413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Currently, when checking whether two sides of a Storage-Partitioned Join are 
> compatible, we require both the partition expressions and the partition keys 
> to be compatible. However, this condition could be relaxed so that we only 
> require the former. When the latter is not compatible, we can calculate a 
> common superset of keys, push that information down to both sides of the 
> join, and use empty partitions for the missing keys.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-06 Thread Zhen Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644087#comment-17644087
 ] 

Zhen Wang commented on SPARK-41344:
---

[~planga82]  Thanks for your reply. I have submitted a PR 
([https://github.com/apache/spark/pull/38871]); could you help me review it?

 

> Maybe the best solution is to have another function that does not catch those 
> exceptions to use in this case and does not return an option.

Does this mean we need to add a new method in CatalogV2Util?

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
> Attachments: image-2022-12-03-09-24-43-285.png
>
>
> In Spark 3.3, 
>  # In DataSourceV2Utils, loadV2Source calls 
> *CatalogV2Util.loadTable(catalog, ident, timeTravel).get* (with 
> Some(catalog) and Some(ident)).
>  # In CatalogV2Util.scala, when *loadTable(x,x,x)* fails with any of the 
> exceptions NoSuchTableException, NoSuchDatabaseException, or 
> NoSuchNamespaceException, it returns None.
>  # Back in DataSourceV2Utils, the returned None means None.get raises an 
> error that is technically "correct" but cryptic, and the *original 
> exceptions NoSuchTableException, NoSuchDatabaseException, and 
> NoSuchNamespaceException are thrown away.*
>  
> *Ask:*
> Retain the original error and propagate it to the user. Prior to Spark 3.3, 
> the *original error* was shown; the current behavior seems like a design flaw.
>  
> *Sample user facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and return None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]
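
To illustrate the masking (a simplified sketch of the reported flow, not the 
actual Spark sources):
{code:java}
// Sketch only: catching the lookup exception and calling .get later replaces
// the informative error with a bare "None.get".
class NoSuchTableException(msg: String) extends Exception(msg)
case class Table(name: String)

def loadTable(name: String): Option[Table] =
  try {
    // stand-in for catalog.asTableCatalog.loadTable(ident)
    throw new NoSuchTableException(s"Table or view '$name' not found")
  } catch {
    case _: NoSuchTableException => None  // original exception swallowed here
  }

// loadV2Source-style caller: this line throws
// java.util.NoSuchElementException: None.get, hiding the real cause.
val table = loadTable("missing_table").get
{code}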



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41410) Support PVC-oriented executor pod allocation

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644083#comment-17644083
 ] 

Apache Spark commented on SPARK-41410:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38949

> Support PVC-oriented executor pod allocation
> 
>
> Key: SPARK-41410
> URL: https://issues.apache.org/jira/browse/SPARK-41410
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41410) Support PVC-oriented executor pod allocation

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644082#comment-17644082
 ] 

Apache Spark commented on SPARK-41410:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38949

> Support PVC-oriented executor pod allocation
> 
>
> Key: SPARK-41410
> URL: https://issues.apache.org/jira/browse/SPARK-41410
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41415) SASL Request Retries

2022-12-06 Thread Aravind Patnam (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aravind Patnam updated SPARK-41415:
---
Summary: SASL Request Retries  (was: SASL Request Retry)

> SASL Request Retries
> 
>
> Key: SPARK-41415
> URL: https://issues.apache.org/jira/browse/SPARK-41415
> Project: Spark
>  Issue Type: Task
>  Components: Shuffle
>Affects Versions: 3.2.4
>Reporter: Aravind Patnam
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41415) SASL Request Retry

2022-12-06 Thread Aravind Patnam (Jira)
Aravind Patnam created SPARK-41415:
--

 Summary: SASL Request Retry
 Key: SPARK-41415
 URL: https://issues.apache.org/jira/browse/SPARK-41415
 Project: Spark
  Issue Type: Task
  Components: Shuffle
Affects Versions: 3.2.4
Reporter: Aravind Patnam






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41410) Support PVC-oriented executor pod allocation

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644079#comment-17644079
 ] 

Apache Spark commented on SPARK-41410:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/38948

> Support PVC-oriented executor pod allocation
> 
>
> Key: SPARK-41410
> URL: https://issues.apache.org/jira/browse/SPARK-41410
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41410) Support PVC-oriented executor pod allocation

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644080#comment-17644080
 ] 

Apache Spark commented on SPARK-41410:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/38948

> Support PVC-oriented executor pod allocation
> 
>
> Key: SPARK-41410
> URL: https://issues.apache.org/jira/browse/SPARK-41410
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41233) High-order function: array_prepend

2022-12-06 Thread Navin Viswanath (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644070#comment-17644070
 ] 

Navin Viswanath commented on SPARK-41233:
-

PR: [https://github.com/apache/spark/pull/38947]

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
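
For context, a minimal Scala sketch of the semantics the Snowflake reference
describes (plain collections here, not the eventual Spark expression; the name
arrayPrepend is only illustrative):

// Hedged sketch: per the Snowflake docs, array_prepend returns a new
// array with the given element inserted at position 0.
def arrayPrepend[A](arr: Seq[A], elem: A): Seq[A] = elem +: arr

// e.g. arrayPrepend(Seq(1, 2, 3), 0) == Seq(0, 1, 2, 3)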



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41231) Built-in SQL Function Improvement

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41231:


Assignee: Apache Spark

> Built-in SQL Function Improvement
> -
>
> Key: SPARK-41231
> URL: https://issues.apache.org/jira/browse/SPARK-41231
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41231) Built-in SQL Function Improvement

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644066#comment-17644066
 ] 

Apache Spark commented on SPARK-41231:
--

User 'navinvishy' has created a pull request for this issue:
https://github.com/apache/spark/pull/38947

> Built-in SQL Function Improvement
> -
>
> Key: SPARK-41231
> URL: https://issues.apache.org/jira/browse/SPARK-41231
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41231) Built-in SQL Function Improvement

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41231:


Assignee: (was: Apache Spark)

> Built-in SQL Function Improvement
> -
>
> Key: SPARK-41231
> URL: https://issues.apache.org/jira/browse/SPARK-41231
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41411) Multi-Stateful Operator watermark support bug fix

2022-12-06 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-41411.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38945
[https://github.com/apache/spark/pull/38945]

> Multi-Stateful Operator watermark support bug fix
> -
>
> Key: SPARK-41411
> URL: https://issues.apache.org/jira/browse/SPARK-41411
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
> Fix For: 3.4.0
>
>
> A typo in passing the event time watermark to `StreamingSymmetricHashJoinExec` 
> causes logic errors. With the bug, the query would run with no error 
> reported but produce incorrect results. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41411) Multi-Stateful Operator watermark support bug fix

2022-12-06 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-41411:


Assignee: Wei Liu

> Multi-Stateful Operator watermark support bug fix
> -
>
> Key: SPARK-41411
> URL: https://issues.apache.org/jira/browse/SPARK-41411
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
>
> A typo in passing the event time watermark to `StreamingSymmetricHashJoinExec` 
> causes logic errors. With the bug, the query would run with no error 
> reported but produce incorrect results. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-06 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644061#comment-17644061
 ] 

Pablo Langa Blanco commented on SPARK-41344:


In this case the provider has been detected as a DataSourceV2 source that also
implements SupportsCatalogOptions, so if the lookup fails at that point it does
not make sense to fall back to DataSource V1.

The CatalogV2Util.loadTable function catches NoSuchTableException,
NoSuchDatabaseException, and NoSuchNamespaceException and returns an Option,
which makes sense at its other call sites, but not here. Perhaps the best
solution is a second function for this call site that does not catch those
exceptions and does not return an Option, as sketched below.
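
A minimal sketch of such a variant, assuming the existing connector catalog
APIs (the name loadTableOrThrow is hypothetical and the imports are
approximate):

import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier, Table}
import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

// Hypothetical variant of CatalogV2Util.loadTable: same lookup, but with no
// try/catch, so NoSuchTableException, NoSuchDatabaseException, and
// NoSuchNamespaceException reach the caller intact.
def loadTableOrThrow(catalog: CatalogPlugin, ident: Identifier): Table =
  catalog.asTableCatalog.loadTable(ident)

The call site in DataSourceV2Utils would then use the returned Table directly
instead of calling .get on an Option.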

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
> Attachments: image-2022-12-03-09-24-43-285.png
>
>
> In Spark 3.3, 
>  # DataSourceV2Utils, the loadV2Source calls: 
> (CatalogV2Util.loadTable(catalog, ident, timeTravel).get, 
> Some(catalog), Some(ident)).
>  # CatalogV2Util.scala, when it tries to *loadTable(x,x,x)* and it fails with 
> any of these exceptions NoSuchTableException, NoSuchDatabaseException, 
> NoSuchNamespaceException, it would return None
>  # Coming back to DataSourceV2Utils, None was previously returned, and calling 
> None.get results in a cryptic error that is technically "correct", but the 
> *original exceptions NoSuchTableException, NoSuchDatabaseException, 
> NoSuchNamespaceException are thrown away.*
>  
> *Ask:*
> Retain the original error and propagate it to the user. Prior to Spark 3.3, 
> the *original error* was shown; the new behavior seems like a design flaw.
>  
> *Sample user facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and returning None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]
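
The failure mode is a plain Option pitfall; a self-contained sketch (no Spark
classes, all names illustrative) of why the original cause disappears:

object NoneGetMasking {
  // Mimics CatalogV2Util.loadTable: the real failure is swallowed and
  // collapsed to None.
  def lookup(name: String): Option[String] =
    try {
      throw new IllegalStateException(s"table '$name' not found in catalog")
    } catch {
      case _: IllegalStateException => None // original cause dropped here
    }

  def main(args: Array[String]): Unit = {
    // Mimics DataSourceV2Utils.loadV2Source: .get on the None throws
    // java.util.NoSuchElementException: None.get, so the user never sees
    // the "table not found" message.
    try lookup("t").get
    catch { case e: Exception => println(e) }
  }
}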



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41414) Implement date/timestamp functions

2022-12-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644049#comment-17644049
 ] 

Apache Spark commented on SPARK-41414:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38946

> Implement date/timestamp functions
> --
>
> Key: SPARK-41414
> URL: https://issues.apache.org/jira/browse/SPARK-41414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement date/timestamp functions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41414) Implement date/timestamp functions

2022-12-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41414:


Assignee: (was: Apache Spark)

> Implement date/timestamp functions
> --
>
> Key: SPARK-41414
> URL: https://issues.apache.org/jira/browse/SPARK-41414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement date/timestamp functions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


