[jira] [Commented] (SPARK-40852) Implement `DataFrame.summary`
[ https://issues.apache.org/jira/browse/SPARK-40852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644199#comment-17644199 ] Apache Spark commented on SPARK-40852: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/38962 > Implement `DataFrame.summary` > - > > Key: SPARK-40852 > URL: https://issues.apache.org/jira/browse/SPARK-40852 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41381) Implement count_distinct and sum_distinct functions
[ https://issues.apache.org/jira/browse/SPARK-41381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41381: - Assignee: Ruifeng Zheng > Implement count_distinct and sum_distinct functions > --- > > Key: SPARK-41381 > URL: https://issues.apache.org/jira/browse/SPARK-41381 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark > Affects Versions: 3.4.0 > Reporter: Ruifeng Zheng > Assignee: Ruifeng Zheng > Priority: Major
[jira] [Resolved] (SPARK-41381) Implement count_distinct and sum_distinct functions
[ https://issues.apache.org/jira/browse/SPARK-41381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41381. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38914 [https://github.com/apache/spark/pull/38914]
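The distinct-aggregate semantics this ticket implements can be sketched in plain Python. This is only an analogy for the intended behavior, not the Connect implementation:

```python
# count_distinct / sum_distinct aggregate over the distinct values only.
values = [1, 2, 2, 3, 3, 3]

count_distinct = len(set(values))  # number of distinct values
sum_distinct = sum(set(values))    # sum over distinct values: 1 + 2 + 3

print(count_distinct, sum_distinct)
```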
[jira] [Commented] (SPARK-41437) Do not optimize the input query twice for v1 write fallback
[ https://issues.apache.org/jira/browse/SPARK-41437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644185#comment-17644185 ] Apache Spark commented on SPARK-41437: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/38942 > Do not optimize the input query twice for v1 write fallback > --- > > Key: SPARK-41437 > URL: https://issues.apache.org/jira/browse/SPARK-41437 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.4.0 > Reporter: Wenchen Fan > Priority: Major
[jira] [Assigned] (SPARK-41437) Do not optimize the input query twice for v1 write fallback
[ https://issues.apache.org/jira/browse/SPARK-41437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41437: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-41437) Do not optimize the input query twice for v1 write fallback
[ https://issues.apache.org/jira/browse/SPARK-41437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41437: Assignee: Apache Spark
[jira] [Created] (SPARK-41437) Do not optimize the input query twice for v1 write fallback
Wenchen Fan created SPARK-41437: --- Summary: Do not optimize the input query twice for v1 write fallback Key: SPARK-41437 URL: https://issues.apache.org/jira/browse/SPARK-41437 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Wenchen Fan
[jira] [Updated] (SPARK-41283) Feature parity: Functions API in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-41283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-41283: -- Summary: Feature parity: Functions API in Spark Connect (was: Feature parity: functions API in Spark Connect) > Feature parity: Functions API in Spark Connect > -- > > Key: SPARK-41283 > URL: https://issues.apache.org/jira/browse/SPARK-41283 > Project: Spark > Issue Type: Umbrella > Components: Connect > Affects Versions: 3.4.0 > Reporter: Hyukjin Kwon > Assignee: Xinrong Meng > Priority: Critical > > Implement functions API.
[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644175#comment-17644175 ] Zhe Dong commented on SPARK-41386: -- Hi [~podongfeng], that was my mistake. I removed it. Sorry for that.

> There are some small files when using rebalance(column)
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Zhe Dong
> Priority: Minor
>
> *Problem (REBALANCE(column))*:
> SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5"){noformat}
> So we expect each file to be at least 20m * 0.5 = 10m. But in fact, we got some small files like the following:
> {noformat}
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in another way.
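The report's expectation can be sketched as plain arithmetic, assuming the floor is simply advisoryPartitionSizeInBytes multiplied by rebalancePartitionsSmallPartitionFactor (the ticket's reading of the configs, not a statement about Spark internals):

```python
# Minimum expected file size implied by the configs in the ticket.
MB = 1024 * 1024
advisory_size = 20 * MB        # spark.sql.adaptive.advisoryPartitionSizeInBytes
small_factor = 0.5             # ...rebalancePartitionsSmallPartitionFactor
min_expected = advisory_size * small_factor  # 10 MB floor

# File sizes observed in the report, in MB.
observed_mb = [12.1, 12.1, 12.1, 12.1, 9.1, 3.0]
small = [s for s in observed_mb if s * MB < min_expected]
print(small)  # files below the 10 MB floor
```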
[jira] [Assigned] (SPARK-41436) Implement `collection` functions: A~C
[ https://issues.apache.org/jira/browse/SPARK-41436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41436: Assignee: (was: Apache Spark) > Implement `collection` functions: A~C > - > > Key: SPARK-41436 > URL: https://issues.apache.org/jira/browse/SPARK-41436 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark > Affects Versions: 3.4.0 > Reporter: Ruifeng Zheng > Priority: Major
[jira] [Assigned] (SPARK-41436) Implement `collection` functions: A~C
[ https://issues.apache.org/jira/browse/SPARK-41436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41436: Assignee: Apache Spark
[jira] [Commented] (SPARK-41436) Implement `collection` functions: A~C
[ https://issues.apache.org/jira/browse/SPARK-41436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644174#comment-17644174 ] Apache Spark commented on SPARK-41436: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38961
[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Dong updated SPARK-41386: - Epic Link: (was: SPARK-39375)
[jira] [Created] (SPARK-41436) Implement `collection` functions: A~C
Ruifeng Zheng created SPARK-41436: - Summary: Implement `collection` functions: A~C Key: SPARK-41436 URL: https://issues.apache.org/jira/browse/SPARK-41436 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Commented] (SPARK-41435) Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null
[ https://issues.apache.org/jira/browse/SPARK-41435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644173#comment-17644173 ] Apache Spark commented on SPARK-41435: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38960 > Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null > - > > Key: SPARK-41435 > URL: https://issues.apache.org/jira/browse/SPARK-41435 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.4.0 > Reporter: Yang Jie > Priority: Minor
[jira] [Assigned] (SPARK-41435) Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null
[ https://issues.apache.org/jira/browse/SPARK-41435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41435: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-41435) Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null
[ https://issues.apache.org/jira/browse/SPARK-41435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41435: Assignee: Apache Spark
[jira] [Created] (SPARK-41435) Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null
Yang Jie created SPARK-41435: Summary: Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null Key: SPARK-41435 URL: https://issues.apache.org/jira/browse/SPARK-41435 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Yang Jie
[jira] [Commented] (SPARK-41298) Getting Count on data frame is giving the performance issue
[ https://issues.apache.org/jira/browse/SPARK-41298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644164#comment-17644164 ] Ramakrishna commented on SPARK-41298: - Can someone please check this behavior and update me as soon as possible?

> Getting Count on data frame is giving the performance issue
>
> Key: SPARK-41298
> URL: https://issues.apache.org/jira/browse/SPARK-41298
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: Ramakrishna
> Priority: Major
>
> We are running the query below against Teradata:
> 1) Dataframe df = spark.format("jdbc"). . . load();
> 2) int count = df.count();
> When we executed df.count(), Spark internally issued the query below against Teradata, which wastes a lot of CPU on Teradata, and the DBAs are raising concerns after seeing this query.
>
> Query: SELECT 1 FROM ()SPARK_SUB_TAB
> Response:
> 1
> 1
> 1
> ..
> 1
>
> Is this expected behavior from Spark, or is it a bug?
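The reporter's concern can be illustrated outside Spark. The sketch below uses sqlite3 as a stand-in for the JDBC source (the table name and data are made up) to contrast counting rows client-side over a `SELECT 1 FROM ...` result with pushing the aggregate down as `COUNT(*)`:

```python
import sqlite3

# In-memory database standing in for the remote JDBC source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

# What the ticket observes: one row per record is shipped back and counted.
shipped = conn.execute("SELECT 1 FROM t").fetchall()
client_side_count = len(shipped)

# The pushed-down form: the database returns a single row.
(pushed_count,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()

# Same answer, but the second form transfers one row instead of 1000.
assert client_side_count == pushed_count
```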
[jira] [Assigned] (SPARK-41415) SASL Request Retries
[ https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41415: Assignee: Apache Spark > SASL Request Retries > > > Key: SPARK-41415 > URL: https://issues.apache.org/jira/browse/SPARK-41415 > Project: Spark > Issue Type: Task > Components: Shuffle > Affects Versions: 3.2.4 > Reporter: Aravind Patnam > Assignee: Apache Spark > Priority: Major
[jira] [Assigned] (SPARK-41415) SASL Request Retries
[ https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41415: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-41415) SASL Request Retries
[ https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644152#comment-17644152 ] Apache Spark commented on SPARK-41415: -- User 'akpatnam25' has created a pull request for this issue: https://github.com/apache/spark/pull/38959
[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Dong updated SPARK-41386: - Affects Version/s: 3.3.1 (was: 3.4.0)
[jira] [Created] (SPARK-41434) Support LambdaFunction expression
Ruifeng Zheng created SPARK-41434: - Summary: Support LambdaFunction expression Key: SPARK-41434 URL: https://issues.apache.org/jira/browse/SPARK-41434 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Assigned] (SPARK-41433) Make Max Arrow BatchSize configurable
[ https://issues.apache.org/jira/browse/SPARK-41433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41433: Assignee: (was: Apache Spark) > Make Max Arrow BatchSize configurable > - > > Key: SPARK-41433 > URL: https://issues.apache.org/jira/browse/SPARK-41433 > Project: Spark > Issue Type: Sub-task > Components: Connect > Affects Versions: 3.4.0 > Reporter: Ruifeng Zheng > Priority: Major
[jira] [Commented] (SPARK-41433) Make Max Arrow BatchSize configurable
[ https://issues.apache.org/jira/browse/SPARK-41433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644145#comment-17644145 ] Apache Spark commented on SPARK-41433: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38958
[jira] [Assigned] (SPARK-41433) Make Max Arrow BatchSize configurable
[ https://issues.apache.org/jira/browse/SPARK-41433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41433: Assignee: Apache Spark
[jira] [Created] (SPARK-41433) Make Max Arrow BatchSize configurable
Ruifeng Zheng created SPARK-41433: - Summary: Make Max Arrow BatchSize configurable Key: SPARK-41433 URL: https://issues.apache.org/jira/browse/SPARK-41433 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644144#comment-17644144 ] Ruifeng Zheng commented on SPARK-41386: --- [~dongz] I think this ticket is irrelevant to Spark-Connect?
[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644143#comment-17644143 ] Gengliang Wang commented on SPARK-41053: [~beliefer] [~yangjie01] [~panbingkun] If you are interested in this project, feel free to take some tasks from the list.

> Better Spark UI scalability and Driver stability for large applications
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, Web UI
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Priority: Major
> Attachments: Better Spark UI scalability and Driver stability for large applications.pdf
>
> After SPARK-18085, the Spark history server (SHS) became more scalable for processing large applications by supporting a persistent KV store (LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can put memory pressure on the Spark driver for large applications.
> For better Spark UI scalability and driver stability, I propose to:
> * *Support storing all the UI data in a persistent KV store.* RocksDB/LevelDB provides low memory overhead. Their write/read performance is fast enough to serve the write/read workload for the live UI. SHS can leverage the persistent KV store to speed up its startup.
> * *Support a new Protobuf serializer for all the UI data.* The new serializer is supposed to be faster, according to benchmarks. It will be the default serializer for the persistent KV store of the live UI. For event logs, it is optional. The current serializer for UI data is JSON. When writing to the persistent KV store, there is GZip compression. Since there is compression support in RocksDB/LevelDB, the new serializer won't compress the output before writing to the persistent KV store.
> Here is a benchmark of writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total Size(MB)*|*Result total size in memory(MB)*|
> |*Spark's KV Serializer (JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
>
> I am also proposing to support RocksDB instead of both LevelDB & RocksDB in the live UI.
> SPIP: [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
> SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj
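A quick back-of-envelope check of the speedups implied by the benchmark numbers quoted above (values copied from the ticket; microseconds per operation over 100,000 SQLExecutionUIData):

```python
# Average write/read times in microseconds, from the ticket's table.
json_write, json_read = 352.2, 119.26  # Spark's KV Serializer (JSON+gzip)
pb_write, pb_read = 109.9, 34.3        # Protobuf

write_speedup = json_write / pb_write  # roughly 3.2x faster writes
read_speedup = json_read / pb_read     # roughly 3.5x faster reads
print(round(write_speedup, 1), round(read_speedup, 1))
```

Note the trade-off also visible in the table: Protobuf is faster on disk I/O but the deserialized result occupies more memory (2105 MB vs 868 MB).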
[jira] [Created] (SPARK-41432) Protobuf serializer for SparkPlanGraphWrapper
Gengliang Wang created SPARK-41432: -- Summary: Protobuf serializer for SparkPlanGraphWrapper Key: SPARK-41432 URL: https://issues.apache.org/jira/browse/SPARK-41432 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41431) Protobuf serializer for SQLExecutionUIData
Gengliang Wang created SPARK-41431: -- Summary: Protobuf serializer for SQLExecutionUIData Key: SPARK-41431 URL: https://issues.apache.org/jira/browse/SPARK-41431 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41430) Protobuf serializer for ProcessSummaryWrapper
Gengliang Wang created SPARK-41430: -- Summary: Protobuf serializer for ProcessSummaryWrapper Key: SPARK-41430 URL: https://issues.apache.org/jira/browse/SPARK-41430 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41429) Protobuf serializer for RDDOperationGraphWrapper
Gengliang Wang created SPARK-41429: -- Summary: Protobuf serializer for RDDOperationGraphWrapper Key: SPARK-41429 URL: https://issues.apache.org/jira/browse/SPARK-41429 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644141#comment-17644141 ] Zhe Dong commented on SPARK-41386: --
{noformat}
if (mapStats.isEmpty ||
    mapStats.get.bytesByPartitionId.forall(bytes =>
      bytes <= advisorySize && bytes >= advisorySize * smallPartitionFactor)) {
  return shuffle
}
if (bytes > targetSize) {
  ...
} else if (bytes < targetSize * smallPartitionFactor) {
  CoalescedPartitionSpec(reduceIndex, reduceIndex + 1, bytes) :: Nil
} else {
  return shuffle // dummy
}
{noformat}
> There are some small files when using rebalance(column) > --- > > Key: SPARK-41386 > URL: https://issues.apache.org/jira/browse/SPARK-41386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Zhe Dong >Priority: Minor > > *Problem (REBALANCE(column))*: > SparkSession config: > {noformat} > config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", > "true") > config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") > config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", > "0.5"){noformat} > so, we expect file sizes to be at least 20m*0.5 = 10m. 
> but in fact, we got some small files like the following: > {noformat} > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 > .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 > .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat} > 9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in > another way. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
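The arithmetic behind the reporter's expectation can be checked directly. A minimal plain-Python sketch, using the sizes from the directory listing above:

```python
# With advisoryPartitionSizeInBytes = 20m and
# rebalancePartitionsSmallPartitionFactor = 0.5, the expectation is that no
# output file is smaller than 20 * 0.5 = 10 MB.
advisory_mb = 20.0
small_partition_factor = 0.5
min_expected_mb = advisory_mb * small_partition_factor  # 10.0 MB

# Observed file sizes (MB) from the listing in the report.
observed_mb = [12.1, 12.1, 12.1, 12.1, 9.1, 3.0]
too_small = [s for s in observed_mb if s < min_expected_mb]
print(too_small)  # → [9.1, 3.0], the two files that break the expectation
```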
[jira] [Created] (SPARK-41427) Protobuf serializer for ExecutorStageSummaryWrapper
Gengliang Wang created SPARK-41427: -- Summary: Protobuf serializer for ExecutorStageSummaryWrapper Key: SPARK-41427 URL: https://issues.apache.org/jira/browse/SPARK-41427 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41428) Protobuf serializer for SpeculationStageSummaryWrapper
Gengliang Wang created SPARK-41428: -- Summary: Protobuf serializer for SpeculationStageSummaryWrapper Key: SPARK-41428 URL: https://issues.apache.org/jira/browse/SPARK-41428 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper
Gengliang Wang created SPARK-41426: -- Summary: Protobuf serializer for ResourceProfileWrapper Key: SPARK-41426 URL: https://issues.apache.org/jira/browse/SPARK-41426 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper
Gengliang Wang created SPARK-41422: -- Summary: Protobuf serializer for ExecutorSummaryWrapper Key: SPARK-41422 URL: https://issues.apache.org/jira/browse/SPARK-41422 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41425) Protobuf serializer for RDDStorageInfoWrapper
Gengliang Wang created SPARK-41425: -- Summary: Protobuf serializer for RDDStorageInfoWrapper Key: SPARK-41425 URL: https://issues.apache.org/jira/browse/SPARK-41425 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41424) Protobuf serializer for TaskDataWrapper
Gengliang Wang created SPARK-41424: -- Summary: Protobuf serializer for TaskDataWrapper Key: SPARK-41424 URL: https://issues.apache.org/jira/browse/SPARK-41424 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41423) Protobuf serializer for StageDataWrapper
Gengliang Wang created SPARK-41423: -- Summary: Protobuf serializer for StageDataWrapper Key: SPARK-41423 URL: https://issues.apache.org/jira/browse/SPARK-41423 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41420) Protobuf serializer for ApplicationInfoWrapper
Gengliang Wang created SPARK-41420: -- Summary: Protobuf serializer for ApplicationInfoWrapper Key: SPARK-41420 URL: https://issues.apache.org/jira/browse/SPARK-41420 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper
Gengliang Wang created SPARK-41421: -- Summary: Protobuf serializer for ApplicationEnvironmentInfoWrapper Key: SPARK-41421 URL: https://issues.apache.org/jira/browse/SPARK-41421 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Dong updated SPARK-41386: - Description: *Problem ( REBALANCE(column)* {*}){*}: SparkSession config: {noformat} config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5"){noformat} so, we except that files size should be bigger than 20m*0.5=10m at least. but in fact , we got some small files like the following: {noformat} -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat} 9.1 M and 3.0 M is smaller than 10M. we have to handle these small files in another way. was: *Problem ( REBALANCE(column)* {*}){*}: SparkSession config: {noformat} config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5"){noformat} so, we excepted files size are bigger than 20m*0.5=10m at least. 
but in fact , we got some small files like the following: {noformat} -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat} 9.1 M and 3.0 M is smaller than 10M. we have to handle these small files in another way. > There are some small files when using rebalance(column) > --- > > Key: SPARK-41386 > URL: https://issues.apache.org/jira/browse/SPARK-41386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Zhe Dong >Priority: Minor > > *Problem ( REBALANCE(column)* {*}){*}: > SparkSession config: > {noformat} > config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", > "true") > config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") > config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", > "0.5"){noformat} > so, we except that files size should be bigger than 20m*0.5=10m at least. 
> but in fact , we got some small files like the following: > {noformat} > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 > .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 > .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat} > 9.1 M and 3.0 M is smaller than 10M. we have to handle these small files in > another way. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644135#comment-17644135 ] Apache Spark commented on SPARK-41369: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38957 > Refactor connect directory structure > > > Key: SPARK-41369 > URL: https://issues.apache.org/jira/browse/SPARK-41369 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2, 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Currently, `spark/connector/connect/` is a single module that contains both > the "server"/service as well as the protobuf definitions. > However, this module can be split into multiple modules - "server" and > "common". This brings the advantage of separating out the protobuf generation > from the core "server" module for efficient reuse. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644131#comment-17644131 ] Zhe Dong commented on SPARK-41386: -- we may change this part to avoid producing files that are smaller than the threshold implied by "spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor": [https://github.com/apache/spark/blob/d9c7908f348fa7771182dca49fa032f6d1b689be/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewInRebalancePartitions.scala#L75] > There are some small files when using rebalance(column) > --- > > Key: SPARK-41386 > URL: https://issues.apache.org/jira/browse/SPARK-41386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Zhe Dong >Priority: Minor > > *Problem (REBALANCE(column))*: > SparkSession config: > {noformat} > config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", > "true") > config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") > config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", > "0.5"){noformat} > so, we expected file sizes to be at least 20m*0.5 = 10m. 
> but in fact, we got some small files like the following: > {noformat} > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 > .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 > .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat} > 9.1 M and 3.0 M are smaller than 10 M; we have to handle these small files in > another way. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
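The rule the comment points at in OptimizeSkewInRebalancePartitions can be modeled in a few lines. This is a simplified sketch of the proposed behavior, not Spark's actual code: partitions above the target size are split, mid-sized ones are kept, and anything below target * smallPartitionFactor is flagged for merging rather than written out as its own tiny file.

```python
def classify_partition(size_bytes, target_bytes, small_partition_factor):
    """Simplified model of the proposed per-partition decision (not Spark code)."""
    if size_bytes > target_bytes:
        return "split"
    if size_bytes < target_bytes * small_partition_factor:
        return "merge-with-neighbor"  # avoid emitting a stand-alone small file
    return "keep"

MB = 1024 * 1024
target = 20 * MB  # advisoryPartitionSizeInBytes = 20m
sizes = [25 * MB, 12 * MB, 3 * MB]
print([classify_partition(s, target, 0.5) for s in sizes])
# → ['split', 'keep', 'merge-with-neighbor']
```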
[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Dong updated SPARK-41386: - Description: *Problem ( REBALANCE(column)* {*}){*}: SparkSession config: {noformat} config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5"){noformat} so, we excepted files size are bigger than 20m*0.5=10m at least. but in fact , we got some small files like the following: {noformat} -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat} 9.1 M and 3.0 M is smaller than 10M. we have to handle these small files in another way. was: *Problem ( REBALANCE(column)* {*}){*}: SparkSession config: {noformat} config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5"){noformat} so, we excepted files size are bigger than 20m*0.5=10m at least. 
but in fact , we got some small files like the following: {noformat} -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet -rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat} 9.1 M and 3.0 M is smaller than 10M. we have to handle these small files in another way. > There are some small files when using rebalance(column) > --- > > Key: SPARK-41386 > URL: https://issues.apache.org/jira/browse/SPARK-41386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Zhe Dong >Priority: Minor > > *Problem ( REBALANCE(column)* {*}){*}: > SparkSession config: > {noformat} > config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", > "true") config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") > config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", > "0.5"){noformat} > so, we excepted files size are bigger than 20m*0.5=10m at least. 
> But in fact, we got some small files like the following: > {noformat} > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 > .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 > .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet > -rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 > .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat} > 9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small files in > another way. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
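The arithmetic behind the report can be checked outside Spark: with an advisory partition size of 20m and a small-partition factor of 0.5, any output file under 10 MB is unexpected. A minimal Python sketch (the config names are real Spark settings; the size parser and the hard-coded file sizes are illustrative, taken from the listing above):

```python
# Illustrative check of the AQE rebalance threshold described in SPARK-41386:
# advisoryPartitionSizeInBytes * rebalancePartitionsSmallPartitionFactor is
# the minimum size a merged partition is expected to reach.

def parse_mb(size: str) -> float:
    """Parse a size string like '20m' into megabytes (illustrative helper)."""
    assert size.endswith("m"), "only 'm' suffix handled in this sketch"
    return float(size[:-1])

advisory_mb = parse_mb("20m")  # spark.sql.adaptive.advisoryPartitionSizeInBytes
small_factor = 0.5             # spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor
threshold_mb = advisory_mb * small_factor

# File sizes reported in the ticket, in MB:
observed = [12.1, 12.1, 12.1, 12.1, 9.1, 3.0]
too_small = [s for s in observed if s < threshold_mb]

print(threshold_mb)  # 10.0
print(too_small)     # [9.1, 3.0] -- the two files the reporter flags
```

The two trailing files fall below the 10 MB threshold, which is exactly the behavior the ticket questions.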
[jira] [Updated] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Dong updated SPARK-41386: - Description: *Problem (/*+ REBALANCE(bot_mid) */)*: SparkSession config: {noformat} config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5"){noformat} was:TODO: > There are some small files when using rebalance(column) > --- > > Key: SPARK-41386 > URL: https://issues.apache.org/jira/browse/SPARK-41386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Zhe Dong >Priority: Minor > > *Problem (/*+ REBALANCE(bot_mid) */)*: > SparkSession config: > {noformat} > config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", > "true") > config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") > config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", > "0.5"){noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41397) Implement part of string/binary functions
[ https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41397: - Assignee: Xinrong Meng > Implement part of string/binary functions > - > > Key: SPARK-41397 > URL: https://issues.apache.org/jira/browse/SPARK-41397 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41397) Implement part of string/binary functions
[ https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41397. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38921 [https://github.com/apache/spark/pull/38921] > Implement part of string/binary functions > - > > Key: SPARK-41397 > URL: https://issues.apache.org/jira/browse/SPARK-41397 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect
[ https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-41391: -- Description: scala> val df = spark.range(1, 10).withColumn("value", lit(1)) df: org.apache.spark.sql.DataFrame = [id: bigint, value: int] scala> df.createOrReplaceTempView("table") scala> df.groupBy("id").agg(count_distinct($"value")) res1: org.apache.spark.sql.DataFrame = [id: bigint, count(value): bigint] scala> spark.sql(" SELECT id, COUNT(DISTINCT value) FROM table GROUP BY id ") res2: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): bigint] scala> df.groupBy("id").agg(count_distinct($"*")) res3: org.apache.spark.sql.DataFrame = [id: bigint, count(unresolvedstar()): bigint] scala> spark.sql(" SELECT id, COUNT(DISTINCT *) FROM table GROUP BY id ") res4: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT id, value): bigint] > The output column name of `groupBy.agg(count_distinct)` is incorrect > > > Key: SPARK-41391 > URL: https://issues.apache.org/jira/browse/SPARK-41391 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > scala> val df = spark.range(1, 10).withColumn("value", lit(1)) > df: org.apache.spark.sql.DataFrame = [id: bigint, value: int] > scala> df.createOrReplaceTempView("table") > scala> df.groupBy("id").agg(count_distinct($"value")) > res1: org.apache.spark.sql.DataFrame = [id: bigint, count(value): bigint] > scala> spark.sql(" SELECT id, COUNT(DISTINCT value) FROM table GROUP BY id ") > res2: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): > bigint] > scala> df.groupBy("id").agg(count_distinct($"*")) > res3: org.apache.spark.sql.DataFrame = [id: bigint, count(unresolvedstar()): > bigint] > scala> spark.sql(" SELECT id, COUNT(DISTINCT *) FROM table GROUP BY id ") > res4: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT id, > value): bigint] -- This 
message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41382) implement `product` function
[ https://issues.apache.org/jira/browse/SPARK-41382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41382. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38915 [https://github.com/apache/spark/pull/38915] > implement `product` function > > > Key: SPARK-41382 > URL: https://issues.apache.org/jira/browse/SPARK-41382 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41382) implement `product` function
[ https://issues.apache.org/jira/browse/SPARK-41382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41382: - Assignee: Ruifeng Zheng > implement `product` function > > > Key: SPARK-41382 > URL: https://issues.apache.org/jira/browse/SPARK-41382 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41419) [K8S] Decrement PVC_COUNTER when the pod deletion happens
[ https://issues.apache.org/jira/browse/SPARK-41419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41419: Assignee: Apache Spark > [K8S] Decrement PVC_COUNTER when the pod deletion happens > -- > > Key: SPARK-41419 > URL: https://issues.apache.org/jira/browse/SPARK-41419 > Project: Spark > Issue Type: Task > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Ted Yu >Assignee: Apache Spark >Priority: Major > > Commit cc55de3 introduced PVC_COUNTER to track the outstanding number of PVCs. > PVC_COUNTER should only be decremented when pod deletion happens (in > response to an error). > If the PVC isn't created successfully (so PVC_COUNTER isn't incremented, > possibly because execution never reaches the resource(pvc).create() call), we > shouldn't decrement the counter. > The variable `success` tracks the progress of PVC creation: > value 0 means the PVC has not been created. > value 1 means the PVC has been created. > value 2 means the PVC has been created but, due to a subsequent error, the pod is > deleted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41419) [K8S] Decrement PVC_COUNTER when the pod deletion happens
[ https://issues.apache.org/jira/browse/SPARK-41419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41419: Assignee: (was: Apache Spark) > [K8S] Decrement PVC_COUNTER when the pod deletion happens > -- > > Key: SPARK-41419 > URL: https://issues.apache.org/jira/browse/SPARK-41419 > Project: Spark > Issue Type: Task > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Ted Yu >Priority: Major > > Commit cc55de3 introduced PVC_COUNTER to track the outstanding number of PVCs. > PVC_COUNTER should only be decremented when pod deletion happens (in > response to an error). > If the PVC isn't created successfully (so PVC_COUNTER isn't incremented, > possibly because execution never reaches the resource(pvc).create() call), we > shouldn't decrement the counter. > The variable `success` tracks the progress of PVC creation: > value 0 means the PVC has not been created. > value 1 means the PVC has been created. > value 2 means the PVC has been created but, due to a subsequent error, the pod is > deleted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41419) [K8S] Decrement PVC_COUNTER when the pod deletion happens
[ https://issues.apache.org/jira/browse/SPARK-41419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644104#comment-17644104 ] Apache Spark commented on SPARK-41419: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/38948 > [K8S] Decrement PVC_COUNTER when the pod deletion happens > -- > > Key: SPARK-41419 > URL: https://issues.apache.org/jira/browse/SPARK-41419 > Project: Spark > Issue Type: Task > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Ted Yu >Priority: Major > > Commit cc55de3 introduced PVC_COUNTER to track the outstanding number of PVCs. > PVC_COUNTER should only be decremented when pod deletion happens (in > response to an error). > If the PVC isn't created successfully (so PVC_COUNTER isn't incremented, > possibly because execution never reaches the resource(pvc).create() call), we > shouldn't decrement the counter. > The variable `success` tracks the progress of PVC creation: > value 0 means the PVC has not been created. > value 1 means the PVC has been created. > value 2 means the PVC has been created but, due to a subsequent error, the pod is > deleted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41419) [K8S] Decrement PVC_COUNTER when the pod deletion happens
Ted Yu created SPARK-41419: -- Summary: [K8S] Decrement PVC_COUNTER when the pod deletion happens Key: SPARK-41419 URL: https://issues.apache.org/jira/browse/SPARK-41419 Project: Spark Issue Type: Task Components: Kubernetes Affects Versions: 3.4.0 Reporter: Ted Yu Commit cc55de3 introduced PVC_COUNTER to track the outstanding number of PVCs. PVC_COUNTER should only be decremented when pod deletion happens (in response to an error). If the PVC isn't created successfully (so PVC_COUNTER isn't incremented, possibly because execution never reaches the resource(pvc).create() call), we shouldn't decrement the counter. The variable `success` tracks the progress of PVC creation: value 0 means the PVC has not been created. value 1 means the PVC has been created. value 2 means the PVC has been created but, due to a subsequent error, the pod is deleted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
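The counter discipline described in the ticket (decrement only when a pod whose PVC was actually created gets deleted) can be modeled outside Spark. A hypothetical Python sketch; the class, constants, and method names are illustrative, not the actual Spark Kubernetes code:

```python
# Hypothetical model of the PVC bookkeeping described in SPARK-41419.
# PVC_COUNTER counts outstanding PVCs; `success` records how far PVC
# creation progressed: 0 = not created, 1 = created, 2 = created but the
# pod was deleted after a subsequent error.

PVC_NOT_CREATED, PVC_CREATED, POD_DELETED = 0, 1, 2

class PvcTracker:
    def __init__(self) -> None:
        self.pvc_counter = 0

    def create_pvc(self) -> int:
        # Increment only when creation actually happens, i.e. when
        # execution reaches the (hypothetical) resource(pvc).create() call.
        self.pvc_counter += 1
        return PVC_CREATED

    def on_pod_deleted(self, success: int) -> None:
        # Decrement only if a PVC was actually created; if creation never
        # happened there is nothing to undo.
        if success >= PVC_CREATED:
            self.pvc_counter -= 1

tracker = PvcTracker()

# Pod whose PVC creation never happened: the counter must stay untouched.
tracker.on_pod_deleted(PVC_NOT_CREATED)

# Pod whose PVC was created, then deleted in response to an error.
tracker.create_pvc()
tracker.on_pod_deleted(POD_DELETED)

print(tracker.pvc_counter)  # 0 -- no outstanding PVCs leak either way
```

The first call shows the bug the ticket guards against: decrementing for a pod that never got a PVC would drive the counter negative.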
[jira] [Assigned] (SPARK-41418) Upgrade scala-maven-plugin from 4.7.2 to 4.8.0
[ https://issues.apache.org/jira/browse/SPARK-41418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41418: Assignee: Apache Spark > Upgrade scala-maven-plugin from 4.7.2 to 4.8.0 > -- > > Key: SPARK-41418 > URL: https://issues.apache.org/jira/browse/SPARK-41418 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41418) Upgrade scala-maven-plugin from 4.7.2 to 4.8.0
[ https://issues.apache.org/jira/browse/SPARK-41418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41418: Assignee: (was: Apache Spark) > Upgrade scala-maven-plugin from 4.7.2 to 4.8.0 > -- > > Key: SPARK-41418 > URL: https://issues.apache.org/jira/browse/SPARK-41418 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41418) Upgrade scala-maven-plugin from 4.7.2 to 4.8.0
[ https://issues.apache.org/jira/browse/SPARK-41418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644102#comment-17644102 ] Apache Spark commented on SPARK-41418: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/38955 > Upgrade scala-maven-plugin from 4.7.2 to 4.8.0 > -- > > Key: SPARK-41418 > URL: https://issues.apache.org/jira/browse/SPARK-41418 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41418) Upgrade scala-maven-plugin from 4.7.2 to 4.8.0
BingKun Pan created SPARK-41418: --- Summary: Upgrade scala-maven-plugin from 4.7.2 to 4.8.0 Key: SPARK-41418 URL: https://issues.apache.org/jira/browse/SPARK-41418 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41417) Assign a name to the error class _LEGACY_ERROR_TEMP_0019
[ https://issues.apache.org/jira/browse/SPARK-41417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644101#comment-17644101 ] Apache Spark commented on SPARK-41417: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38954 > Assign a name to the error class _LEGACY_ERROR_TEMP_0019 > > > Key: SPARK-41417 > URL: https://issues.apache.org/jira/browse/SPARK-41417 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41417) Assign a name to the error class _LEGACY_ERROR_TEMP_0019
[ https://issues.apache.org/jira/browse/SPARK-41417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41417: Assignee: Apache Spark > Assign a name to the error class _LEGACY_ERROR_TEMP_0019 > > > Key: SPARK-41417 > URL: https://issues.apache.org/jira/browse/SPARK-41417 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41417) Assign a name to the error class _LEGACY_ERROR_TEMP_0019
[ https://issues.apache.org/jira/browse/SPARK-41417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41417: Assignee: (was: Apache Spark) > Assign a name to the error class _LEGACY_ERROR_TEMP_0019 > > > Key: SPARK-41417 > URL: https://issues.apache.org/jira/browse/SPARK-41417 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41369. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38953 [https://github.com/apache/spark/pull/38953] > Refactor connect directory structure > > > Key: SPARK-41369 > URL: https://issues.apache.org/jira/browse/SPARK-41369 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2, 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Currently, `spark/connector/connect/` is a single module that contains both > the "server"/service as well as the protobuf definitions. > However, this module can be split into multiple modules - "server" and > "common". This brings the advantage of separating out the protobuf generation > from the core "server" module for efficient reuse. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41369: Assignee: Hyukjin Kwon > Refactor connect directory structure > > > Key: SPARK-41369 > URL: https://issues.apache.org/jira/browse/SPARK-41369 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2, 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Hyukjin Kwon >Priority: Major > > Currently, `spark/connector/connect/` is a single module that contains both > the "server"/service as well as the protobuf definitions. > However, this module can be split into multiple modules - "server" and > "common". This brings the advantage of separating out the protobuf generation > from the core "server" module for efficient reuse. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41417) Assign a name to the error class _LEGACY_ERROR_TEMP_0019
Yang Jie created SPARK-41417: Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_0019 Key: SPARK-41417 URL: https://issues.apache.org/jira/browse/SPARK-41417 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Affects Versions: 3.4.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644094#comment-17644094 ] Apache Spark commented on SPARK-41369: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/38953 > Refactor connect directory structure > > > Key: SPARK-41369 > URL: https://issues.apache.org/jira/browse/SPARK-41369 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2, 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > Currently, `spark/connector/connect/` is a single module that contains both > the "server"/service as well as the protobuf definitions. > However, this module can be split into multiple modules - "server" and > "common". This brings the advantage of separating out the protobuf generation > from the core "server" module for efficient reuse. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39865) Show proper error messages on the overflow errors of table insert
[ https://issues.apache.org/jira/browse/SPARK-39865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644092#comment-17644092 ] Apache Spark commented on SPARK-39865: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/38952 > Show proper error messages on the overflow errors of table insert > - > > Key: SPARK-39865 > URL: https://issues.apache.org/jira/browse/SPARK-39865 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.1 > > > In Spark 3.3, the error message of ANSI CAST is improved. However, the table > insertion is using the same CAST expression: > {code:java} > > create table tiny(i tinyint); > > insert into tiny values (1000); > org.apache.spark.SparkArithmeticException[CAST_OVERFLOW]: The value 1000 of > the type "INT" cannot be cast to "TINYINT" due to an overflow. Use `try_cast` > to tolerate overflow and return NULL instead. If necessary set > "spark.sql.ansi.enabled" to "false" to bypass this error. > {code} > > Showing the hint of `If necessary set "spark.sql.ansi.enabled" to "false" to > bypass this error` doesn't help at all. This PR is to fix the error message. > After changes, the error message of this example will become: > {code:java} > org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] > Fail to insert a value of "INT" type into the "TINYINT" type column `i` due > to an overflow. Use `try_cast` on the input value to tolerate overflow and > return NULL instead.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
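The improved behavior amounts to a range check at insert time that names the column and the target type instead of suggesting an irrelevant config. A rough Python sketch of that check, assuming only the TINYINT range [-128, 127]; the function and its error text are illustrative, not Spark's implementation:

```python
# Illustrative sketch of the overflow check behind the improved
# CAST_OVERFLOW_IN_TABLE_INSERT message: inserting an INT value into a
# TINYINT column must fail when the value is outside [-128, 127].

TINYINT_MIN, TINYINT_MAX = -128, 127

def insert_into_tinyint(column: str, value: int) -> int:
    """Hypothetical helper: validate an INT value bound for a TINYINT column."""
    if not TINYINT_MIN <= value <= TINYINT_MAX:
        raise OverflowError(
            f'[CAST_OVERFLOW_IN_TABLE_INSERT] Fail to insert a value of '
            f'"INT" type into the "TINYINT" type column `{column}` due to '
            f'an overflow. Use `try_cast` on the input value to tolerate '
            f'overflow and return NULL instead.'
        )
    return value

print(insert_into_tinyint("i", 100))  # 100 -- fits, insert proceeds
try:
    insert_into_tinyint("i", 1000)    # the example from the ticket
except OverflowError as e:
    print(e)
```

The point of the change is that the error names the column (`i`) and suggests `try_cast`, rather than pointing users at `spark.sql.ansi.enabled`.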
[jira] [Assigned] (SPARK-41416) Rewrite self join in in predicate to aggregate
[ https://issues.apache.org/jira/browse/SPARK-41416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41416: Assignee: (was: Apache Spark) > Rewrite self join in in predicate to aggregate > -- > > Key: SPARK-41416 > URL: https://issues.apache.org/jira/browse/SPARK-41416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Major > > Transform a self join that produces duplicate rows and feeds an IN predicate > into an aggregation. > For an IN predicate, duplicate rows add no value; they are pure overhead. > Example: in TPCDS Q95, the following CTE is used only in IN predicates > comparing a single column ({{ws_order_number}}). > This results in an exponential increase in joined rows, with many duplicates. > {code:java} > WITH ws_wh AS > ( > SELECT ws1.ws_order_number, > ws1.ws_warehouse_sk wh1, > ws2.ws_warehouse_sk wh2 > FROM web_sales ws1, > web_sales ws2 > WHERE ws1.ws_order_number = ws2.ws_order_number > AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk) > {code} > Could be optimized as below: > {code:java} > WITH ws_wh AS > (SELECT ws_order_number > FROM web_sales > GROUP BY ws_order_number > HAVING COUNT(DISTINCT ws_warehouse_sk) > 1) > {code} > The optimized CTE scans the table only once and produces unique rows. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41416) Rewrite self join in in predicate to aggregate
[ https://issues.apache.org/jira/browse/SPARK-41416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41416: Assignee: Apache Spark > Rewrite self join in in predicate to aggregate > -- > > Key: SPARK-41416 > URL: https://issues.apache.org/jira/browse/SPARK-41416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Major > > Transform a self join that produces duplicate rows and feeds an IN predicate > into an aggregation. > For an IN predicate, duplicate rows add no value; they are pure overhead. > Example: in TPCDS Q95, the following CTE is used only in IN predicates > comparing a single column ({{ws_order_number}}). > This results in an exponential increase in joined rows, with many duplicates. > {code:java} > WITH ws_wh AS > ( > SELECT ws1.ws_order_number, > ws1.ws_warehouse_sk wh1, > ws2.ws_warehouse_sk wh2 > FROM web_sales ws1, > web_sales ws2 > WHERE ws1.ws_order_number = ws2.ws_order_number > AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk) > {code} > Could be optimized as below: > {code:java} > WITH ws_wh AS > (SELECT ws_order_number > FROM web_sales > GROUP BY ws_order_number > HAVING COUNT(DISTINCT ws_warehouse_sk) > 1) > {code} > The optimized CTE scans the table only once and produces unique rows. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41416) Rewrite self join in in predicate to aggregate
[ https://issues.apache.org/jira/browse/SPARK-41416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644091#comment-17644091 ] Apache Spark commented on SPARK-41416: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38951 > Rewrite self join in in predicate to aggregate > -- > > Key: SPARK-41416 > URL: https://issues.apache.org/jira/browse/SPARK-41416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Major > > Transform a self join that produces duplicate rows and feeds an IN predicate > into an aggregation. > For an IN predicate, duplicate rows add no value; they are pure overhead. > Example: in TPCDS Q95, the following CTE is used only in IN predicates > comparing a single column ({{ws_order_number}}). > This results in an exponential increase in joined rows, with many duplicates. > {code:java} > WITH ws_wh AS > ( > SELECT ws1.ws_order_number, > ws1.ws_warehouse_sk wh1, > ws2.ws_warehouse_sk wh2 > FROM web_sales ws1, > web_sales ws2 > WHERE ws1.ws_order_number = ws2.ws_order_number > AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk) > {code} > Could be optimized as below: > {code:java} > WITH ws_wh AS > (SELECT ws_order_number > FROM web_sales > GROUP BY ws_order_number > HAVING COUNT(DISTINCT ws_warehouse_sk) > 1) > {code} > The optimized CTE scans the table only once and produces unique rows. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41416) Rewrite self join in in predicate to aggregate
Wan Kun created SPARK-41416: --- Summary: Rewrite self join in in predicate to aggregate Key: SPARK-41416 URL: https://issues.apache.org/jira/browse/SPARK-41416 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Wan Kun Transform a self join that produces duplicate rows and feeds an IN predicate into an aggregation. For an IN predicate, duplicate rows add no value; they are pure overhead. Example: in TPCDS Q95, the following CTE is used only in IN predicates comparing a single column ({{ws_order_number}}). This results in an exponential increase in joined rows, with many duplicates. {code:java} WITH ws_wh AS ( SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2 FROM web_sales ws1, web_sales ws2 WHERE ws1.ws_order_number = ws2.ws_order_number AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk) {code} Could be optimized as below: {code:java} WITH ws_wh AS (SELECT ws_order_number FROM web_sales GROUP BY ws_order_number HAVING COUNT(DISTINCT ws_warehouse_sk) > 1) {code} The optimized CTE scans the table only once and produces unique rows. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
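The two SQL forms above select the same set of order numbers: orders that appear with more than one distinct warehouse. A toy plain-Python demonstration of that equivalence, using a hypothetical five-row web_sales table:

```python
# Toy check that the self-join form and the GROUP BY / HAVING
# COUNT(DISTINCT ...) > 1 form from SPARK-41416 identify the same
# ws_order_number values. The data is made up for illustration.

from collections import defaultdict
from itertools import product

# (ws_order_number, ws_warehouse_sk) rows of a tiny web_sales table.
web_sales = [(1, 10), (1, 11), (2, 10), (2, 10), (3, 12)]

# Self-join form: orders paired with themselves under a different warehouse.
self_join = {
    o1
    for (o1, w1), (o2, w2) in product(web_sales, repeat=2)
    if o1 == o2 and w1 != w2
}

# Aggregate form: orders with more than one distinct warehouse.
warehouses = defaultdict(set)
for order, wh in web_sales:
    warehouses[order].add(wh)
aggregate = {o for o, ws in warehouses.items() if len(ws) > 1}

print(sorted(self_join))        # [1] -- only order 1 spans two warehouses
print(self_join == aggregate)   # True
```

Order 2 appears twice but only in warehouse 10, so neither form selects it; the aggregate form reaches the answer with a single scan and no duplicate rows.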
[jira] [Resolved] (SPARK-41366) DF.groupby.agg() API should be compatible
[ https://issues.apache.org/jira/browse/SPARK-41366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-41366. --- Fix Version/s: 3.4.0 Resolution: Fixed > DF.groupby.agg() API should be compatible > - > > Key: SPARK-41366 > URL: https://issues.apache.org/jira/browse/SPARK-41366 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41413) Storage-Partitioned Join should avoid shuffle when partition keys mismatch, but join expressions are compatible
[ https://issues.apache.org/jira/browse/SPARK-41413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41413: Assignee: (was: Apache Spark) > Storage-Partitioned Join should avoid shuffle when partition keys mismatch, > but join expressions are compatible > --- > > Key: SPARK-41413 > URL: https://issues.apache.org/jira/browse/SPARK-41413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Priority: Major > > Currently, when checking whether two sides of a Storage-Partitioned Join are > compatible, we require that both the partition expressions and the > partition keys are compatible. However, this condition could be relaxed so > that we only require the former. When the partition keys are not > compatible, we can calculate a common superset of keys, push that > information down to both sides of the join, and use empty partitions for the > missing keys. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41413) Storage-Partitioned Join should avoid shuffle when partition keys mismatch, but join expressions are compatible
[ https://issues.apache.org/jira/browse/SPARK-41413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644088#comment-17644088 ] Apache Spark commented on SPARK-41413: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/38950 > Storage-Partitioned Join should avoid shuffle when partition keys mismatch, > but join expressions are compatible > --- > > Key: SPARK-41413 > URL: https://issues.apache.org/jira/browse/SPARK-41413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Priority: Major > > Currently, when checking whether two sides of a Storage-Partitioned Join are > compatible, we require that both the partition expressions and the > partition keys be compatible. However, this condition could be relaxed so > that we only require the former. In the case that the latter is not > compatible, we can calculate a common superset of keys and push down the > information to both sides of the join, and use empty partitions for the > missing keys. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41413) Storage-Partitioned Join should avoid shuffle when partition keys mismatch, but join expressions are compatible
[ https://issues.apache.org/jira/browse/SPARK-41413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41413: Assignee: Apache Spark > Storage-Partitioned Join should avoid shuffle when partition keys mismatch, > but join expressions are compatible > --- > > Key: SPARK-41413 > URL: https://issues.apache.org/jira/browse/SPARK-41413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Apache Spark >Priority: Major > > Currently, when checking whether two sides of a Storage-Partitioned Join are > compatible, we require that both the partition expressions and the > partition keys be compatible. However, this condition could be relaxed so > that we only require the former. In the case that the latter is not > compatible, we can calculate a common superset of keys and push down the > information to both sides of the join, and use empty partitions for the > missing keys. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
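The relaxation described in the ticket above (compute a common superset of the partition keys, push it to both sides of the join, and fill in empty partitions for keys a side lacks) can be sketched outside Spark. This is a minimal, hypothetical model in Python, not the actual Catalyst implementation; the helper name and dict-based partition representation are illustrative only.

```python
def pad_partitions(left, right):
    """Pad both sides of a partitioned join so they share the same key set.

    `left` and `right` map a partition key to that partition's rows.
    Keys missing from one side get an empty partition, so the two sides
    become key-compatible without reshuffling existing partitions.
    """
    common_keys = set(left) | set(right)  # common superset of keys
    padded_left = {k: left.get(k, []) for k in common_keys}
    padded_right = {k: right.get(k, []) for k in common_keys}
    return padded_left, padded_right

# Partition keys mismatch: left has "a", right has "c".
left = {"a": [1, 2], "b": [3]}
right = {"b": [30], "c": [40]}
padded_left, padded_right = pad_partitions(left, right)
# Both sides now cover keys a, b, c; the added partitions are empty,
# so join results are unchanged but the sides are key-aligned.
```

The point of the sketch is only the compatibility condition: after padding, both sides partition over the identical key set, which is what lets the planner skip the shuffle.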
[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error
[ https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644087#comment-17644087 ] Zhen Wang commented on SPARK-41344: --- [~planga82] Thanks for your reply, I have submitted a PR [https://github.com/apache/spark/pull/38871], can you help me review it? > Maybe the best solution is to have another function that does not catch those > exceptions to use in this case and does not return an option. Does this mean we need to add a new method in CatalogV2Util? > Reading V2 datasource masks underlying error > > > Key: SPARK-41344 > URL: https://issues.apache.org/jira/browse/SPARK-41344 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.4.0 >Reporter: Kevin Cheung >Priority: Critical > Attachments: image-2022-12-03-09-24-43-285.png > > > In Spark 3.3, > # DataSourceV2Utils, the loadV2Source calls: > (CatalogV2Util.loadTable(catalog, ident, timeTravel).get, > Some(catalog), Some(ident)). > # CatalogV2Util.scala, when it tries to *loadTable(x,x,x)* and it fails with > any of these exceptions NoSuchTableException, NoSuchDatabaseException, > NoSuchNamespaceException, it would return None > # Coming back to DataSourceV2Utils, None was previously returned, and calling > None.get results in a cryptic error that is technically "correct", but the *original > exceptions NoSuchTableException, NoSuchDatabaseException, > NoSuchNamespaceException are thrown away.* > > *Ask:* > Retain the original error and propagate it to the user. Prior to Spark 3.3, > the *original error* was shown; this seems like a design flaw.
> > *Sample user facing error:* > None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209) > at scala.Option.flatMap(Option.scala:271) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171) > > *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get* > [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137] > *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))* > [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341] > *CatalogV2Util.scala - catching the exceptions and returning None* > [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
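The failure mode reported above (a lookup helper that catches the descriptive exceptions and returns an Option/None, plus a caller that unconditionally unwraps it) is easy to reproduce in miniature. The following Python sketch is illustrative only: the function names mimic the Scala ones from the ticket, but nothing here is Spark's actual code, and the "good" variant is just one way to realize the fix the reporter asks for (let the original exception propagate).

```python
class NoSuchTableError(Exception):
    """Stand-in for Spark's NoSuchTableException (hypothetical name)."""


def load_table(name, tables):
    """Mimics CatalogV2Util.loadTable: swallow the lookup error, return None."""
    try:
        return tables[name]
    except KeyError:
        # The informative error is discarded at this point...
        return None


def load_v2_source_masked(name, tables):
    # ...so unwrapping unconditionally fails later with an unrelated,
    # cryptic error -- analogous to Option.get raising
    # java.util.NoSuchElementException: None.get
    table = load_table(name, tables)
    return table.upper()  # on a miss: AttributeError on NoneType


def load_v2_source_preserving(name, tables):
    # The fix the ticket asks for: skip the None-returning path so the
    # original, descriptive exception reaches the user.
    if name not in tables:
        raise NoSuchTableError(f"Table or view not found: {name}")
    return tables[name].upper()
```

With `load_v2_source_masked`, a missing table surfaces as a generic `AttributeError` (the Python analogue of `None.get`); with `load_v2_source_preserving`, the user sees which table was not found.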
[jira] [Commented] (SPARK-41410) Support PVC-oriented executor pod allocation
[ https://issues.apache.org/jira/browse/SPARK-41410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644083#comment-17644083 ] Apache Spark commented on SPARK-41410: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/38949 > Support PVC-oriented executor pod allocation > > > Key: SPARK-41410 > URL: https://issues.apache.org/jira/browse/SPARK-41410 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41410) Support PVC-oriented executor pod allocation
[ https://issues.apache.org/jira/browse/SPARK-41410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644082#comment-17644082 ] Apache Spark commented on SPARK-41410: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/38949 > Support PVC-oriented executor pod allocation > > > Key: SPARK-41410 > URL: https://issues.apache.org/jira/browse/SPARK-41410 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41415) SASL Request Retries
[ https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aravind Patnam updated SPARK-41415: --- Summary: SASL Request Retries (was: SASL Request Retry) > SASL Request Retries > > > Key: SPARK-41415 > URL: https://issues.apache.org/jira/browse/SPARK-41415 > Project: Spark > Issue Type: Task > Components: Shuffle >Affects Versions: 3.2.4 >Reporter: Aravind Patnam >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41415) SASL Request Retry
Aravind Patnam created SPARK-41415: -- Summary: SASL Request Retry Key: SPARK-41415 URL: https://issues.apache.org/jira/browse/SPARK-41415 Project: Spark Issue Type: Task Components: Shuffle Affects Versions: 3.2.4 Reporter: Aravind Patnam -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41410) Support PVC-oriented executor pod allocation
[ https://issues.apache.org/jira/browse/SPARK-41410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644079#comment-17644079 ] Apache Spark commented on SPARK-41410: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/38948 > Support PVC-oriented executor pod allocation > > > Key: SPARK-41410 > URL: https://issues.apache.org/jira/browse/SPARK-41410 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41410) Support PVC-oriented executor pod allocation
[ https://issues.apache.org/jira/browse/SPARK-41410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644080#comment-17644080 ] Apache Spark commented on SPARK-41410: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/38948 > Support PVC-oriented executor pod allocation > > > Key: SPARK-41410 > URL: https://issues.apache.org/jira/browse/SPARK-41410 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644070#comment-17644070 ] Navin Viswanath commented on SPARK-41233: - PR : [https://github.com/apache/spark/pull/38947] > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
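For readers unfamiliar with the Snowflake function the ticket references: `array_prepend(array, elem)` returns a new array with the element inserted at position 0. A reference model of those semantics in Python follows; it is a sketch of the expected behavior, not the Catalyst implementation in the linked PR, and the NULL-propagation choice is an assumption based on how Spark's other array functions treat NULL arrays.

```python
def array_prepend(arr, elem):
    """Reference semantics for array_prepend: new array with elem first.

    Mirrors the Snowflake function the ticket links to. A NULL (None)
    array propagates as NULL (assumed, matching other array functions);
    the input array is not mutated.
    """
    if arr is None:
        return None
    return [elem] + list(arr)

array_prepend([2, 3, 4], 1)   # -> [1, 2, 3, 4]
array_prepend([], "x")        # -> ["x"]
array_prepend(None, 1)        # -> None (NULL input propagates)
```

Note the contrast with `array_append`, which already exists in Spark and inserts at the end; `array_prepend` is its mirror image.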
[jira] [Assigned] (SPARK-41231) Built-in SQL Function Improvement
[ https://issues.apache.org/jira/browse/SPARK-41231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41231: Assignee: Apache Spark > Built-in SQL Function Improvement > - > > Key: SPARK-41231 > URL: https://issues.apache.org/jira/browse/SPARK-41231 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41231) Built-in SQL Function Improvement
[ https://issues.apache.org/jira/browse/SPARK-41231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644066#comment-17644066 ] Apache Spark commented on SPARK-41231: -- User 'navinvishy' has created a pull request for this issue: https://github.com/apache/spark/pull/38947 > Built-in SQL Function Improvement > - > > Key: SPARK-41231 > URL: https://issues.apache.org/jira/browse/SPARK-41231 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41231) Built-in SQL Function Improvement
[ https://issues.apache.org/jira/browse/SPARK-41231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41231: Assignee: (was: Apache Spark) > Built-in SQL Function Improvement > - > > Key: SPARK-41231 > URL: https://issues.apache.org/jira/browse/SPARK-41231 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41411) Multi-Stateful Operator watermark support bug fix
[ https://issues.apache.org/jira/browse/SPARK-41411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-41411. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38945 [https://github.com/apache/spark/pull/38945] > Multi-Stateful Operator watermark support bug fix > - > > Key: SPARK-41411 > URL: https://issues.apache.org/jira/browse/SPARK-41411 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > Fix For: 3.4.0 > > > A typo in passing the event time watermark to `StreamingSymmetricHashJoinExec` > causes logic errors. With the bug, the query would run with no error > reported but produce incorrect results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41411) Multi-Stateful Operator watermark support bug fix
[ https://issues.apache.org/jira/browse/SPARK-41411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-41411: Assignee: Wei Liu > Multi-Stateful Operator watermark support bug fix > - > > Key: SPARK-41411 > URL: https://issues.apache.org/jira/browse/SPARK-41411 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > > A typo in passing the event time watermark to `StreamingSymmetricHashJoinExec` > causes logic errors. With the bug, the query would run with no error > reported but produce incorrect results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error
[ https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644061#comment-17644061 ] Pablo Langa Blanco commented on SPARK-41344: In this case the provider has been detected as DataSourceV2 and also implements SupportsCatalogOptions, so if it fails at that point, it does not make sense to try it as DataSource V1. The CatalogV2Util.loadTable function catches NoSuchTableException, NoSuchDatabaseException and NoSuchNamespaceException to return an option, which makes sense in other places where it is used, but not at this point. Maybe the best solution is to have another function that does not catch those exceptions to use in this case and does not return an option. > Reading V2 datasource masks underlying error > > > Key: SPARK-41344 > URL: https://issues.apache.org/jira/browse/SPARK-41344 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.4.0 >Reporter: Kevin Cheung >Priority: Critical > Attachments: image-2022-12-03-09-24-43-285.png > > > In Spark 3.3, > # DataSourceV2Utils, the loadV2Source calls: > (CatalogV2Util.loadTable(catalog, ident, timeTravel).get, > Some(catalog), Some(ident)). > # CatalogV2Util.scala, when it tries to *loadTable(x,x,x)* and it fails with > any of these exceptions NoSuchTableException, NoSuchDatabaseException, > NoSuchNamespaceException, it would return None > # Coming back to DataSourceV2Utils, None was previously returned, and calling > None.get results in a cryptic error that is technically "correct", but the *original > exceptions NoSuchTableException, NoSuchDatabaseException, > NoSuchNamespaceException are thrown away.* > > *Ask:* > Retain the original error and propagate it to the user. Prior to Spark 3.3, > the *original error* was shown; this seems like a design flaw.
> > *Sample user facing error:* > None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209) > at scala.Option.flatMap(Option.scala:271) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171) > > *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get* > [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137] > *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))* > [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341] > *CatalogV2Util.scala - catching the exceptions and returning None* > [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41414) Implement date/timestamp functions
[ https://issues.apache.org/jira/browse/SPARK-41414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644049#comment-17644049 ] Apache Spark commented on SPARK-41414: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/38946 > Implement date/timestamp functions > -- > > Key: SPARK-41414 > URL: https://issues.apache.org/jira/browse/SPARK-41414 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement date/timestamp functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41414) Implement date/timestamp functions
[ https://issues.apache.org/jira/browse/SPARK-41414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41414: Assignee: (was: Apache Spark) > Implement date/timestamp functions > -- > > Key: SPARK-41414 > URL: https://issues.apache.org/jira/browse/SPARK-41414 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement date/timestamp functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org