[jira] [Created] (SPARK-40388) SQL configuration spark.sql.mapKeyDedupPolicy not always applied
Paul Praet created SPARK-40388:

Summary: SQL configuration spark.sql.mapKeyDedupPolicy not always applied
Key: SPARK-40388
URL: https://issues.apache.org/jira/browse/SPARK-40388
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.2
Reporter: Paul Praet

I have set spark.sql.mapKeyDedupPolicy to LAST_WIN. However, I still got one failure:
{quote}Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 1201.0 failed 4 times, most recent failure: Lost task 7.3 in stage 1201.0 (TID 1011313) (ip-10-1-34-47.eu-west-1.compute.internal executor 228): java.lang.RuntimeException: Duplicate map key domain was found, please check the input data. If you want to remove the duplicated keys, you can set spark.sql.mapKeyDedupPolicy to LAST_WIN so that the key inserted at last takes precedence.
{quote}
We are confident we set the right configuration in SparkConf (we can see it on the Spark UI -> Environment). Our impression is that this configuration is not propagated reliably to the executors.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
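For context, `spark.sql.mapKeyDedupPolicy` accepts `EXCEPTION` (the default) or `LAST_WIN`. The two behaviours can be sketched in plain Python; the function name and structure below are illustrative, not Spark's actual implementation:

```python
def dedup_map_keys(pairs, policy="EXCEPTION"):
    """Build a map from key/value pairs, mimicking Spark's two dedup policies."""
    result = {}
    for key, value in pairs:
        if key in result and policy == "EXCEPTION":
            # Mirrors the runtime error quoted in the report above.
            raise ValueError(
                f"Duplicate map key {key} was found, please check the input data."
            )
        # Under LAST_WIN the key inserted last takes precedence.
        result[key] = value
    return result

print(dedup_map_keys([("domain", 1), ("domain", 2)], policy="LAST_WIN"))
```

If the config were only applied on the driver and not propagated to an executor, that executor would still take the `EXCEPTION` branch, which matches the intermittent failure described.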
[jira] [Assigned] (SPARK-40354) Support eliminate dynamic partition for v1 writes
[ https://issues.apache.org/jira/browse/SPARK-40354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40354:
Assignee: Apache Spark

Issue (SPARK-40354, "Support eliminate dynamic partition for v1 writes", SQL, Affects Versions: 3.4.0, Reporter: XiDuo You, Priority: Major):

v1 writes will add an extra sort for dynamic columns, e.g.
{code:java}
INSERT INTO TABLE t1 PARTITION(p)
SELECT c1, c2, 'a' as p FROM t2
{code}
If the dynamic columns are foldable, we can optimize it to:
{code:java}
INSERT INTO TABLE t1 PARTITION(p='a')
SELECT c1, c2 FROM t2
{code}
[jira] [Assigned] (SPARK-40354) Support eliminate dynamic partition for v1 writes
[ https://issues.apache.org/jira/browse/SPARK-40354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40354:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-40354) Support eliminate dynamic partition for v1 writes
[ https://issues.apache.org/jira/browse/SPARK-40354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601673#comment-17601673 ]

Apache Spark commented on SPARK-40354:
User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/37831
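The optimization described in this issue amounts to folding a constant-valued dynamic partition column into a static partition spec, which removes that column from the write's sort. A hypothetical Python sketch of the rewrite (not the actual rule in the PR; the data representation here is invented for illustration):

```python
def eliminate_dynamic_partition(partition_spec, select_cols):
    """Fold foldable (constant) dynamic partition columns into static ones.

    partition_spec: dict of partition column -> None (dynamic) or literal (static)
    select_cols: list of (name, literal_or_None) output columns of the SELECT
    Returns the rewritten partition spec and the remaining SELECT columns.
    """
    literals = {name: lit for name, lit in select_cols if lit is not None}
    new_spec = {}
    remaining = [name for name, _ in select_cols]
    for col, value in partition_spec.items():
        if value is None and col in literals:
            # The dynamic column is a foldable literal: make it static and
            # drop it from the SELECT list, as in PARTITION(p) -> PARTITION(p='a').
            new_spec[col] = literals[col]
            remaining.remove(col)
        else:
            new_spec[col] = value
    return new_spec, remaining
```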
[jira] [Assigned] (SPARK-40387) Improve the implementation of Spark Decimal
[ https://issues.apache.org/jira/browse/SPARK-40387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40387:
Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-40387) Improve the implementation of Spark Decimal
[ https://issues.apache.org/jira/browse/SPARK-40387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40387:
Assignee: Apache Spark
[jira] [Commented] (SPARK-40387) Improve the implementation of Spark Decimal
[ https://issues.apache.org/jira/browse/SPARK-40387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601672#comment-17601672 ]

Apache Spark commented on SPARK-40387:
User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37830
[jira] [Commented] (SPARK-40387) Improve the implementation of Spark Decimal
[ https://issues.apache.org/jira/browse/SPARK-40387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601671#comment-17601671 ]

Apache Spark commented on SPARK-40387:
User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37830
[jira] [Created] (SPARK-40387) Improve the implementation of Spark Decimal
jiaan.geng created SPARK-40387:

Summary: Improve the implementation of Spark Decimal
Key: SPARK-40387
URL: https://issues.apache.org/jira/browse/SPARK-40387
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng

Spark Decimal always uses `ne` first, but `eq` is better.
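One plausible reading of this one-line description (an assumption on my part, the ticket does not elaborate) is the classic fast path of checking reference identity (`eq` in Scala) before a structural comparison. The Python analogue uses `is`; the class and counter here are purely illustrative:

```python
class Dec:
    """Toy decimal wrapper illustrating an identity fast path in equality."""

    def __init__(self, value):
        self.value = value
        self.compare_count = 0  # counts how often structural comparison ran

    def equals(self, other):
        if self is other:            # 'eq'-style fast path: same object, skip the work
            return True
        self.compare_count += 1      # structural comparison actually executed
        return isinstance(other, Dec) and self.value == other.value
```

Comparing a value with itself then costs nothing beyond an identity check, which matters in hot comparison paths.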
[jira] [Assigned] (SPARK-40386) Implement `ddof` in `DataFrame.cov`
[ https://issues.apache.org/jira/browse/SPARK-40386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40386:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-40386) Implement `ddof` in `DataFrame.cov`
[ https://issues.apache.org/jira/browse/SPARK-40386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601651#comment-17601651 ]

Apache Spark commented on SPARK-40386:
User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37829
[jira] [Assigned] (SPARK-40386) Implement `ddof` in `DataFrame.cov`
[ https://issues.apache.org/jira/browse/SPARK-40386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40386:
Assignee: Apache Spark
[jira] [Created] (SPARK-40386) Implement `ddof` in `DataFrame.cov`
Ruifeng Zheng created SPARK-40386:

Summary: Implement `ddof` in `DataFrame.cov`
Key: SPARK-40386
URL: https://issues.apache.org/jira/browse/SPARK-40386
Project: Spark
Issue Type: Sub-task
Components: ps, SQL
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng
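For reference, `ddof` ("delta degrees of freedom"), the parameter pandas exposes on `DataFrame.cov` and which this sub-task brings to pandas-on-Spark, changes the covariance denominator from n to n - ddof. A minimal plain-Python version (illustrative, not the pandas-on-Spark implementation):

```python
def covariance(xs, ys, ddof=1):
    """Covariance of two equal-length sequences with an adjustable denominator.

    ddof=1 gives the unbiased sample covariance (pandas' default);
    ddof=0 gives the population covariance.
    """
    n = len(xs)
    if n - ddof <= 0:
        raise ValueError("need more than ddof observations")
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - ddof)
```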
[jira] [Created] (SPARK-40385) Classes with companion object constructor fails interpreted path
Emil Ejbyfeldt created SPARK-40385:

Summary: Classes with companion object constructor fails interpreted path
Key: SPARK-40385
URL: https://issues.apache.org/jira/browse/SPARK-40385
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.2, 3.3.0, 3.1.3
Reporter: Emil Ejbyfeldt

The Encoder implemented in SPARK-8288 for classes with only a companion object constructor fails when using the interpreted path.
[jira] [Assigned] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed
[ https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40384:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed
[ https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601633#comment-17601633 ]

Apache Spark commented on SPARK-40384:
User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37828
[jira] [Commented] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed
[ https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601632#comment-17601632 ]

Apache Spark commented on SPARK-40384:
User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37828
[jira] [Assigned] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed
[ https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40384:
Assignee: Apache Spark
[jira] [Created] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed
Yikun Jiang created SPARK-40384:

Summary: Do base image real in time build only when infra dockerfile is changed
Key: SPARK-40384
URL: https://issues.apache.org/jira/browse/SPARK-40384
Project: Spark
Issue Type: Bug
Components: Project Infra
Affects Versions: 3.4.0
Reporter: Yikun Jiang
[jira] [Commented] (SPARK-40364) Unify `initDB` method in `DBProvider`
[ https://issues.apache.org/jira/browse/SPARK-40364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601622#comment-17601622 ]

Apache Spark commented on SPARK-40364:
User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37826

Issue (SPARK-40364, "Unify `initDB` method in `DBProvider`", Spark Core/Tests/YARN, Affects Versions: 3.4.0, Reporter: Yang Jie, Priority: Minor):

There are two `initDB` methods in `DBProvider`:
# DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)
# DB initDB(DBBackend dbBackend, File file)
The second one is only used by the test ShuffleTestAccessor; we can explore whether we can keep only the first method.
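One common way to unify such overloads (a sketch of the general pattern, not necessarily what the linked PR does; names are a Python rendering of the Java signatures) is to keep only the full-argument entry point and let the former two-argument callers pass explicit defaults:

```python
class DBProvider:
    """Sketch: collapse two initDB overloads into a single entry point."""

    def init_db(self, backend, db_file, version=None, mapper=None):
        # The old two-argument overload init_db(backend, file) now maps onto
        # this method with version/mapper defaulted, so there is one code path
        # for production and tests alike.
        return {"backend": backend, "file": db_file,
                "version": version, "mapper": mapper}

# Former test-only call site, now going through the unified method:
db = DBProvider().init_db("LEVELDB", "/tmp/registry.db")
```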
[jira] [Commented] (SPARK-40364) Unify `initDB` method in `DBProvider`
[ https://issues.apache.org/jira/browse/SPARK-40364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601620#comment-17601620 ]

Apache Spark commented on SPARK-40364:
User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37826
[jira] [Assigned] (SPARK-40383) Pin mypy ==0.920 in dev/requirements.txt
[ https://issues.apache.org/jira/browse/SPARK-40383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40383:
Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-40383) Pin mypy ==0.920 in dev/requirements.txt
[ https://issues.apache.org/jira/browse/SPARK-40383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40383:
Assignee: Apache Spark
[jira] [Assigned] (SPARK-40364) Unify `initDB` method in `DBProvider`
[ https://issues.apache.org/jira/browse/SPARK-40364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40364:
Assignee: Apache Spark
[jira] [Commented] (SPARK-40383) Pin mypy ==0.920 in dev/requirements.txt
[ https://issues.apache.org/jira/browse/SPARK-40383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601621#comment-17601621 ]

Apache Spark commented on SPARK-40383:
User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37827
[jira] [Assigned] (SPARK-40364) Unify `initDB` method in `DBProvider`
[ https://issues.apache.org/jira/browse/SPARK-40364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40364:
Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-40383) Pin mypy ==0.920 in dev/requirements.txt
Ruifeng Zheng created SPARK-40383:

Summary: Pin mypy ==0.920 in dev/requirements.txt
Key: SPARK-40383
URL: https://issues.apache.org/jira/browse/SPARK-40383
Project: Spark
Issue Type: Improvement
Components: Project Infra
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng
[jira] [Resolved] (SPARK-40186) mergedShuffleCleaner should have been shutdown before db closed
[ https://issues.apache.org/jira/browse/SPARK-40186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-40186:
Target Version/s: 3.4.0
Assignee: Yang Jie
Resolution: Fixed

Issue (SPARK-40186, "mergedShuffleCleaner should have been shutdown before db closed", Spark Core, Affects Versions: 3.4.0, Reporter: Yang Jie, Priority: Major):

We should ensure that `RemoteBlockPushResolver#mergedShuffleCleaner` has been shut down before `RemoteBlockPushResolver#db` is closed; otherwise `RemoteBlockPushResolver#applicationRemoved` may perform delete operations on a closed db.

https://github.com/apache/spark/pull/37610#discussion_r951185256
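The required ordering, stop the cleaner and drain its in-flight tasks before closing the DB so no delete can run against a closed handle, is the standard shutdown pattern. A sketch with Python's `ThreadPoolExecutor` standing in for `mergedShuffleCleaner` (class and field names are illustrative, not Spark's):

```python
from concurrent.futures import ThreadPoolExecutor

class PushResolver:
    """Sketch: shut the cleaner down before closing the db it writes to."""

    def __init__(self):
        self.cleaner = ThreadPoolExecutor(max_workers=1)
        self.db_closed = False

    def close(self):
        # 1. Stop accepting cleanup tasks and wait for in-flight ones to finish.
        self.cleaner.shutdown(wait=True)
        # 2. Only now is it safe to close the db: no cleaner task can touch it.
        self.db_closed = True

r = PushResolver()
r.cleaner.submit(lambda: None)  # a pending cleanup task
r.close()
```

Reversing the two steps in `close` reintroduces the race this issue fixes.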
[jira] [Updated] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
[ https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif updated SPARK-40362:
Description:

In the canonicalization code, which is now in two stages, canonicalization involving commutative operators is broken when they are subexpressions of certain expressions that override precanonicalize, for example BinaryComparison.

Consider the expression a + b > 10:

GT
|      \
a + b   10

BinaryComparison's precanonicalize first precanonicalizes the children and then may swap the operands based on the left/right hashCode inequality. Say Add(a, b).hashCode > 10.hashCode; then GT is converted to LT.

But if the equivalent tree is built as

GT
|      \
b + a   10

the hashCode of Add(b, a) is not the same as that of Add(a, b), so it is possible that Add(b, a).hashCode < 10.hashCode, in which case GT remains as is. Thus two equivalent trees canonicalize differently, one ending with GT, the other with LT.

The problem occurs because canonicalization normalizes a commutative expression to a consistent hashCode, while precanonicalize does not: a commutative expression's precanonicalized and canonicalized forms have different hashCodes.

The following test fails (reconstructed from the garbled ticket text; the match pattern extracting the Add is inferred from context):
{quote}
test("bug X") {
  val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
  val y = tr1.where('a.attr + 'c.attr > 10).analyze
  val fullCond = y.asInstanceOf[Filter].condition.clone()
  val addExpr = (fullCond match {
    case GreaterThan(x, _) => x
  }).clone().asInstanceOf[Add]
  val canonicalizedFullCond = fullCond.canonicalized
  // swap the operands of add
  val newAddExpr = Add(addExpr.right, addExpr.left)
  // build a new condition which is the same as the previous one,
  // but with the operands of Add reversed
  val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
  assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{quote}

The fix I propose is that for commutative expressions, precanonicalize should be overridden and Canonicalize.reorderCommutativeOperators invoked there instead of in canonicalize. Effectively, for commutative operators (Add, Or, Multiply, And, etc.) canonicalize and precanonicalize should be the same.

PR: https://github.com/apache/spark/pull/37824

I am also trying a better fix, where the idea is that for commutative expressions the murmur hash codes are calculated using an unordered hash, so that the result is order-independent (i.e. symmetric). That approach works fine, but in the case of Least and Greatest the Product's element is a Seq, which breaks the consistency of the hashCode.

was:
In the canonicalization code, which is now in two stages, canonicalization involving commutative operators is broken when they are subexpressions of certain expressions that override precanonicalize. Consider the expression a + b > 10. When this GreaterThan expression is canonicalized as a whole for the first time, the call to Canonicalize.reorderCommutativeOperators for the Add expression is skipped, because GreaterThan's canonicalization used precanonicalize on its children (the Add expression). So if we create a new expression b + a > 10 and canonicalize it, the canonicalized versions of the two expressions will not match. The old description contained the same failing test and proposed fix as above. PR: https://github.com/apache/spark/pull/37824

Issue: SPARK-40362 (Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators), SQL, Affects Versions: 3.3.0, Reporter: Asif
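The root cause described above, order-dependent hash codes for commutative operators, and the proposed order-independent alternative can be illustrated in plain Python (an analogy, not Spark's Murmur3-based implementation):

```python
def ordered_hash(op, operands):
    # Order-sensitive: hash((op, (a, b))) generally differs from hash((op, (b, a))),
    # which is why Add(a, b) and Add(b, a) can land on opposite sides of the
    # BinaryComparison swap and yield GT in one tree and LT in the other.
    return hash((op, tuple(operands)))

def unordered_hash(op, operands):
    # Order-insensitive: hashing the sorted multiset of operand hashes makes
    # Add(a, b) and Add(b, a) hash identically, so equivalent trees
    # canonicalize the same way regardless of how they were built.
    return hash((op, tuple(sorted(hash(x) for x in operands))))
```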
[jira] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
[ https://issues.apache.org/jira/browse/SPARK-40362 ] Asif deleted comment on SPARK-40362: -- was (Author: ashahid7): I have added locally test for &&, ||, &, | and * and they all fail for the same reason.. in process of adding tests for xor, greatest & least and testing the fix for the same... the fix is to make these expressions implement a trait trait CommutativeExpresionCanonicalization { this: Expression => override lazy val canonicalized: Expression = preCanonicalized override lazy val preCanonicalized: Expression = { val canonicalizedChildren = children.map(_.preCanonicalized) Canonicalize.reorderCommutativeOperators { withNewChildren(canonicalizedChildren) } } } > Bug in Canonicalization of expressions like Add & Multiply i.e Commutative > Operators > > > Key: SPARK-40362 > URL: https://issues.apache.org/jira/browse/SPARK-40362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Asif >Priority: Major > Labels: spark-sql > Original Estimate: 72h > Remaining Estimate: 72h > > In the canonicalization code which is now in two stages, canonicalization > involving Commutative operators is broken, if they are subexpressions of > certain type of expressions which override precanonicalize, for example > BinaryComparison > Consider following expression: > a + b > 10 > GT > | > a + b 10 > The BinaryComparison operator in the precanonicalize, first precanonicalizes > children & then may swap operands based on left /right hashCode inequality.. > lets say Add(a + b) .hashCode is > 10.hashCode as a result GT is converted > to LT > But If the same tree is created > GT > | > b + a 10 > The hashCode of Add(b, a) is not same as Add(a, b) , thus it is possible that > for this tree > Add(b + a) .hashCode is < 10.hashCode in which case GT remains as is. 
> Thus to similar trees result in different canonicalization , one having GT > other having LT > > The problem occurs because for commutative expressions the canonicalization > normalizes the expression with consistent hashCode which is not the case with > precanonicalize as the hashCode of commutative expression 's precanonicalize > and post canonicalize are different. > > > The test > {quote}test("bug X") > Unknown macro: \{ val tr1 = LocalRelation('c.int, 'b.string, 'a.int) > val y = tr1.where('a.attr + 'c.attr > 10).analyze val fullCond = > y.asInstanceOf[Filter].condition.clone() val addExpr = (fullCond match > Unknown macro} > ).clone().asInstanceOf[Add] > val canonicalizedFullCond = fullCond.canonicalized > // swap the operands of add > val newAddExpr = Add(addExpr.right, addExpr.left) > // build a new condition which is same as the previous one, but with operands > of //Add reversed > val builtCondnCanonicalized = GreaterThan(newAddExpr, > Literal(10)).canonicalized > assertEquals(canonicalizedFullCond, builtCondnCanonicalized) > } > {quote} > This test fails. > The fix which I propose is that for the commutative expressions, the > precanonicalize should be overridden and > Canonicalize.reorderCommutativeOperators be invoked on the expression instead > of at place of canonicalize. effectively for commutative operands ( add, or , > multiply , and etc) canonicalize and precanonicalize should be same. > PR: > [https://github.com/apache/spark/pull/37824] > > > I am also trying a better fix, where by the idea is that for commutative > expressions the murmur hashCode are caluculated using unorderedHash so that > it is order independent ( i.e symmetric). > The above approach works fine , but in case of Least & Greatest, the > Product's element is a Seq, and that messes with consistency of hashCode. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
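The order-dependence problem described in SPARK-40362 above can be sketched outside Spark. Below is a minimal, illustrative Python sketch (the names and structure are invented for the sketch, not Spark's actual API): an order-dependent hash of a commutative expression's children can put `Add(a, b)` and `Add(b, a)` on different sides of the comparison's swap threshold, while an unordered (symmetric) hash, in the spirit of the `unorderedHash` idea from the comment, always yields the same canonical operator.

```python
# Illustrative sketch (not Spark code) of SPARK-40362: why an order-dependent
# hash of a commutative expression's children can canonicalize
# GreaterThan(a + b, 10) and GreaterThan(b + a, 10) differently.

def ordered_hash(children):
    # Order-dependent: hash(("a", "b")) generally differs from hash(("b", "a"))
    return hash(tuple(children))

def unordered_hash(children):
    # Order-independent (symmetric), like an unorderedHash over the children:
    # frozenset hashing ignores element order.
    return hash(frozenset(children))

def canonicalize_comparison(op, left_children, right_hash, child_hash):
    # Mimics the BinaryComparison pre-canonicalization step: swap operands
    # (flipping GT to LT) when the left side's hash exceeds the right side's.
    left_hash = child_hash(left_children)
    return "LT" if left_hash > right_hash else op

rhs = hash(10)

# With an order-dependent hash, Add(a, b) and Add(b, a) may land on different
# sides of rhs, so the two equivalent trees can canonicalize differently:
gt_ab = canonicalize_comparison("GT", ["a", "b"], rhs, ordered_hash)
gt_ba = canonicalize_comparison("GT", ["b", "a"], rhs, ordered_hash)

# With an unordered hash, both operand orders always canonicalize identically:
u_ab = canonicalize_comparison("GT", ["a", "b"], rhs, unordered_hash)
u_ba = canonicalize_comparison("GT", ["b", "a"], rhs, unordered_hash)
assert u_ab == u_ba
```

With `ordered_hash`, whether `gt_ab` and `gt_ba` actually diverge depends on the concrete hash values, which is exactly the nondeterminism the ticket complains about; with `unordered_hash` the outcome is the same by construction.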
[jira] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
[ https://issues.apache.org/jira/browse/SPARK-40362 ] Asif deleted comment on SPARK-40362: -- was (Author: ashahid7): will be generating a PR tomorrow..
> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative
> Operators
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Asif
> Priority: Major
> Labels: spark-sql
> Original Estimate: 72h
> Remaining Estimate: 72h
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.9.3
[ https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40365. -- Resolution: Fixed Issue resolved by pull request 37814 [https://github.com/apache/spark/pull/37814] > Bump ANTLR runtime version from 4.8 to 4.9.3 > > > Key: SPARK-40365 > URL: https://issues.apache.org/jira/browse/SPARK-40365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.9.3
[ https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40365: Assignee: BingKun Pan > Bump ANTLR runtime version from 4.8 to 4.9.3 > > > Key: SPARK-40365 > URL: https://issues.apache.org/jira/browse/SPARK-40365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40378) What is React Native.
[ https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-40378. - Resolution: Invalid > What is React Native. > - > > Key: SPARK-40378 > URL: https://issues.apache.org/jira/browse/SPARK-40378 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.0.3 >Reporter: Nikhil Sharma >Priority: Major > > React Native is an open-source framework for building mobile apps. It was > created by Facebook and is designed for cross-platform capability. It can be > tough to choose between an excellent user experience, a beautiful user > interface, and fast processing, but [React Native online > course|https://www.igmguru.com/digital-marketing-programming/react-native-training/] > makes that decision an easy one with powerful native development. Jordan > Walke found a way to generate UI elements from a javascript thread and > applied it to iOS to build the first native application. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40378) What is React Native.
[ https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40378: Target Version/s: (was: 3.1.2)
> What is React Native.
> -
>
> Key: SPARK-40378
> URL: https://issues.apache.org/jira/browse/SPARK-40378
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 3.0.3
> Reporter: Nikhil Sharma
> Priority: Major
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40378) What is React Native.
[ https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40378: Fix Version/s: (was: 3.1.3)
> What is React Native.
> -
>
> Key: SPARK-40378
> URL: https://issues.apache.org/jira/browse/SPARK-40378
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 3.0.3
> Reporter: Nikhil Sharma
> Priority: Major
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40382) Reduce projections in Expand when multiple distinct aggregations have semantically equivalent children
[ https://issues.apache.org/jira/browse/SPARK-40382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601554#comment-17601554 ] Apache Spark commented on SPARK-40382: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/37825 > Reduce projections in Expand when multiple distinct aggregations have > semantically equivalent children > -- > > Key: SPARK-40382 > URL: https://issues.apache.org/jira/browse/SPARK-40382 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bruce Robbins >Priority: Major > > In RewriteDistinctAggregates, when grouping aggregate expressions by function > children, we should treat children that are semantically equivalent as the > same. > This proposed change potentially reduces the number of projections in the > Expand operator added to a plan. In some cases, it may eliminate the need for > an Expand operator. > Example: In the following query, the Expand operator creates 3*n rows (where > n is the number of incoming rows) because it has a projection for function > children b + 1, 1 + b and c. > {noformat} > create or replace temp view v1 as > select * from values > (1, 2, 3.0), > (1, 3, 4.0), > (2, 4, 2.5), > (2, 3, 1.0) > v1(a, b, c); > select > a, > count(distinct b + 1), > avg(distinct 1 + b) filter (where c > 0), > sum(c) > from > v1 > group by a; > {noformat} > The Expand operator has three projections (each producing a row for each > incoming row): > {noformat} > [a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for > regular aggregation) > [a#87, (b#88 + 1), null, 1, null, null], <== projection #2 (for > distinct aggregation of b + 1) > [a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for > distinct aggregation of 1 + b) > {noformat} > In reality, the Expand only needs one projection for 1 + b and b + 1, because > they are semantically equivalent. 
> With the proposed change, the Expand operator's projections look like this: > {noformat} > [a#67, null, 0, null, UnscaledValue(c#69)], <== projection #1 (for regular > aggregations) > [a#67, (b#68 + 1), 1, (c#69 > 0.0), null]], <== projection #2 (for distinct > aggregation on b + 1 and 1 + b) > {noformat} > With one less projection, Expand produces n*2 rows instead of n*3 rows, but > still produces the correct result. > In the case where all distinct aggregates have semantically equivalent > children, the Expand operator is not needed at all. > Assume this benchmark: > {noformat} > runBenchmark("distinct aggregates") { > val N = 20 << 22 > val benchmark = new Benchmark("distinct aggregates", N, output = output) > spark.range(N).selectExpr("id % 100 as k", "id % 10 as id1") > .createOrReplaceTempView("test") > def f1(): Unit = spark.sql( > """ > select > k, > sum(distinct id1 + 1), > count(distinct 1 + id1), > avg(distinct 1 + ID1) > from > test > group by k""").noop() > benchmark.addCase("all semantically equivalent", numIters = 2) { _ => > f1() > } > def f2(): Unit = spark.sql( > """ > select > k, > sum(distinct id1 + 1), > count(distinct 1 + id1), > avg(distinct 2 + ID1) > from > test > group by k""").noop() > benchmark.addCase("some semantically equivalent", numIters = 2) { _ => > f2() > } > def f3(): Unit = spark.sql( > """ > select > k, > sum(distinct id1 + 1), > count(distinct 3 + id1), > avg(distinct 2 + ID1) > from > test > group by k""").noop() > benchmark.addCase("none semantically equivalent", numIters = 2) { _ => > f3() > } > benchmark.run() > } > {noformat} > Before the change: > {noformat} > [info] distinct aggregates: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > [info] > > [info] all semantically equivalent 14721 14859 > 195 5.7 175.5 1.0X > [info] some semantically equivalent 14569 14572 > 5 5.8 173.7 1.0X > [info] none semantically equivalent 14408 14488 > 113 5.8 171.8 1.0X > {noformat} > After the propose
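The grouping-by-semantic-equivalence idea behind SPARK-40382 can be sketched in a few lines of plain Python. This is an illustration of the proposal on a toy expression representation, not Spark's actual RewriteDistinctAggregates code: canonicalize each distinct aggregate's child by sorting the operands of commutative operators, then group children by their canonical form, so that `b + 1` and `1 + b` end up in one group and need only one Expand projection.

```python
# Illustrative sketch (not Spark's implementation): grouping distinct
# aggregations by a canonical form of their children, so semantically
# equivalent expressions like b + 1 and 1 + b share one Expand projection.
from collections import defaultdict

def canonical(expr):
    """Canonicalize a tiny expression tree: ("+"/"*", left, right) or a leaf.
    Commutative operators get their operands sorted deterministically,
    so ("+", "b", 1) and ("+", 1, "b") canonicalize identically."""
    if isinstance(expr, tuple):
        op, left, right = expr
        left, right = canonical(left), canonical(right)
        if op in ("+", "*"):  # commutative: order operands deterministically
            left, right = sorted([left, right], key=repr)
        return (op, left, right)
    return expr

def group_distinct_aggs(agg_children):
    """Group aggregate-function children by canonical form."""
    groups = defaultdict(list)
    for child in agg_children:
        groups[canonical(child)].append(child)
    return groups

# count(distinct b + 1), avg(distinct 1 + b), plus a non-equivalent child c:
children = [("+", "b", 1), ("+", 1, "b"), "c"]
groups = group_distinct_aggs(children)
# b + 1 and 1 + b collapse into one group -> one fewer Expand projection
assert len(groups) == 2
```

In the query from the ticket, this is exactly why the Expand operator drops from three projections to two: the two distinct-aggregate children land in a single group, while the regular aggregation keeps its own projection.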
[jira] [Assigned] (SPARK-40382) Reduce projections in Expand when multiple distinct aggregations have semantically equivalent children
[ https://issues.apache.org/jira/browse/SPARK-40382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40382: Assignee: (was: Apache Spark)
> Reduce projections in Expand when multiple distinct aggregations have
> semantically equivalent children
>
> Key: SPARK-40382
> URL: https://issues.apache.org/jira/browse/SPARK-40382
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Bruce Robbins
> Priority: Major
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40382) Reduce projections in Expand when multiple distinct aggregations have semantically equivalent children
[ https://issues.apache.org/jira/browse/SPARK-40382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40382: Assignee: Apache Spark
> Reduce projections in Expand when multiple distinct aggregations have
> semantically equivalent children
>
> Key: SPARK-40382
> URL: https://issues.apache.org/jira/browse/SPARK-40382
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Bruce Robbins
> Assignee: Apache Spark
> Priority: Major
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40382) Reduce projections in Expand when multiple distinct aggregations have semantically equivalent children
Bruce Robbins created SPARK-40382: - Summary: Reduce projections in Expand when multiple distinct aggregations have semantically equivalent children Key: SPARK-40382 URL: https://issues.apache.org/jira/browse/SPARK-40382 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Bruce Robbins In RewriteDistinctAggregates, when grouping aggregate expressions by function children, we should treat children that are semantically equivalent as the same. This proposed change potentially reduces the number of projections in the Expand operator added to a plan. In some cases, it may eliminate the need for an Expand operator. Example: In the following query, the Expand operator creates 3*n rows (where n is the number of incoming rows) because it has a projection for function children b + 1, 1 + b and c. {noformat} create or replace temp view v1 as select * from values (1, 2, 3.0), (1, 3, 4.0), (2, 4, 2.5), (2, 3, 1.0) v1(a, b, c); select a, count(distinct b + 1), avg(distinct 1 + b) filter (where c > 0), sum(c) from v1 group by a; {noformat} The Expand operator has three projections (each producing a row for each incoming row): {noformat} [a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for regular aggregation) [a#87, (b#88 + 1), null, 1, null, null], <== projection #2 (for distinct aggregation of b + 1) [a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for distinct aggregation of 1 + b) {noformat} In reality, the Expand only needs one projection for 1 + b and b + 1, because they are semantically equivalent. With the proposed change, the Expand operator's projections look like this: {noformat} [a#67, null, 0, null, UnscaledValue(c#69)], <== projection #1 (for regular aggregations) [a#67, (b#68 + 1), 1, (c#69 > 0.0), null]], <== projection #2 (for distinct aggregation on b + 1 and 1 + b) {noformat} With one less projection, Expand produces n*2 rows instead of n*3 rows, but still produces the correct result. 
In the case where all distinct aggregates have semantically equivalent children, the Expand operator is not needed at all. Assume this benchmark: {noformat} runBenchmark("distinct aggregates") { val N = 20 << 22 val benchmark = new Benchmark("distinct aggregates", N, output = output) spark.range(N).selectExpr("id % 100 as k", "id % 10 as id1") .createOrReplaceTempView("test") def f1(): Unit = spark.sql( """ select k, sum(distinct id1 + 1), count(distinct 1 + id1), avg(distinct 1 + ID1) from test group by k""").noop() benchmark.addCase("all semantically equivalent", numIters = 2) { _ => f1() } def f2(): Unit = spark.sql( """ select k, sum(distinct id1 + 1), count(distinct 1 + id1), avg(distinct 2 + ID1) from test group by k""").noop() benchmark.addCase("some semantically equivalent", numIters = 2) { _ => f2() } def f3(): Unit = spark.sql( """ select k, sum(distinct id1 + 1), count(distinct 3 + id1), avg(distinct 2 + ID1) from test group by k""").noop() benchmark.addCase("none semantically equivalent", numIters = 2) { _ => f3() } benchmark.run() } {noformat} Before the change: {noformat} [info] distinct aggregates: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative [info] [info] all semantically equivalent 14721 14859 195 5.7 175.5 1.0X [info] some semantically equivalent 14569 14572 5 5.8 173.7 1.0X [info] none semantically equivalent 14408 14488 113 5.8 171.8 1.0X {noformat} After the proposed change: {noformat} [info] distinct aggregates: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative [info] [info] all semantically equivalent3658 3692 49 22.9 43.6 1.0X [info] some semantically equivalent 9124 9214 127 9.2 108.8 0.4X [info] none semantically equivalent 14601 14777 2
[jira] [Comment Edited] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung
[ https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601543#comment-17601543 ] Mars edited comment on SPARK-40320 at 9/7/22 10:50 PM: --- [~Ngone51] Shouldn't it bring up a new `receiveLoop()` to serve RPC messages? Yes, my previous thinking was wrong. I remote debug on Executor and I found that it did catch the fatal error in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89] . It will resubmit receiveLoop and in the second time it will block by [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L69] This Executor did not initialize successfully in the first time , so it didn't send LaunchedExecutor to Driver (you can see [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L172] ) So the Executor can't launch task, related PR [https://github.com/apache/spark/pull/25964] . Why SparkUncaughtExceptionHandler doesn't catch the fatal error? See [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L284] plugins is private variable, so it was broken when initialize Executor at the beginning. was (Author: JIRAUSER290821): [~Ngone51] Shouldn't it bring up a new `receiveLoop()` to serve RPC messages? Yes, my previous thinking was wrong. I remote debug on Executor and I found that it did catch the fatal error in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89] . 
It will resubmit receiveLoop and in the second time it will block by [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L69] This Executor did not initialize successfully in the first time and didn't send LaunchedExecutor to Driver (you can see [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L172] ) So the Executor can't launch task, related PR [https://github.com/apache/spark/pull/25964] . Why SparkUncaughtExceptionHandler doesn't catch the fatal error? See [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L284] plugins is private variable, so it was broken when initialize Executor at the beginning. > When the Executor plugin fails to initialize, the Executor shows active but > does not accept tasks forever, just like being hung > --- > > Key: SPARK-40320 > URL: https://issues.apache.org/jira/browse/SPARK-40320 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Mars >Priority: Major > > *Reproduce step:* > set `spark.plugins=ErrorSparkPlugin` > `ErrorSparkPlugin` && `ErrorExecutorPlugin` class as below (I abbreviate the > code to make it clearer): > {code:java} > class ErrorSparkPlugin extends SparkPlugin { > /** >*/ > override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin() > /** >*/ > override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin() > }{code} > {code:java} > class ErrorExecutorPlugin extends ExecutorPlugin { > private val checkingInterval: Long = 1 > override def init(_ctx: PluginContext, extraConf: util.Map[String, > String]): Unit = { > if (checkingInterval == 1) { > throw new UnsatisfiedLinkError("My Exception error") > } > } > } {code} > The Executor is active when we check in spark-ui, however it was broken and > doesn't receive any task. 
> *Root Cause:*
> I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall`
> will throw a fatal error (`UnsatisfiedLinkError` is a fatal error) in the method
> `dealWithFatalError`. Actually the `CoarseGrainedExecutorBackend` JVM
> process is active, but the communication thread is no longer working
> (please see `MessageLoop#receiveLoopRunnable`: `receiveLoop()` was broken,
> so the executor doesn't receive any messages).
> Some ideas:
> I think it is very hard to know what happened here unless we check the
> code. The Executor is active but it can't do anything, so we are left wondering
> whether the driver is broken or it is an Executor problem. I think at the least
> the Executor status shouldn't be active here, or the Executor could call
> exitExecutor (kill itself)
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
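The failure mode in SPARK-40320 can be sketched abstractly. The Python below is purely illustrative (the real components are Scala's MessageLoop, Inbox, and CoarseGrainedExecutorBackend, and all names here are invented for the sketch): a fatal error during plugin init aborts the executor's registration step, the loop-level catch keeps the process alive, and the result is a live process that never announced itself as ready for tasks.

```python
# Minimal sketch (illustrative only, not Spark's code) of SPARK-40320:
# a fatal error during plugin init is swallowed at the message-loop level,
# the loop "restarts", but the "I am launched" registration was never sent,
# so the executor looks alive while never receiving work.

class Executor:
    def __init__(self, plugin_init):
        self.registered = False   # analogous to sending LaunchedExecutor
        plugin_init()             # may raise, aborting registration below
        self.registered = True

def message_loop(make_executor):
    """Deliver startup work; a fatal error is caught and the loop survives,
    mirroring MessageLoop resubmitting receiveLoop after dealWithFatalError."""
    try:
        return make_executor()
    except BaseException:         # fatal errors included
        return None               # loop lives on; executor was never registered

def bad_plugin():
    # Stand-in for the UnsatisfiedLinkError thrown by ErrorExecutorPlugin.init
    raise RuntimeError("plugin init failed")

# Broken plugin: process stays "up", but no executor ever registered for tasks.
broken = message_loop(lambda: Executor(bad_plugin))
assert broken is None

# Healthy plugin: init succeeds and registration completes.
healthy = message_loop(lambda: Executor(lambda: None))
assert healthy is not None and healthy.registered
```

The sketch mirrors the report's conclusion: since nothing outside the loop observes the failed registration, the process should either report a non-active status or terminate itself rather than hang.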
[jira] [Updated] (SPARK-40311) Introduce withColumnsRenamed
[ https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santosh Pingale updated SPARK-40311: Docs Text: Add withColumnsRenamed to Scala and PySpark API
> Introduce withColumnsRenamed
>
> Key: SPARK-40311
> URL: https://issues.apache.org/jira/browse/SPARK-40311
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SparkR, SQL
> Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
> Reporter: Santosh Pingale
> Priority: Minor
>
> Add a Scala, PySpark, R dataframe API that can rename multiple columns in a
> single command. Issues arise when users iteratively perform
> `withColumnRenamed`:
> * When it works, we see slower performance
> * In some cases, a StackOverflowError is raised due to the logical plan being
> too big
> * In a few cases, the driver died due to memory consumption
> Some reproducible benchmarks:
> {code:python}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> for col in raw.columns:
>     raw = raw.withColumnRenamed(col, f"prefix_{col}")
> b = datetime.datetime.now()
> for col in raw.columns:
>     raw = raw.withColumnRenamed(col, f"prefix_{col}")
> c = datetime.datetime.now()
> for col in raw.columns:
>     raw = raw.withColumnRenamed(col, f"prefix_{col}")
> d = datetime.datetime.now()
> for col in raw.columns:
>     raw = raw.withColumnRenamed(col, f"prefix_{col}")
> e = datetime.datetime.now()
> for col in raw.columns:
>     raw = raw.withColumnRenamed(col, f"prefix_{col}")
> f = datetime.datetime.now()
> for col in raw.columns:
>     raw = raw.withColumnRenamed(col, f"prefix_{col}")
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(seconds=12, microseconds=480021)
> {code}
> {code:python}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
> b = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
> c = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
> d = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
> e = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
> f = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(microseconds=632116)
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
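The benchmark's point, that one bulk rename beats a hundred chained `withColumnRenamed` calls, comes down to applying the whole rename mapping in a single pass instead of growing the logical plan by one projection per column. Below is a minimal, Spark-free sketch of that single pass over a column list (an aside, not from the ticket: in PySpark a common one-shot workaround is a single `select` with aliases, e.g. `df.select([F.col(c).alias(mapping.get(c, c)) for c in df.columns])`).

```python
# Spark-free sketch of the single-pass rename idea behind withColumnsRenamed:
# apply an {old: new} mapping across the whole schema at once, rather than
# one withColumnRenamed call (and one extra plan node) per column.

def rename_columns(columns, mapping):
    """Return the column list with the rename mapping applied in one pass.
    Columns absent from the mapping keep their original name."""
    return [mapping.get(c, c) for c in columns]

# Mirror the ticket's setup: 100 columns named "0".."99", all prefixed.
columns = [str(i) for i in range(100)]
mapping = {c: f"prefix_{c}" for c in columns}
renamed = rename_columns(columns, mapping)

assert renamed[0] == "prefix_0"
assert len(renamed) == 100
```

The iterative loop in the benchmark performs this same transformation a hundred times over, once per column, which is where the 12 s vs 0.6 s gap comes from.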
[jira] [Updated] (SPARK-40087) Support multiple Column drop in R
[ https://issues.apache.org/jira/browse/SPARK-40087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santosh Pingale updated SPARK-40087: Docs Text: Support for SparkR to drop multiple "Column" > Support multiple Column drop in R > - > > Key: SPARK-40087 > URL: https://issues.apache.org/jira/browse/SPARK-40087 > Project: Spark > Issue Type: New Feature > Components: R >Affects Versions: 3.3.0 >Reporter: Santosh Pingale >Assignee: Santosh Pingale >Priority: Minor > Fix For: 3.4.0 > > > This is a followup on SPARK-39895. The PR previously attempted to adjust > implementation for R as well to match signatures but that part was removed > and we only focused on getting python implementation to behave correctly. > *{{Change supports following operations:}}* > {{df <- select(read.json(jsonPath), "name", "age")}} > {{df$age2 <- df$age}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > {{df1 <- drop(df, df$age, column("random"))}} > {{expect_equal(columns(df1), c("name", "age2"))}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39895) pyspark drop doesn't accept *cols
[ https://issues.apache.org/jira/browse/SPARK-39895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santosh Pingale updated SPARK-39895: Docs Text: Support for PySpark to drop multiple "Column" > pyspark drop doesn't accept *cols > -- > > Key: SPARK-39895 > URL: https://issues.apache.org/jira/browse/SPARK-39895 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.3, 3.3.0, 3.2.2 >Reporter: Santosh Pingale >Assignee: Santosh Pingale >Priority: Minor > Fix For: 3.4.0 > > > The PySpark dataframe drop method has the following signature: > {color:#4c9aff}{{def drop(self, *cols: "ColumnOrName") -> > "DataFrame":}}{color} > However, when we try to pass multiple Column types to the drop function, it raises > a TypeError: > {{each col in the param list should be a string}} > *Minimal reproducible example:* > {color:#4c9aff}values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), > ("id_1", 3, 3), ("id_2", 4, 3)]{color} > {color:#4c9aff}df = spark.createDataFrame(values, "id string, point int, > count int"){color} > |– id: string (nullable = true)| > |– point: integer (nullable = true)| > |– count: integer (nullable = true)| > {color:#4c9aff}{{df.drop(df.point, df.count)}}{color} > {quote}{color:#505f79}/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py > in drop(self, *cols){color} > {color:#505f79}2537 for col in cols:{color} > {color:#505f79}2538 if not isinstance(col, str):{color} > {color:#505f79}-> 2539 raise TypeError("each col in the param list should be > a string"){color} > {color:#505f79}2540 jdf = self._jdf.drop(self._jseq(cols)){color} > {color:#505f79}2541{color} > {color:#505f79}TypeError: each col in the param list should be a string{color} > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
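The intent of the fix can be sketched in plain Python: instead of rejecting everything that is not a string, accept both names and Column objects and normalize each to a name. The `Column` class and `drop` function below are stand-ins, not PySpark's real implementation:

```python
# Toy stand-in for pyspark.sql.Column, carrying just a name.
class Column:
    def __init__(self, name):
        self.name = name

def drop(columns, *cols):
    """Drop columns by name or by Column object, matching the fixed
    signature drop(self, *cols: "ColumnOrName")."""
    to_drop = set()
    for c in cols:
        if isinstance(c, str):
            to_drop.add(c)
        elif isinstance(c, Column):
            to_drop.add(c.name)
        else:
            raise TypeError("each col in the param list should be a string or Column")
    # Unknown names are silently ignored, as DataFrame.drop does.
    return [c for c in columns if c not in to_drop]

print(drop(["id", "point", "count"], Column("point"), Column("count")))  # ['id']
print(drop(["id", "point", "count"], "point", Column("count")))          # ['id']
```

The key change is that the per-argument type check dispatches on the argument's type rather than raising for every non-string.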
[jira] [Commented] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung
[ https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601543#comment-17601543 ] Mars commented on SPARK-40320: -- [~Ngone51] Shouldn't it bring up a new `receiveLoop()` to serve RPC messages? Yes, my previous thinking was wrong. I remote-debugged the Executor and found that it did catch the fatal error in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89] . It will resubmit receiveLoop and block in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L69] . But this Executor did not initialize successfully and didn't send LaunchedExecutor to the Driver (see [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L172] ). So the Executor can't launch tasks; related PR: [https://github.com/apache/spark/pull/25964] . Why doesn't SparkUncaughtExceptionHandler catch the fatal error? See [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L284] : `plugins` is a private variable, so it was broken when initializing the Executor at the beginning.
> When the Executor plugin fails to initialize, the Executor shows active but > does not accept tasks forever, just like being hung > --- > > Key: SPARK-40320 > URL: https://issues.apache.org/jira/browse/SPARK-40320 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Mars >Priority: Major > > *Reproduce steps:* > set `spark.plugins=ErrorSparkPlugin` with the `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes as below (abbreviated to > make it clearer): > {code:java} > class ErrorSparkPlugin extends SparkPlugin { > /** >*/ > override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin() > /** >*/ > override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin() > }{code} > {code:java} > class ErrorExecutorPlugin extends ExecutorPlugin { > private val checkingInterval: Long = 1 > override def init(_ctx: PluginContext, extraConf: util.Map[String, > String]): Unit = { > if (checkingInterval == 1) { > throw new UnsatisfiedLinkError("My Exception error") > } > } > } {code} > The Executor is active when we check in the Spark UI, however it is broken and > doesn't receive any tasks. > *Root Cause:* > I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall` > will throw a fatal error (`UnsatisfiedLinkError` is a fatal error) in the method > `dealWithFatalError`. Actually the `CoarseGrainedExecutorBackend` JVM > process is active but the communication thread is no longer working ( > see `MessageLoop#receiveLoopRunnable`; `receiveLoop()` was broken, > so the executor doesn't receive any messages). > Some ideas: > I think it is very hard to know what happened here unless we check the > code. The Executor is active but it can't do anything. We will wonder whether the > driver is broken or the Executor has a problem. 
I think at least the Executor > status shouldn't be active here, or the Executor should call exitExecutor (kill itself) > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
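The failure mode above can be modeled in a few lines of Python (a toy stand-in, not Spark's code; all names here are illustrative): the plugin's error is raised before the executor gets to register with the driver, and the surrounding loop swallows it, so the process stays alive but never becomes schedulable:

```python
# Toy model of the hang: the executor only becomes schedulable after it
# sends LaunchedExecutor to the driver, which happens after plugin init.
state = {"launched_executor_sent": False}

def init_plugins():
    # Stands in for the plugin's init() throwing UnsatisfiedLinkError.
    raise RuntimeError("My Exception error")

def init_executor():
    init_plugins()                            # raises before the next line
    state["launched_executor_sent"] = True    # never reached

try:
    init_executor()
except RuntimeError:
    # Analogous to the message loop catching the fatal error and
    # resubmitting receiveLoop: the JVM survives, nothing exits.
    pass

# The driver never receives LaunchedExecutor, so no tasks are scheduled,
# yet the process looks "active" from the outside.
print(state["launched_executor_sent"])  # False
```

This is why either reporting a non-active status or calling exitExecutor on plugin-init failure would make the condition visible instead of silent.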
[jira] [Updated] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
[ https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-40362: - Description: In the canonicalization code which is now in two stages, canonicalization involving Commutative operators is broken, if they are subexpressions of certain type of expressions which override precanonicalize. Consider following expression: a + b > 10 This GreaterThan expression when canonicalized as a whole for first time, will skip the call to Canonicalize.reorderCommutativeOperators for the Add expression as the GreaterThan's canonicalization used precanonicalize on children ( the Add expression). so if create a new expression b + a > 10 and invoke canonicalize it, the canonicalized versions of these two expressions will not match. The test {quote}test("bug X") { val tr1 = LocalRelation('c.int, 'b.string, 'a.int) val y = tr1.where('a.attr + 'c.attr > 10).analyze val fullCond = y.asInstanceOf[Filter].condition.clone() val addExpr = (fullCond match { case GreaterThan(x: Add, _) => x case LessThan(_, x: Add) => x }).clone().asInstanceOf[Add] val canonicalizedFullCond = fullCond.canonicalized // swap the operands of add val newAddExpr = Add(addExpr.right, addExpr.left) // build a new condition which is same as the previous one, but with operands of //Add reversed val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized assertEquals(canonicalizedFullCond, builtCondnCanonicalized) } {quote} This test fails. The fix which I propose is that for the commutative expressions, the precanonicalize should be overridden and Canonicalize.reorderCommutativeOperators be invoked on the expression instead of at place of canonicalize. effectively for commutative operands ( add, or , multiply , and etc) canonicalize and precanonicalize should be same. 
PR: https://github.com/apache/spark/pull/37824 was: In the canonicalization code which is now in two stages, canonicalization involving Commutative operators is broken, if they are subexpressions of certain type of expressions which override precanonicalize. Consider following expression: a + b > 10 This GreaterThan expression when canonicalized as a whole for first time, will skip the call to Canonicalize.reorderCommutativeOperators for the Add expression as the GreaterThan's canonicalization used precanonicalize on children ( the Add expression). so if create a new expression b + a > 10 and invoke canonicalize it, the canonicalized versions of these two expressions will not match. The test {quote} test("bug X") { val tr1 = LocalRelation('c.int, 'b.string, 'a.int) val y = tr1.where('a.attr + 'c.attr > 10).analyze val fullCond = y.asInstanceOf[Filter].condition.clone() val addExpr = (fullCond match { case GreaterThan(x: Add, _) => x case LessThan(_, x: Add) => x }).clone().asInstanceOf[Add] val canonicalizedFullCond = fullCond.canonicalized // swap the operands of add val newAddExpr = Add(addExpr.right, addExpr.left) // build a new condition which is same as the previous one, but with operands of //Add reversed val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized assertEquals(canonicalizedFullCond, builtCondnCanonicalized) } {quote} This test fails. The fix which I propose is that for the commutative expressions, the precanonicalize should be overridden and Canonicalize.reorderCommutativeOperators be invoked on the expression instead of at place of canonicalize. effectively for commutative operands ( add, or , multiply , and etc) canonicalize and precanonicalize should be same. 
I will file a PR for it > Bug in Canonicalization of expressions like Add & Multiply i.e Commutative > Operators > > > Key: SPARK-40362 > URL: https://issues.apache.org/jira/browse/SPARK-40362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Asif >Priority: Major > Labels: spark-sql > Original Estimate: 72h > Remaining Estimate: 72h > > In the canonicalization code which is now in two stages, canonicalization > involving Commutative operators is broken, if they are subexpressions of > certain type of expressions which override precanonicalize. > Consider following expression: > a + b > 10 > This GreaterThan expression when canonicalized as a whole for first time, > will skip the call to Canonicalize.reorderCommutativeOperators for the Add > expression as the GreaterThan's canonicalization used precanonicalize on > children ( the Add expression). > > so if create a new expression > b + a > 10 and invoke canonicalize it, the canonicalized versions of these > two expressions will not match. > The test > {quote}test("bug X") { > val tr1
[jira] [Updated] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
[ https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-40362: - Shepherd: Wenchen Fan (was: Tanel Kiis) > Bug in Canonicalization of expressions like Add & Multiply i.e Commutative > Operators > > > Key: SPARK-40362 > URL: https://issues.apache.org/jira/browse/SPARK-40362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Asif >Priority: Major > Labels: spark-sql > Original Estimate: 72h > Remaining Estimate: 72h > > In the canonicalization code which is now in two stages, canonicalization > involving Commutative operators is broken, if they are subexpressions of > certain type of expressions which override precanonicalize. > Consider following expression: > a + b > 10 > This GreaterThan expression when canonicalized as a whole for first time, > will skip the call to Canonicalize.reorderCommutativeOperators for the Add > expression as the GreaterThan's canonicalization used precanonicalize on > children ( the Add expression). > > so if create a new expression > b + a > 10 and invoke canonicalize it, the canonicalized versions of these > two expressions will not match. > The test > {quote} > test("bug X") { > val tr1 = LocalRelation('c.int, 'b.string, 'a.int) > val y = tr1.where('a.attr + 'c.attr > 10).analyze > val fullCond = y.asInstanceOf[Filter].condition.clone() > val addExpr = (fullCond match { > case GreaterThan(x: Add, _) => x > case LessThan(_, x: Add) => x > }).clone().asInstanceOf[Add] > val canonicalizedFullCond = fullCond.canonicalized > // swap the operands of add > val newAddExpr = Add(addExpr.right, addExpr.left) > // build a new condition which is same as the previous one, but with operands > of //Add reversed > val builtCondnCanonicalized = GreaterThan(newAddExpr, > Literal(10)).canonicalized > assertEquals(canonicalizedFullCond, builtCondnCanonicalized) > } > {quote} > This test fails. 
> The fix which I propose is that for the commutative expressions, the > precanonicalize should be overridden and > Canonicalize.reorderCommutativeOperators be invoked on the expression instead > of at place of canonicalize. effectively for commutative operands ( add, or > , multiply , and etc) canonicalize and precanonicalize should be same. > I will file a PR for it -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
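The proposed fix can be illustrated with a toy canonicalizer in Python (tuples stand in for Catalyst expression trees; this is a sketch, not Spark's implementation): if commutative operands are reordered during precanonicalization, a parent expression that builds its canonical form from precanonicalized children still sees a normalized Add:

```python
# Toy expressions: ("Op", operand, operand, ...); leaves are strings or ints.
COMMUTATIVE = {"Add", "Multiply", "And", "Or"}

def precanonicalize(expr):
    """Recursively canonicalize; commutative operators sort their
    (already-precanonicalized) operands into a deterministic order,
    so equivalent trees collapse to one form even when a parent only
    precanonicalizes its children."""
    if isinstance(expr, tuple):
        op, *operands = expr
        operands = [precanonicalize(o) for o in operands]
        if op in COMMUTATIVE:
            operands = sorted(operands, key=repr)
        return (op, *operands)
    return expr  # leaf: attribute name or literal

# a + c > 10  versus  c + a > 10
e1 = ("GreaterThan", ("Add", "a", "c"), 10)
e2 = ("GreaterThan", ("Add", "c", "a"), 10)
print(precanonicalize(e1) == precanonicalize(e2))  # True

# Non-commutative operators keep their operand order.
print(precanonicalize(("Subtract", "a", "c")) == precanonicalize(("Subtract", "c", "a")))  # False
```

The bug described above corresponds to reordering happening only at the top-level `canonicalized` step, so an Add nested under GreaterThan was never reordered; the fix moves the reordering into the precanonicalize pass, making the two stages agree for commutative operators.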
[jira] [Resolved] (SPARK-40291) Improve the message for column not in group by clause error
[ https://issues.apache.org/jira/browse/SPARK-40291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40291. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37742 [https://github.com/apache/spark/pull/37742] > Improve the message for column not in group by clause error > --- > > Key: SPARK-40291 > URL: https://issues.apache.org/jira/browse/SPARK-40291 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Linhong Liu >Assignee: Linhong Liu >Priority: Major > Fix For: 3.4.0 > > > Improve the message for column not in group by clause error to use the new > error framework -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40291) Improve the message for column not in group by clause error
[ https://issues.apache.org/jira/browse/SPARK-40291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40291: Assignee: Linhong Liu > Improve the message for column not in group by clause error > --- > > Key: SPARK-40291 > URL: https://issues.apache.org/jira/browse/SPARK-40291 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Linhong Liu >Assignee: Linhong Liu >Priority: Major > > Improve the message for column not in group by clause error to use the new > error framework -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40293) Make the V2 table error message more meaningful
[ https://issues.apache.org/jira/browse/SPARK-40293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40293. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37746 [https://github.com/apache/spark/pull/37746] > Make the V2 table error message more meaningful > --- > > Key: SPARK-40293 > URL: https://issues.apache.org/jira/browse/SPARK-40293 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.4.0 > > > When the V2 catalog is not configured, Spark fails to access/create a table using > the V2 API and silently falls back to attempting to do the same operation > using the V1 API. This happens frequently among users. We want to have a > better error message so that users can fix the configuration/usage issue by > themselves. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40293) Make the V2 table error message more meaningful
[ https://issues.apache.org/jira/browse/SPARK-40293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40293: Assignee: Huaxin Gao > Make the V2 table error message more meaningful > --- > > Key: SPARK-40293 > URL: https://issues.apache.org/jira/browse/SPARK-40293 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > When the V2 catalog is not configured, Spark fails to access/create a table using > the V2 API and silently falls back to attempting to do the same operation > using the V1 API. This happens frequently among users. We want to have a > better error message so that users can fix the configuration/usage issue by > themselves. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40380) Constant-folding of InvokeLike should not result in non-serializable result
[ https://issues.apache.org/jira/browse/SPARK-40380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40380: Assignee: (was: Apache Spark) > Constant-folding of InvokeLike should not result in non-serializable result > --- > > Key: SPARK-40380 > URL: https://issues.apache.org/jira/browse/SPARK-40380 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kris Mok >Priority: Major > > SPARK-37907 added constant-folding support to the {{InvokeLike}} family of > expressions. Unfortunately it introduced a regression for cases when a > constant-folded {{InvokeLike}} expression returned a non-serializable result. > {{ExpressionEncoder}}s is an area where this problem may be exposed, e.g. > when using sparksql-scalapb on Spark 3.3.0+. > Below is a minimal repro to demonstrate this issue: > {code:scala} > import org.apache.spark.sql.Column > import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute > import org.apache.spark.sql.catalyst.expressions.Literal > import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, > StaticInvoke} > import org.apache.spark.sql.types.{LongType, ObjectType} > class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = > longVal + other } > case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): > NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) } > val litExpr = Literal.fromObject(SerializableBoxedLong(42L), > ObjectType(classOf[SerializableBoxedLong])) > val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", > ObjectType(classOf[NotSerializableBoxedLong])) > val addExpr = Invoke(toNotSerializableExpr, "add", LongType, > Seq(UnresolvedAttribute.quotedString("id"))) > val df = spark.range(2).select(new Column(addExpr)) > df.collect > {code} > Before SPARK-37907, this example would run fine and result in {{[[42], > [43]]}}. But after SPARK-37907, it'd fail with: > {code:none} > ... 
> Caused by: java.io.NotSerializableException: NotSerializableBoxedLong > Serialization stack: > - object not serializable (class: NotSerializableBoxedLong, value: > NotSerializableBoxedLong@71231636) > - element of array (index: 1) > - array (class [Ljava.lang.Object;, size 2) > - element of array (index: 1) > - array (class [Ljava.lang.Object;, size 3) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class > org.apache.spark.sql.execution.WholeStageCodegenExec, > functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=3]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class > org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, > org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389@185db22c) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
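The hazard above generalizes beyond Spark: eagerly folding a method call into a literal is only safe if the resulting object can survive serialization for shipping to executors. A hedged Python analogy (pickle stands in for Spark's closure serializer; none of these names are from Spark's codebase):

```python
import pickle

# Stand-in for a result type that cannot be serialized, like a Scala
# class that does not extend Serializable.
class NotSerializableBoxedLong:
    def __init__(self, v):
        self.v = v
    def __reduce__(self):
        # Make pickling fail deliberately.
        raise TypeError("NotSerializableBoxedLong is not serializable")

def constant_fold(value):
    """A defensive folding rule: only embed the evaluated result as a
    literal in the plan if it round-trips through the serializer;
    otherwise leave the invocation unfolded for runtime evaluation."""
    try:
        pickle.dumps(value)
        return ("Literal", value)
    except Exception:
        return ("Invoke", value)

print(constant_fold(42)[0])                           # Literal
print(constant_fold(NotSerializableBoxedLong(42))[0])  # Invoke
```

The regression reported here is the unchecked version of this rule: the folded `NotSerializableBoxedLong` literal gets captured by the generated code's closure, and serialization fails later with the `NotSerializableException` shown above.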
[jira] [Assigned] (SPARK-40380) Constant-folding of InvokeLike should not result in non-serializable result
[ https://issues.apache.org/jira/browse/SPARK-40380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40380: Assignee: Apache Spark > Constant-folding of InvokeLike should not result in non-serializable result > --- > > Key: SPARK-40380 > URL: https://issues.apache.org/jira/browse/SPARK-40380 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kris Mok >Assignee: Apache Spark >Priority: Major > > SPARK-37907 added constant-folding support to the {{InvokeLike}} family of > expressions. Unfortunately it introduced a regression for cases when a > constant-folded {{InvokeLike}} expression returned a non-serializable result. > {{ExpressionEncoder}}s is an area where this problem may be exposed, e.g. > when using sparksql-scalapb on Spark 3.3.0+. > Below is a minimal repro to demonstrate this issue: > {code:scala} > import org.apache.spark.sql.Column > import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute > import org.apache.spark.sql.catalyst.expressions.Literal > import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, > StaticInvoke} > import org.apache.spark.sql.types.{LongType, ObjectType} > class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = > longVal + other } > case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): > NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) } > val litExpr = Literal.fromObject(SerializableBoxedLong(42L), > ObjectType(classOf[SerializableBoxedLong])) > val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", > ObjectType(classOf[NotSerializableBoxedLong])) > val addExpr = Invoke(toNotSerializableExpr, "add", LongType, > Seq(UnresolvedAttribute.quotedString("id"))) > val df = spark.range(2).select(new Column(addExpr)) > df.collect > {code} > Before SPARK-37907, this example would run fine and result in {{[[42], > [43]]}}. 
But after SPARK-37907, it'd fail with: > {code:none} > ... > Caused by: java.io.NotSerializableException: NotSerializableBoxedLong > Serialization stack: > - object not serializable (class: NotSerializableBoxedLong, value: > NotSerializableBoxedLong@71231636) > - element of array (index: 1) > - array (class [Ljava.lang.Object;, size 2) > - element of array (index: 1) > - array (class [Ljava.lang.Object;, size 3) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class > org.apache.spark.sql.execution.WholeStageCodegenExec, > functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=3]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class > org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, > org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389@185db22c) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org 
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40380) Constant-folding of InvokeLike should not result in non-serializable result
[ https://issues.apache.org/jira/browse/SPARK-40380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601472#comment-17601472 ] Apache Spark commented on SPARK-40380: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/37823 > Constant-folding of InvokeLike should not result in non-serializable result > --- > > Key: SPARK-40380 > URL: https://issues.apache.org/jira/browse/SPARK-40380 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kris Mok >Priority: Major > > SPARK-37907 added constant-folding support to the {{InvokeLike}} family of > expressions. Unfortunately it introduced a regression for cases when a > constant-folded {{InvokeLike}} expression returned a non-serializable result. > {{ExpressionEncoder}}s is an area where this problem may be exposed, e.g. > when using sparksql-scalapb on Spark 3.3.0+. > Below is a minimal repro to demonstrate this issue: > {code:scala} > import org.apache.spark.sql.Column > import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute > import org.apache.spark.sql.catalyst.expressions.Literal > import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, > StaticInvoke} > import org.apache.spark.sql.types.{LongType, ObjectType} > class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = > longVal + other } > case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): > NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) } > val litExpr = Literal.fromObject(SerializableBoxedLong(42L), > ObjectType(classOf[SerializableBoxedLong])) > val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", > ObjectType(classOf[NotSerializableBoxedLong])) > val addExpr = Invoke(toNotSerializableExpr, "add", LongType, > Seq(UnresolvedAttribute.quotedString("id"))) > val df = spark.range(2).select(new Column(addExpr)) > df.collect > {code} > Before SPARK-37907, 
this example would run fine and result in {{[[42], > [43]]}}. But after SPARK-37907, it'd fail with: > {code:none} > ... > Caused by: java.io.NotSerializableException: NotSerializableBoxedLong > Serialization stack: > - object not serializable (class: NotSerializableBoxedLong, value: > NotSerializableBoxedLong@71231636) > - element of array (index: 1) > - array (class [Ljava.lang.Object;, size 2) > - element of array (index: 1) > - array (class [Ljava.lang.Object;, size 3) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class > org.apache.spark.sql.execution.WholeStageCodegenExec, > functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=3]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class > org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, > org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389@185db22c) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - 
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40381) Support standalone worker recommission
[ https://issues.apache.org/jira/browse/SPARK-40381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40381: Assignee: (was: Apache Spark) > Support standalone worker recommission > -- > > Key: SPARK-40381 > URL: https://issues.apache.org/jira/browse/SPARK-40381 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > Currently, Spark standalone only supports killing workers. We may want to > recommission some workers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40381) Support standalone worker recommission
[ https://issues.apache.org/jira/browse/SPARK-40381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601468#comment-17601468 ] Apache Spark commented on SPARK-40381: -- User 'warrenzhu25' has created a pull request for this issue: https://github.com/apache/spark/pull/37822 > Support standalone worker recommission > -- > > Key: SPARK-40381 > URL: https://issues.apache.org/jira/browse/SPARK-40381 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > Currently, Spark standalone only supports killing workers. We may want to > recommission some workers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40381) Support standalone worker recommission
[ https://issues.apache.org/jira/browse/SPARK-40381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601469#comment-17601469 ] Apache Spark commented on SPARK-40381: -- User 'warrenzhu25' has created a pull request for this issue: https://github.com/apache/spark/pull/37822 > Support standalone worker recommission > -- > > Key: SPARK-40381 > URL: https://issues.apache.org/jira/browse/SPARK-40381 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > Currently, Spark standalone only supports killing workers. We may want to > recommission some workers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40381) Support standalone worker recommission
[ https://issues.apache.org/jira/browse/SPARK-40381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40381: Assignee: Apache Spark > Support standalone worker recommission > -- > > Key: SPARK-40381 > URL: https://issues.apache.org/jira/browse/SPARK-40381 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Apache Spark >Priority: Minor > > Currently, Spark standalone only supports killing workers. We may want to > recommission some workers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40381) Support standalone worker recommission
Zhongwei Zhu created SPARK-40381: Summary: Support standalone worker recommission Key: SPARK-40381 URL: https://issues.apache.org/jira/browse/SPARK-40381 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 3.3.0 Reporter: Zhongwei Zhu Currently, Spark standalone only supports killing workers. We may want to recommission some workers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40380) Constant-folding of InvokeLike should not result in non-serializable result
Kris Mok created SPARK-40380: Summary: Constant-folding of InvokeLike should not result in non-serializable result Key: SPARK-40380 URL: https://issues.apache.org/jira/browse/SPARK-40380 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Kris Mok
SPARK-37907 added constant-folding support to the {{InvokeLike}} family of expressions. Unfortunately, it introduced a regression for cases where a constant-folded {{InvokeLike}} expression returns a non-serializable result. {{ExpressionEncoder}}s are one area where this problem may surface, e.g. when using sparksql-scalapb on Spark 3.3.0+. Below is a minimal repro demonstrating the issue:
{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke}
import org.apache.spark.sql.types.{LongType, ObjectType}

class NotSerializableBoxedLong(longVal: Long) {
  def add(other: Long): Long = longVal + other
}
case class SerializableBoxedLong(longVal: Long) {
  def toNotSerializable(): NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal)
}

val litExpr = Literal.fromObject(SerializableBoxedLong(42L), ObjectType(classOf[SerializableBoxedLong]))
val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", ObjectType(classOf[NotSerializableBoxedLong]))
val addExpr = Invoke(toNotSerializableExpr, "add", LongType, Seq(UnresolvedAttribute.quotedString("id")))
val df = spark.range(2).select(new Column(addExpr))
df.collect
{code}
Before SPARK-37907, this example would run fine and result in {{[[42], [43]]}}. But after SPARK-37907, it fails with:
{code:none}
...
Caused by: java.io.NotSerializableException: NotSerializableBoxedLong
Serialization stack:
  - object not serializable (class: NotSerializableBoxedLong, value: NotSerializableBoxedLong@71231636)
  - element of array (index: 1)
  - array (class [Ljava.lang.Object;, size 2)
  - element of array (index: 1)
  - array (class [Ljava.lang.Object;, size 3)
  - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
  - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=3])
  - writeReplace data (class: java.lang.invoke.SerializedLambda)
  - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389@185db22c)
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441)
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
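The failure above is the generic JVM closure-serialization pitfall: once the optimizer pre-computes (constant-folds) the intermediate object, that object has to travel inside a serialized task closure. A rough analogy using plain Python pickling rather than Spark's Java serialization (the `SerializableBox`/`NotSerializableBox` names are invented for illustration, mirroring the Scala repro's classes):

```python
import pickle

class SerializableBox:
    """Stand-in for SerializableBoxedLong: holds only plain data."""
    def __init__(self, val):
        self.val = val

class NotSerializableBox:
    """Stand-in for NotSerializableBoxedLong: stores a lambda, which pickle rejects."""
    def __init__(self, val):
        self.add = lambda other: val + other  # closures/lambdas are not picklable

def is_picklable(obj):
    """True if obj survives a serialization round-trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

# "Constant-folding" an intermediate value ahead of time is only safe
# if the pre-computed object can travel inside a serialized task closure.
folded_ok = SerializableBox(42)
folded_bad = NotSerializableBox(42)

print(is_picklable(folded_ok))   # True
print(is_picklable(folded_bad))  # False
```

In the same way, folding `toNotSerializable()` into a literal is fine for evaluation but breaks as soon as the result must be shipped to executors.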
[jira] [Assigned] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s
[ https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40379: Assignee: Apache Spark (was: Holden Karau) > Propagate decommission executor loss reason during onDisconnect in K8s > -- > > Key: SPARK-40379 > URL: https://issues.apache.org/jira/browse/SPARK-40379 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Minor > > Currently if an executor has been sent a decommission message and then it > disconnects from the scheduler we only disable the executor depending on the > K8s status events to drive the rest of the state transitions. However, the > K8s status events can become overwhelmed on large clusters so we should check > if an executor is in a decommissioning state when it is disconnected and use > that reason instead of waiting on the K8s status events so we have more > accurate logging information. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s
[ https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601386#comment-17601386 ] Apache Spark commented on SPARK-40379: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/37821 > Propagate decommission executor loss reason during onDisconnect in K8s > -- > > Key: SPARK-40379 > URL: https://issues.apache.org/jira/browse/SPARK-40379 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > Currently if an executor has been sent a decommission message and then it > disconnects from the scheduler we only disable the executor depending on the > K8s status events to drive the rest of the state transitions. However, the > K8s status events can become overwhelmed on large clusters so we should check > if an executor is in a decommissioning state when it is disconnected and use > that reason instead of waiting on the K8s status events so we have more > accurate logging information. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s
[ https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40379: Assignee: Holden Karau (was: Apache Spark) > Propagate decommission executor loss reason during onDisconnect in K8s > -- > > Key: SPARK-40379 > URL: https://issues.apache.org/jira/browse/SPARK-40379 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > Currently if an executor has been sent a decommission message and then it > disconnects from the scheduler we only disable the executor depending on the > K8s status events to drive the rest of the state transitions. However, the > K8s status events can become overwhelmed on large clusters so we should check > if an executor is in a decommissioning state when it is disconnected and use > that reason instead of waiting on the K8s status events so we have more > accurate logging information. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s
[ https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601385#comment-17601385 ] Apache Spark commented on SPARK-40379: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/37821 > Propagate decommission executor loss reason during onDisconnect in K8s > -- > > Key: SPARK-40379 > URL: https://issues.apache.org/jira/browse/SPARK-40379 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > Currently if an executor has been sent a decommission message and then it > disconnects from the scheduler we only disable the executor depending on the > K8s status events to drive the rest of the state transitions. However, the > K8s status events can become overwhelmed on large clusters so we should check > if an executor is in a decommissioning state when it is disconnected and use > that reason instead of waiting on the K8s status events so we have more > accurate logging information. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40295) Allow v2 functions with literal args in write distribution and ordering
[ https://issues.apache.org/jira/browse/SPARK-40295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-40295. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37749 [https://github.com/apache/spark/pull/37749] > Allow v2 functions with literal args in write distribution and ordering > --- > > Key: SPARK-40295 > URL: https://issues.apache.org/jira/browse/SPARK-40295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.4.0 > > > Spark should allow v2 function with literal args in write distribution and > ordering. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40295) Allow v2 functions with literal args in write distribution and ordering
[ https://issues.apache.org/jira/browse/SPARK-40295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-40295: Assignee: Anton Okolnychyi > Allow v2 functions with literal args in write distribution and ordering > --- > > Key: SPARK-40295 > URL: https://issues.apache.org/jira/browse/SPARK-40295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > Spark should allow v2 function with literal args in write distribution and > ordering. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40378) What is React Native.
[ https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikhil Sharma updated SPARK-40378: -- Summary: What is React Native. (was: React Native is an open source framework for building mobile apps. It was created by Facebook and is designed for cross-platform capability.) > What is React Native. > - > > Key: SPARK-40378 > URL: https://issues.apache.org/jira/browse/SPARK-40378 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.0.3 >Reporter: Nikhil Sharma >Priority: Major > Fix For: 3.1.3 > > > React Native is an open-source framework for building mobile apps. It was > created by Facebook and is designed for cross-platform capability. It can be > tough to choose between an excellent user experience, a beautiful user > interface, and fast processing, but [React Native online > course|https://www.igmguru.com/digital-marketing-programming/react-native-training/] > makes that decision an easy one with powerful native development. Jordan > Walke found a way to generate UI elements from a javascript thread and > applied it to iOS to build the first native application. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s
Holden Karau created SPARK-40379: Summary: Propagate decommission executor loss reason during onDisconnect in K8s Key: SPARK-40379 URL: https://issues.apache.org/jira/browse/SPARK-40379 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Core Affects Versions: 3.4.0 Reporter: Holden Karau Assignee: Holden Karau Currently if an executor has been sent a decommission message and then it disconnects from the scheduler we only disable the executor depending on the K8s status events to drive the rest of the state transitions. However, the K8s status events can become overwhelmed on large clusters so we should check if an executor is in a decommissioning state when it is disconnected and use that reason instead of waiting on the K8s status events so we have more accurate logging information. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40378) React Native is an open source framework for building mobile apps. It was created by Facebook and is designed for cross-platform capability.
Nikhil Sharma created SPARK-40378: - Summary: React Native is an open source framework for building mobile apps. It was created by Facebook and is designed for cross-platform capability. Key: SPARK-40378 URL: https://issues.apache.org/jira/browse/SPARK-40378 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 3.0.3 Reporter: Nikhil Sharma Fix For: 3.1.3 React Native is an open-source framework for building mobile apps. It was created by Facebook and is designed for cross-platform capability. It can be tough to choose between an excellent user experience, a beautiful user interface, and fast processing, but [React Native online course|https://www.igmguru.com/digital-marketing-programming/react-native-training/] makes that decision an easy one with powerful native development. Jordan Walke found a way to generate UI elements from a javascript thread and applied it to iOS to build the first native application. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-40149: Fix Version/s: 3.2.3
> Star expansion after outer join asymmetrically includes joining key
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
> Reporter: Otakar Truněček
> Priority: Blocker
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> When star expansion is used on the left side of a join, the result will include
> the joining key, while on the right side of the join it doesn't. I would expect the
> behaviour to be symmetric (either include it on both sides or on neither).
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
>     df_left
>     .alias('left')
>     .join(df_right.alias('right'), on='id', how='full_outer')
>     .withColumn('left_all', f.struct('left.*'))
>     .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not included on either side.
> Result from Spark 3.1.3:
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+
> {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38961) Enhance to automatically generate the pandas API support list
[ https://issues.apache.org/jira/browse/SPARK-38961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601358#comment-17601358 ] Apache Spark commented on SPARK-38961: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37820 > Enhance to automatically generate the pandas API support list > - > > Key: SPARK-38961 > URL: https://issues.apache.org/jira/browse/SPARK-38961 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Hyunwoo Park >Priority: Major > Fix For: 3.4.0 > > > Currently, the supported pandas API list is manually maintained, so it would > be better to make the list automatically generated to reduce the maintenance > cost. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38961) Enhance to automatically generate the pandas API support list
[ https://issues.apache.org/jira/browse/SPARK-38961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601359#comment-17601359 ] Apache Spark commented on SPARK-38961: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37820 > Enhance to automatically generate the pandas API support list > - > > Key: SPARK-38961 > URL: https://issues.apache.org/jira/browse/SPARK-38961 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Hyunwoo Park >Priority: Major > Fix For: 3.4.0 > > > Currently, the supported pandas API list is manually maintained, so it would > be better to make the list automatically generated to reduce the maintenance > cost. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40376) `np.bool` will be deprecated
[ https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601311#comment-17601311 ] Elhoussine Talab commented on SPARK-40376: -- Thanks [~srowen] > `np.bool` will be deprecated > > > Key: SPARK-40376 > URL: https://issues.apache.org/jira/browse/SPARK-40376 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.3.0 >Reporter: Elhoussine Talab >Priority: Trivial > > Using `np.bool` generates this warning: > {quote} > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > 3070E `np.bool` is a deprecated alias for the builtin > `bool`. To silence this warning, use `bool` by itself. Doing this will not > modify any behavior and is safe. If you specifically wanted the numpy scalar > type, use `np.bool_` here. > 3071E Deprecated in NumPy 1.20; for more details and > guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] > {quote} > > See Numpy's deprecation statement here: > [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40376) `np.bool` will be deprecated
[ https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40376: - Issue Type: Improvement (was: Bug) Priority: Trivial (was: Major) This is not a bug > `np.bool` will be deprecated > > > Key: SPARK-40376 > URL: https://issues.apache.org/jira/browse/SPARK-40376 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.3.0 >Reporter: Elhoussine Talab >Priority: Trivial > > Using `np.bool` generates this warning: > {quote} > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > 3070E `np.bool` is a deprecated alias for the builtin > `bool`. To silence this warning, use `bool` by itself. Doing this will not > modify any behavior and is safe. If you specifically wanted the numpy scalar > type, use `np.bool_` here. > 3071E Deprecated in NumPy 1.20; for more details and > guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] > {quote} > > See Numpy's deprecation statement here: > [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
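The change the deprecation warning requests is mechanical. A minimal sketch, assuming NumPy >= 1.20 (in NumPy >= 1.24 the `np.bool` alias is removed outright, so the commented-out line would raise AttributeError rather than a warning):

```python
import numpy as np

# Deprecated alias (and removed in NumPy >= 1.24):
# arr = np.array([1, 0], dtype=np.bool)

# Use the plain builtin `bool` as the dtype instead...
arr = np.array([1, 0], dtype=bool)

# ...or np.bool_ when the NumPy scalar type is specifically needed.
assert arr.dtype == np.bool_
assert arr.tolist() == [True, False]
```

Both spellings produce the same boolean dtype; only the deprecated `np.bool` alias needs to go.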
[jira] [Assigned] (SPARK-40377) Allow customize maxBroadcastTableBytes and maxBroadcastRows
[ https://issues.apache.org/jira/browse/SPARK-40377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40377: Assignee: (was: Apache Spark) > Allow customize maxBroadcastTableBytes and maxBroadcastRows > --- > > Key: SPARK-40377 > URL: https://issues.apache.org/jira/browse/SPARK-40377 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: LLiu >Priority: Major > Attachments: 截屏2022-09-07 20.40.06.png, 截屏2022-09-07 20.40.16.png > > > Recently, we encountered some driver OOM problems. Some tables with large > data volume were compressed using Snappy and then broadcast join was > performed, but the actual data volume was too large, which resulted in driver > OOM. > The values of maxBroadcastTableBytes and maxBroadcastRows are hardcoded, 8GB > and 51200 respectively. Maybe we can allow customization of these values, > configure smaller values according to different scenarios, and prohibit > broadcast joins for tables with large data volumes to avoid driver OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40377) Allow customize maxBroadcastTableBytes and maxBroadcastRows
[ https://issues.apache.org/jira/browse/SPARK-40377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40377: Assignee: Apache Spark > Allow customize maxBroadcastTableBytes and maxBroadcastRows > --- > > Key: SPARK-40377 > URL: https://issues.apache.org/jira/browse/SPARK-40377 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: LLiu >Assignee: Apache Spark >Priority: Major > Attachments: 截屏2022-09-07 20.40.06.png, 截屏2022-09-07 20.40.16.png > > > Recently, we encountered some driver OOM problems. Some tables with large > data volume were compressed using Snappy and then broadcast join was > performed, but the actual data volume was too large, which resulted in driver > OOM. > The values of maxBroadcastTableBytes and maxBroadcastRows are hardcoded, 8GB > and 51200 respectively. Maybe we can allow customization of these values, > configure smaller values according to different scenarios, and prohibit > broadcast joins for tables with large data volumes to avoid driver OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40377) Allow customize maxBroadcastTableBytes and maxBroadcastRows
[ https://issues.apache.org/jira/browse/SPARK-40377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601297#comment-17601297 ] Apache Spark commented on SPARK-40377: -- User 'LLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/37819 > Allow customize maxBroadcastTableBytes and maxBroadcastRows > --- > > Key: SPARK-40377 > URL: https://issues.apache.org/jira/browse/SPARK-40377 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: LLiu >Priority: Major > Attachments: 截屏2022-09-07 20.40.06.png, 截屏2022-09-07 20.40.16.png > > > Recently, we encountered some driver OOM problems. Some tables with large > data volume were compressed using Snappy and then broadcast join was > performed, but the actual data volume was too large, which resulted in driver > OOM. > The values of maxBroadcastTableBytes and maxBroadcastRows are hardcoded, 8GB > and 51200 respectively. Maybe we can allow customization of these values, > configure smaller values according to different scenarios, and prohibit > broadcast joins for tables with large data volumes to avoid driver OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
[ https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601293#comment-17601293 ] Karl Håkan Nordgren commented on SPARK-38218: - [~hyukjin.kwon] : Could I ask for an update? The ticket is marked as "Resolved" but the issue persists here https://spark.apache.org/downloads.html > Looks like the wrong package is available on the spark downloads page. The > name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 > - > > Key: SPARK-38218 > URL: https://issues.apache.org/jira/browse/SPARK-38218 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Mehul Batra >Priority: Major > Attachments: Screenshot_20220214-013156.jpg, > image-2022-02-16-12-26-32-871.png > > > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > Does the tgz actually contain Hadoop 3.3 with a mislabeled file name, or is it the > Hadoop 3.2 version only? > If it is 3.3, does that Hadoop build come with S3 magic committer support? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40377) Allow customize maxBroadcastTableBytes and maxBroadcastRows
[ https://issues.apache.org/jira/browse/SPARK-40377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LLiu updated SPARK-40377: - Attachment: 截屏2022-09-07 20.40.06.png 截屏2022-09-07 20.40.16.png > Allow customize maxBroadcastTableBytes and maxBroadcastRows > --- > > Key: SPARK-40377 > URL: https://issues.apache.org/jira/browse/SPARK-40377 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: LLiu >Priority: Major > Attachments: 截屏2022-09-07 20.40.06.png, 截屏2022-09-07 20.40.16.png > > > Recently, we encountered some driver OOM problems. Some tables with large > data volume were compressed using Snappy and then broadcast join was > performed, but the actual data volume was too large, which resulted in driver > OOM. > The values of maxBroadcastTableBytes and maxBroadcastRows are hardcoded, 8GB > and 51200 respectively. Maybe we can allow customization of these values, > configure smaller values according to different scenarios, and prohibit > broadcast joins for tables with large data volumes to avoid driver OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40377) Allow customize maxBroadcastTableBytes and maxBroadcastRows
LLiu created SPARK-40377: Summary: Allow customize maxBroadcastTableBytes and maxBroadcastRows Key: SPARK-40377 URL: https://issues.apache.org/jira/browse/SPARK-40377 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: LLiu Recently, we encountered some driver OOM problems: tables with a large data volume were compressed using Snappy and then broadcast-joined, but the actual data volume was too large, which resulted in driver OOM. The values of maxBroadcastTableBytes and maxBroadcastRows are hardcoded to 8GB and 51200 respectively. Perhaps we could allow these values to be customized, so that smaller limits can be configured for different scenarios and broadcast joins can be prohibited for tables with large data volumes, avoiding driver OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
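Until such limits are configurable, a common mitigation (a sketch based on the standard `spark.sql.autoBroadcastJoinThreshold` setting, not something proposed in this ticket; the app name is hypothetical) is to cap or disable automatic broadcast joins for jobs known to touch large Snappy-compressed tables:

```python
from pyspark.sql import SparkSession

# Cap the estimated table size at which Spark plans a broadcast join
# (10 MB here); setting the threshold to -1 disables automatic
# broadcast joins entirely.
spark = (
    SparkSession.builder
    .appName("broadcast-capped-job")  # hypothetical app name
    .config("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
    .getOrCreate()
)
```

Note that this threshold is compared against Spark's size estimate for the table, which for on-disk sources can be the compressed size; underestimation of that kind is consistent with the OOM described above.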
[jira] [Assigned] (SPARK-40185) Remove column suggestion when the candidate list is empty for unresolved column/attribute/map key
[ https://issues.apache.org/jira/browse/SPARK-40185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40185: --- Assignee: Vitalii Li > Remove column suggestion when the candidate list is empty for unresolved > column/attribute/map key > - > > Key: SPARK-40185 > URL: https://issues.apache.org/jira/browse/SPARK-40185 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vitalii Li >Assignee: Vitalii Li >Priority: Major > > For an unresolved column, attribute or map key, the error message might contain > suggestions from the candidate list. However, when the list is empty the error message > looks incomplete: > `[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot > be resolved. Did you mean one of the following? []` > This issue is to show the final suggestion only if the suggestion list is > non-empty. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40185) Remove column suggestion when the candidate list is empty for unresolved column/attribute/map key
[ https://issues.apache.org/jira/browse/SPARK-40185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40185. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37621 [https://github.com/apache/spark/pull/37621] > Remove column suggestion when the candidate list is empty for unresolved > column/attribute/map key > - > > Key: SPARK-40185 > URL: https://issues.apache.org/jira/browse/SPARK-40185 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vitalii Li >Assignee: Vitalii Li >Priority: Major > Fix For: 3.4.0 > > > For an unresolved column, attribute or map key, the error message might contain > suggestions from the candidate list. However, when the list is empty the error message > looks incomplete: > `[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot > be resolved. Did you mean one of the following? []` > This issue is to show the final suggestion only if the suggestion list is > non-empty. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
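The intended behaviour can be sketched in a few lines of Python (a hypothetical helper for illustration only; Spark's actual implementation lives in its Scala error-message framework):

```python
def format_unresolved_column_error(name, candidates):
    """Build the UNRESOLVED_COLUMN message, appending the
    'Did you mean...' hint only when candidates exist."""
    msg = (f"[UNRESOLVED_COLUMN] A column or function parameter "
           f"with name '{name}' cannot be resolved.")
    if candidates:
        msg += " Did you mean one of the following? [" + ", ".join(candidates) + "]"
    return msg
```

With an empty candidate list the message simply stops after "cannot be resolved", instead of trailing off with an empty `[]`.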
[jira] [Commented] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601258#comment-17601258 ] Apache Spark commented on SPARK-40149: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/37818 > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Priority: Blocker > Fix For: 3.4.0, 3.3.1 > >
> When star expansion is used on the left side of a join, the result will include the joining key, while on the right side it doesn't. I would expect the behaviour to be symmetric (either include the key on both sides or on neither).
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
>     df_left
>     .alias('left')
>     .join(df_right.alias('right'), on='id', how='full_outer')
>     .withColumn('left_all', f.struct('left.*'))
>     .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not included on either side.
> Result from Spark 3.1.3
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+
> {code}
> I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
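The symmetry the reporter expects can be sketched in plain Python (a toy model of star expansion over dicts, not Spark code; the column names are illustrative):

```python
def expand_side(row, side_columns, join_key, include_key=False):
    """Toy model of struct(side.*): keep that side's columns and apply
    one consistent rule for whether the join key is included."""
    return {c: row[c] for c in side_columns if include_key or c != join_key}

# One joined row after a full outer join on `id`.
row = {"id": 3, "left_val": "left", "right_val": "right"}

left_all = expand_side(row, ["id", "left_val"], "id")
right_all = expand_side(row, ["id", "right_val"], "id")

# Whatever the rule is, it should apply to both sides equally.
assert ("id" in left_all) == ("id" in right_all)
```

The bug report is precisely that Spark 3.2.x applies `include_key=True` on the left side and `include_key=False` on the right, whereas 3.1.x excluded the key on both.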
[jira] [Assigned] (SPARK-40376) `np.bool` will be deprecated
[ https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40376: Assignee: Apache Spark > `np.bool` will be deprecated > > > Key: SPARK-40376 > URL: https://issues.apache.org/jira/browse/SPARK-40376 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.3.0 >Reporter: Elhoussine Talab >Assignee: Apache Spark >Priority: Major > > Using `np.bool` generates this warning: > ``` > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > 3070E `np.bool` is a deprecated alias for the builtin > `bool`. To silence this warning, use `bool` by itself. Doing this will not > modify any behavior and is safe. If you specifically wanted the numpy scalar > type, use `np.bool_` here. > 3071E Deprecated in NumPy 1.20; for more details and > guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations > ``` > > See Numpy's deprecation statement here: > https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40376) `np.bool` will be deprecated
[ https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Elhoussine Talab updated SPARK-40376: - Description: Using `np.bool` generates this warning: {quote} UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation. 3070E `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. 3071E Deprecated in NumPy 1.20; for more details and guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] {quote} See Numpy's deprecation statement here: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] was: Using `np.bool` generates this warning: ``` UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation. 3070E `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. 
3071E Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` See Numpy's deprecation statement here: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations > `np.bool` will be deprecated > > > Key: SPARK-40376 > URL: https://issues.apache.org/jira/browse/SPARK-40376 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.3.0 >Reporter: Elhoussine Talab >Priority: Major > > Using `np.bool` generates this warning: > {quote} > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > 3070E `np.bool` is a deprecated alias for the builtin > `bool`. To silence this warning, use `bool` by itself. Doing this will not > modify any behavior and is safe. If you specifically wanted the numpy scalar > type, use `np.bool_` here. > 3071E Deprecated in NumPy 1.20; for more details and > guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] > {quote} > > See Numpy's deprecation statement here: > [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40376) `np.bool` will be deprecated
[ https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601256#comment-17601256 ] Apache Spark commented on SPARK-40376: -- User 'ELHoussineT' has created a pull request for this issue: https://github.com/apache/spark/pull/37817 > `np.bool` will be deprecated > > > Key: SPARK-40376 > URL: https://issues.apache.org/jira/browse/SPARK-40376 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.3.0 >Reporter: Elhoussine Talab >Priority: Major > > Using `np.bool` generates this warning: > {quote} > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > 3070E `np.bool` is a deprecated alias for the builtin > `bool`. To silence this warning, use `bool` by itself. Doing this will not > modify any behavior and is safe. If you specifically wanted the numpy scalar > type, use `np.bool_` here. > 3071E Deprecated in NumPy 1.20; for more details and > guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] > {quote} > > See Numpy's deprecation statement here: > [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40376) `np.bool` will be deprecated
[ https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40376: Assignee: (was: Apache Spark) > `np.bool` will be deprecated > > > Key: SPARK-40376 > URL: https://issues.apache.org/jira/browse/SPARK-40376 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.3.0 >Reporter: Elhoussine Talab >Priority: Major > > Using `np.bool` generates this warning: > ``` > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > 3070E `np.bool` is a deprecated alias for the builtin > `bool`. To silence this warning, use `bool` by itself. Doing this will not > modify any behavior and is safe. If you specifically wanted the numpy scalar > type, use `np.bool_` here. > 3071E Deprecated in NumPy 1.20; for more details and > guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations > ``` > > See Numpy's deprecation statement here: > https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40376) `np.bool` will be deprecated
Elhoussine Talab created SPARK-40376: Summary: `np.bool` will be deprecated Key: SPARK-40376 URL: https://issues.apache.org/jira/browse/SPARK-40376 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.3.0 Reporter: Elhoussine Talab Using `np.bool` generates this warning: ``` UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation. 3070E `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. 3071E Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` See Numpy's deprecation statement here: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
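The fix amounts to replacing the deprecated alias with the builtin, or with the NumPy scalar type where dtype semantics matter; a minimal illustration:

```python
import numpy as np

# `np.bool` was a deprecated alias for the builtin `bool` (and is removed
# in NumPy >= 1.24), so code should use `bool` itself, or `np.bool_` when
# the NumPy boolean scalar type is specifically required.
flag = bool(1)           # builtin bool: the usual replacement
scalar = np.bool_(True)  # NumPy boolean scalar, e.g. for dtype checks

assert isinstance(flag, bool)
assert bool(scalar) is True
```

Using `np.bool_` rather than `bool` matters only where NumPy's own scalar type is required (e.g. comparing dtypes); everywhere else the builtin is the drop-in replacement.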
[jira] [Resolved] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40149. - Fix Version/s: 3.3.1 3.4.0 Resolution: Fixed Issue resolved by pull request 37758 [https://github.com/apache/spark/pull/37758] > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Priority: Blocker > Fix For: 3.3.1, 3.4.0 > >
> When star expansion is used on the left side of a join, the result will include the joining key, while on the right side it doesn't. I would expect the behaviour to be symmetric (either include the key on both sides or on neither).
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
>     df_left
>     .alias('left')
>     .join(df_right.alias('right'), on='id', how='full_outer')
>     .withColumn('left_all', f.struct('left.*'))
>     .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not included on either side.
> Result from Spark 3.1.3
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+
> {code}
> I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601228#comment-17601228 ] Apache Spark commented on SPARK-40332: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37816 > Implement `GroupBy.quantile`. > - > > Key: SPARK-40332 > URL: https://issues.apache.org/jira/browse/SPARK-40332 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We should implement `GroupBy.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40332) Implement `GroupBy.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40332: Assignee: Apache Spark > Implement `GroupBy.quantile`. > - > > Key: SPARK-40332 > URL: https://issues.apache.org/jira/browse/SPARK-40332 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > We should implement `GroupBy.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40332) Implement `GroupBy.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40332: Assignee: (was: Apache Spark) > Implement `GroupBy.quantile`. > - > > Key: SPARK-40332 > URL: https://issues.apache.org/jira/browse/SPARK-40332 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We should implement `GroupBy.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
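The pandas behaviour that `GroupBy.quantile` would need to match can be seen with plain pandas (plain pandas here, since the pandas-on-Spark method did not exist at the time of this ticket):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b"],
                   "val": [1.0, 3.0, 2.0, 4.0]})

# Per-group median (the 0.5 quantile), computed with linear interpolation.
medians = df.groupby("key")["val"].quantile(0.5)

assert medians["a"] == 2.0  # midpoint of 1.0 and 3.0
assert medians["b"] == 3.0  # midpoint of 2.0 and 4.0
```

Matching the interpolation semantics (pandas defaults to linear interpolation between data points) is the main subtlety for a distributed implementation.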
[jira] [Commented] (SPARK-40298) shuffle data recovery on the reused PVCs no effect
[ https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601225#comment-17601225 ] todd commented on SPARK-40298: -- [~dongjoon] I use an AWS spot cluster, which terminates the instance where the pod is located during the shuffle read phase of a Spark stage. The affected tasks then throw FetchFailed and ExecutorLostFailure exceptions. I hope that by reusing the PVC, only the currently failed task will be recomputed, not the previous stage. We use this feature to avoid Spark task recomputation when the AWS spot cluster recycles machines, thereby saving computing costs. > shuffle data recovery on the reused PVCs no effect > --- > > Key: SPARK-40298 > URL: https://issues.apache.org/jira/browse/SPARK-40298 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: 1662002808396.jpg, 1662002822097.jpg > > > I use Spark 3.2.2 to test the [ Support shuffle data recovery on the reused > PVCs (SPARK-35593) ] feature. I found that when a shuffle read fails, data is > still recomputed from the source.
> It can be confirmed that the PVC has been reused by other pods, and that the > index and data file information has been sent.
> *This is my Spark configuration:*
> --conf spark.driver.memory=5G
> --conf spark.executor.memory=15G
> --conf spark.executor.cores=1
> --conf spark.executor.instances=50
> --conf spark.sql.shuffle.partitions=50
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
> --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
> --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
> --conf spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
> --conf spark.kubernetes.executor.missingPodDetectDelta=10s
> --conf spark.kubernetes.executor.apiPollingInterval=10s
> --conf spark.shuffle.io.retryWait=60s
> --conf spark.shuffle.io.maxRetries=5
> > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40373) Implement `ps.show_versions`
Haejoon Lee created SPARK-40373: --- Summary: Implement `ps.show_versions` Key: SPARK-40373 URL: https://issues.apache.org/jira/browse/SPARK-40373 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We want to have `ps.show_versions` to reach pandas parity. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.show_versions.html. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
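What such a helper would mirror on the pandas side (the capture-to-buffer step is only for demonstration; `ps.show_versions` itself did not exist yet at the time of this ticket):

```python
import io
from contextlib import redirect_stdout

import pandas as pd

# pd.show_versions() prints the pandas version plus the versions of key
# dependencies, the kind of environment report `ps.show_versions` would
# replicate for bug reports.
buf = io.StringIO()
with redirect_stdout(buf):
    pd.show_versions()

report = buf.getvalue()
assert "pandas" in report
```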
[jira] [Updated] (SPARK-40374) Migrate type check failures of type creators onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40374: - Description: Replace TypeCheckFailure by DataTypeMismatch in type checks in the complex type creator expressions: 1. CreateMap(3): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L205-L214 2. CreateNamedStruct(3): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L445-L457 3. UpdateFields(2): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L670-L673 was: Replace TypeCheckFailure by DataTypeMismatch in type checks in collection expressions: 1. SortArray (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 2. ArrayContains (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 3. ArrayPosition (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 4. ElementAt (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 5. Concat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 6. Flatten (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 7. 
Sequence (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 8. ArrayRemove (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 9. ArrayDistinct (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 > Migrate type check failures of type creators onto error classes > --- > > Key: SPARK-40374 > URL: https://issues.apache.org/jira/browse/SPARK-40374 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the complex > type creator expressions: > 1. CreateMap(3): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L205-L214 > 2. CreateNamedStruct(3): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L445-L457 > 3. UpdateFields(2): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L670-L673 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40375) Implement `spark.show_versions`
Haejoon Lee created SPARK-40375: --- Summary: Implement `spark.show_versions` Key: SPARK-40375 URL: https://issues.apache.org/jira/browse/SPARK-40375 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.4.0 Reporter: Haejoon Lee We might want to have `spark.show_versions` to provide useful environment information, similar to [https://pandas.pydata.org/docs/reference/api/pandas.show_versions.html].
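As a rough illustration of what such a helper might collect, here is a sketch in plain Python modeled on pandas.show_versions. The function name, the returned fields, and the output format are hypothetical; a real PySpark implementation would additionally report `pyspark.__version__` and the JVM/Scala versions, which are omitted here to keep the sketch dependency-free:

```python
import platform
import sys


def show_versions() -> dict:
    """Print and return basic environment information, loosely modeled
    on pandas.show_versions (hypothetical sketch, not the PySpark API).
    """
    info = {
        "python": platform.python_version(),
        "executable": sys.executable,
        "OS": platform.system(),
        "machine": platform.machine(),
    }
    for key, value in info.items():
        print(f"{key:>10}: {value}")
    return info


versions = show_versions()
```

Returning the mapping in addition to printing it makes the helper usable both interactively and in bug-report tooling.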
[jira] [Created] (SPARK-40374) Migrate type check failures of type creators onto error classes
Max Gekk created SPARK-40374: Summary: Migrate type check failures of type creators onto error classes Key: SPARK-40374 URL: https://issues.apache.org/jira/browse/SPARK-40374 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Replace TypeCheckFailure with DataTypeMismatch in the type checks of the collection expressions:
1. SortArray (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
2. ArrayContains (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
3. ArrayPosition (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
4. ElementAt (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
5. Concat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
6. Flatten (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
7. Sequence (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
8. ArrayRemove (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
9. ArrayDistinct (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642