date:20220907

[jira] [Created] (SPARK-40388) SQL configuration spark.sql.mapKeyDedupPolicy not always applied

2022-09-07 Thread Paul Praet (Jira)

Paul Praet created SPARK-40388:
--

 Summary: SQL configuration spark.sql.mapKeyDedupPolicy not always 
applied
 Key: SPARK-40388
 URL: https://issues.apache.org/jira/browse/SPARK-40388
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: Paul Praet


I have set spark.sql.mapKeyDedupPolicy to LAST_WIN.

However, I had still one failure where I got
{quote}Caused by: org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 7 in stage 1201.0 failed 4 times, most recent failure: Lost task 
7.3 in stage 1201.0 (TID 1011313) (ip-10-1-34-47.eu-west-1.compute.internal 
executor 228): java.lang.RuntimeException: Duplicate map key domain was found, 
please check the input data. If you want to remove the duplicated keys, you can 
set spark.sql.mapKeyDedupPolicy to LAST_WIN so that the key inserted at last 
takes precedence.
{quote}
We are confident we set the right configuration in SparkConf (we can find it on 
the Spark UI -> Environment).

It is our impression this configuration is not propagated reliably to the 
executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40354) Support eliminate dynamic partition for v1 writes

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40354:


Assignee: Apache Spark

> Support eliminate dynamic partition for v1 writes
> -
>
> Key: SPARK-40354
> URL: https://issues.apache.org/jira/browse/SPARK-40354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> v1 writes will add an extra sort for dynamic columns, e.g.
> {code:java}
> INSERT INTO TABLE t1 PARTITION(p)
> SELECT c1, c2, 'a' as p FROM t2 {code}
> if the dynamic columns are foldable, we can optimize it to:
> {code:java}
> INSERT INTO TABLE t1 PARTITION(p='a')
> SELECT c1, c2 FROM t2 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40354) Support eliminate dynamic partition for v1 writes

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40354:


Assignee: (was: Apache Spark)

> Support eliminate dynamic partition for v1 writes
> -
>
> Key: SPARK-40354
> URL: https://issues.apache.org/jira/browse/SPARK-40354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> v1 writes will add an extra sort for dynamic columns, e.g.
> {code:java}
> INSERT INTO TABLE t1 PARTITION(p)
> SELECT c1, c2, 'a' as p FROM t2 {code}
> if the dynamic columns are foldable, we can optimize it to:
> {code:java}
> INSERT INTO TABLE t1 PARTITION(p='a')
> SELECT c1, c2 FROM t2 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40354) Support eliminate dynamic partition for v1 writes

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601673#comment-17601673
 ] 

Apache Spark commented on SPARK-40354:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37831

> Support eliminate dynamic partition for v1 writes
> -
>
> Key: SPARK-40354
> URL: https://issues.apache.org/jira/browse/SPARK-40354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> v1 writes will add an extra sort for dynamic columns, e.g.
> {code:java}
> INSERT INTO TABLE t1 PARTITION(p)
> SELECT c1, c2, 'a' as p FROM t2 {code}
> if the dynamic columns are foldable, we can optimize it to:
> {code:java}
> INSERT INTO TABLE t1 PARTITION(p='a')
> SELECT c1, c2 FROM t2 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40387) Improve the implementation of Spark Decimal

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40387:


Assignee: (was: Apache Spark)

> Improve the implementation of Spark Decimal
> ---
>
> Key: SPARK-40387
> URL: https://issues.apache.org/jira/browse/SPARK-40387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark Decimal always use ne first, but eq is better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40387) Improve the implementation of Spark Decimal

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40387:


Assignee: Apache Spark

> Improve the implementation of Spark Decimal
> ---
>
> Key: SPARK-40387
> URL: https://issues.apache.org/jira/browse/SPARK-40387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Spark Decimal always use ne first, but eq is better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40387) Improve the implementation of Spark Decimal

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601672#comment-17601672
 ] 

Apache Spark commented on SPARK-40387:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37830

> Improve the implementation of Spark Decimal
> ---
>
> Key: SPARK-40387
> URL: https://issues.apache.org/jira/browse/SPARK-40387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark Decimal always use ne first, but eq is better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40387) Improve the implementation of Spark Decimal

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601671#comment-17601671
 ] 

Apache Spark commented on SPARK-40387:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37830

> Improve the implementation of Spark Decimal
> ---
>
> Key: SPARK-40387
> URL: https://issues.apache.org/jira/browse/SPARK-40387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark Decimal always use ne first, but eq is better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40387) Improve the implementation of Spark Decimal

2022-09-07 Thread jiaan.geng (Jira)

jiaan.geng created SPARK-40387:
--

 Summary: Improve the implementation of Spark Decimal
 Key: SPARK-40387
 URL: https://issues.apache.org/jira/browse/SPARK-40387
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


Spark Decimal always use ne first, but eq is better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40386) Implement `ddof` in `DataFrame.cov`

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40386:


Assignee: (was: Apache Spark)

> Implement `ddof` in `DataFrame.cov`
> ---
>
> Key: SPARK-40386
> URL: https://issues.apache.org/jira/browse/SPARK-40386
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40386) Implement `ddof` in `DataFrame.cov`

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601651#comment-17601651
 ] 

Apache Spark commented on SPARK-40386:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37829

> Implement `ddof` in `DataFrame.cov`
> ---
>
> Key: SPARK-40386
> URL: https://issues.apache.org/jira/browse/SPARK-40386
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40386) Implement `ddof` in `DataFrame.cov`

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40386:


Assignee: Apache Spark

> Implement `ddof` in `DataFrame.cov`
> ---
>
> Key: SPARK-40386
> URL: https://issues.apache.org/jira/browse/SPARK-40386
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40386) Implement `ddof` in `DataFrame.cov`

2022-09-07 Thread Ruifeng Zheng (Jira)

Ruifeng Zheng created SPARK-40386:
-

 Summary: Implement `ddof` in `DataFrame.cov`
 Key: SPARK-40386
 URL: https://issues.apache.org/jira/browse/SPARK-40386
 Project: Spark
  Issue Type: Sub-task
  Components: ps, SQL
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40385) Classes with companion object constructor fails interpreted path

2022-09-07 Thread Emil Ejbyfeldt (Jira)

Emil Ejbyfeldt created SPARK-40385:
--

 Summary: Classes with companion object constructor fails 
interpreted path
 Key: SPARK-40385
 URL: https://issues.apache.org/jira/browse/SPARK-40385
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.2, 3.3.0, 3.1.3
Reporter: Emil Ejbyfeldt


The Encoder implemented in SPARK-8288 for classes with only a companion object 
constructor fails when using the interpreted path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40384:


Assignee: (was: Apache Spark)

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601633#comment-17601633
 ] 

Apache Spark commented on SPARK-40384:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37828

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601632#comment-17601632
 ] 

Apache Spark commented on SPARK-40384:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37828

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40384:


Assignee: Apache Spark

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-07 Thread Yikun Jiang (Jira)

Yikun Jiang created SPARK-40384:
---

 Summary: Do base image real in time build only when infra 
dockerfile is changed
 Key: SPARK-40384
 URL: https://issues.apache.org/jira/browse/SPARK-40384
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40364) Unify `initDB` method in `DBProvider`

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601622#comment-17601622
 ] 

Apache Spark commented on SPARK-40364:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37826

> Unify `initDB` method in `DBProvider`
> -
>
> Key: SPARK-40364
> URL: https://issues.apache.org/jira/browse/SPARK-40364
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are 2 `initDB` in `DBProvider`
>  # DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, 
> ObjectMapper mapper)
>  # DB initDB(DBBackend dbBackend, File file)
> and the second one only used by test ShuffleTestAccessor, we can explore 
> whether can keep only the first method
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40364) Unify `initDB` method in `DBProvider`

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601620#comment-17601620
 ] 

Apache Spark commented on SPARK-40364:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37826

> Unify `initDB` method in `DBProvider`
> -
>
> Key: SPARK-40364
> URL: https://issues.apache.org/jira/browse/SPARK-40364
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are 2 `initDB` in `DBProvider`
>  # DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, 
> ObjectMapper mapper)
>  # DB initDB(DBBackend dbBackend, File file)
> and the second one only used by test ShuffleTestAccessor, we can explore 
> whether can keep only the first method
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40383) Pin mypy ==0.920 in dev/requirements.txt

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40383:


Assignee: (was: Apache Spark)

> Pin mypy ==0.920 in dev/requirements.txt
> 
>
> Key: SPARK-40383
> URL: https://issues.apache.org/jira/browse/SPARK-40383
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40383) Pin mypy ==0.920 in dev/requirements.txt

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40383:


Assignee: Apache Spark

> Pin mypy ==0.920 in dev/requirements.txt
> 
>
> Key: SPARK-40383
> URL: https://issues.apache.org/jira/browse/SPARK-40383
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40364) Unify `initDB` method in `DBProvider`

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40364:


Assignee: Apache Spark

> Unify `initDB` method in `DBProvider`
> -
>
> Key: SPARK-40364
> URL: https://issues.apache.org/jira/browse/SPARK-40364
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> There are 2 `initDB` in `DBProvider`
>  # DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, 
> ObjectMapper mapper)
>  # DB initDB(DBBackend dbBackend, File file)
> and the second one only used by test ShuffleTestAccessor, we can explore 
> whether can keep only the first method
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40383) Pin mypy ==0.920 in dev/requirements.txt

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601621#comment-17601621
 ] 

Apache Spark commented on SPARK-40383:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37827

> Pin mypy ==0.920 in dev/requirements.txt
> 
>
> Key: SPARK-40383
> URL: https://issues.apache.org/jira/browse/SPARK-40383
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40364) Unify `initDB` method in `DBProvider`

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40364:


Assignee: (was: Apache Spark)

> Unify `initDB` method in `DBProvider`
> -
>
> Key: SPARK-40364
> URL: https://issues.apache.org/jira/browse/SPARK-40364
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are 2 `initDB` in `DBProvider`
>  # DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, 
> ObjectMapper mapper)
>  # DB initDB(DBBackend dbBackend, File file)
> and the second one only used by test ShuffleTestAccessor, we can explore 
> whether can keep only the first method
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40383) Pin mypy ==0.920 in dev/requirements.txt

2022-09-07 Thread Ruifeng Zheng (Jira)

Ruifeng Zheng created SPARK-40383:
-

 Summary: Pin mypy ==0.920 in dev/requirements.txt
 Key: SPARK-40383
 URL: https://issues.apache.org/jira/browse/SPARK-40383
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-40186) mergedShuffleCleaner should have been shutdown before db closed

2022-09-07 Thread Mridul Muralidharan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-40186.
-
Target Version/s: 3.4.0
Assignee: Yang Jie
  Resolution: Fixed

> mergedShuffleCleaner should have been shutdown before db closed
> ---
>
> Key: SPARK-40186
> URL: https://issues.apache.org/jira/browse/SPARK-40186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> Should ensure `RemoteBlockPushResolver#mergedShuffleCleaner` have been 
> shutdown before `RemoteBlockPushResolver#db` closed, otherwise, 
> `RemoteBlockPushResolver#applicationRemoved` may perform delete operations on 
> a closed db.
>  
> https://github.com/apache/spark/pull/37610#discussion_r951185256
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-07 Thread Asif (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-40362:
-
Description: 
In the canonicalization code which is now in two stages, canonicalization 
involving Commutative operators is broken, if they are subexpressions of 
certain type of expressions which override precanonicalize, for example 
BinaryComparison 

Consider following expression:

a + b > 10

         GT

            |

a + b          10

The BinaryComparison operator in the precanonicalize, first precanonicalizes  
children & then   may swap operands based on left /right hashCode inequality..

lets say  Add(a + b) .hashCode is >  10.hashCode as a result GT is converted to 
LT

But If the same tree is created 

           GT

            |

 b + a      10

The hashCode of Add(b, a) is not same as Add(a, b) , thus it is possible that 
for this tree

 Add(b + a) .hashCode is <  10.hashCode  in which case GT remains as is.

Thus to similar trees result in different canonicalization , one having GT 
other having LT 

 

The problem occurs because  for commutative expressions the canonicalization 
normalizes the expression with consistent hashCode which is not the case with 
precanonicalize as the hashCode of commutative expression 's precanonicalize 
and post canonicalize are different.

 

 

The test 
{quote}test("bug X")
Unknown macro: \{     val tr1 = LocalRelation('c.int, 'b.string, 'a.int)    val 
y = tr1.where('a.attr + 'c.attr > 10).analyze    val fullCond = 
y.asInstanceOf[Filter].condition.clone()   val addExpr = (fullCond match 
Unknown macro}
).clone().asInstanceOf[Add]
val canonicalizedFullCond = fullCond.canonicalized

// swap the operands of add
val newAddExpr = Add(addExpr.right, addExpr.left)

// build a new condition which is same as the previous one, but with operands 
of //Add reversed 

val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized

assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{quote}
This test fails.

The fix which I propose is that for the commutative expressions, the 
precanonicalize should be overridden and  
Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
of at place of canonicalize. effectively for commutative operands ( add, or , 
multiply , and etc) canonicalize and precanonicalize should be same.

PR:

[https://github.com/apache/spark/pull/37824]

 

 

I am also trying a better fix, where by the idea is that for commutative 
expressions the murmur hashCode are caluculated using unorderedHash so that it 
is order  independent ( i.e symmetric).

The above approach works fine , but in case of Least & Greatest, the Product's 
element is  a Seq,  and that messes with consistency of hashCode.

  was:
In the canonicalization code which is now in two stages, canonicalization 
involving Commutative operators is broken, if they are subexpressions of 
certain type of expressions which override precanonicalize.

Consider following expression:

a + b > 10

This GreaterThan expression when canonicalized as a whole for first time,  will 
skip the call to Canonicalize.reorderCommutativeOperators for the Add 
expression as the GreaterThan's canonicalization used precanonicalize on 
children ( the Add expression).

 

so if create a new expression 

b + a > 10 and invoke canonicalize it, the canonicalized versions of these two 
expressions will not match.

The test 
{quote}test("bug X") {
    val tr1 = LocalRelation('c.int, 'b.string, 'a.int)

   val y = tr1.where('a.attr + 'c.attr > 10).analyze
   val fullCond = y.asInstanceOf[Filter].condition.clone()
  val addExpr = (fullCond match
Unknown macro: \{     case GreaterThan(x}
).clone().asInstanceOf[Add]
val canonicalizedFullCond = fullCond.canonicalized

// swap the operands of add
val newAddExpr = Add(addExpr.right, addExpr.left)

// build a new condition which is same as the previous one, but with operands 
of //Add reversed 

val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized

assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{quote}
This test fails.

The fix which I propose is that for the commutative expressions, the 
precanonicalize should be overridden and  
Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
of at place of canonicalize. effectively for commutative operands ( add, or , 
multiply , and etc) canonicalize and precanonicalize should be same.



PR:

https://github.com/apache/spark/pull/37824


> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>

[jira] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-07 Thread Asif (Jira)



[ https://issues.apache.org/jira/browse/SPARK-40362 ]


Asif deleted comment on SPARK-40362:
--

was (Author: ashahid7):
I have added locally test for 

&&, ||, &, | and *  and they all fail for the same reason..

in process of adding tests for xor, greatest & least and testing the fix for 
the same...

 

the fix is to make these expressions implement a trait

 

trait CommutativeExpresionCanonicalization {
    this: Expression =>
    override lazy val canonicalized: Expression = preCanonicalized
    override lazy val preCanonicalized: Expression = {
       val canonicalizedChildren = children.map(_.preCanonicalized)
        Canonicalize.reorderCommutativeOperators {
           withNewChildren(canonicalizedChildren)
       }
    }
}

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code which is now in two stages, canonicalization 
> involving Commutative operators is broken, if they are subexpressions of 
> certain type of expressions which override precanonicalize, for example 
> BinaryComparison 
> Consider following expression:
> a + b > 10
>          GT
>             |
> a + b          10
> The BinaryComparison operator in the precanonicalize, first precanonicalizes  
> children & then   may swap operands based on left /right hashCode inequality..
> lets say  Add(a + b) .hashCode is >  10.hashCode as a result GT is converted 
> to LT
> But If the same tree is created 
>            GT
>             |
>  b + a      10
> The hashCode of Add(b, a) is not same as Add(a, b) , thus it is possible that 
> for this tree
>  Add(b + a) .hashCode is <  10.hashCode  in which case GT remains as is.
> Thus to similar trees result in different canonicalization , one having GT 
> other having LT 
>  
> The problem occurs because  for commutative expressions the canonicalization 
> normalizes the expression with consistent hashCode which is not the case with 
> precanonicalize as the hashCode of commutative expression 's precanonicalize 
> and post canonicalize are different.
>  
>  
> The test 
> {quote}test("bug X")
> Unknown macro: \{     val tr1 = LocalRelation('c.int, 'b.string, 'a.int)    
> val y = tr1.where('a.attr + 'c.attr > 10).analyze    val fullCond = 
> y.asInstanceOf[Filter].condition.clone()   val addExpr = (fullCond match 
> Unknown macro}
> ).clone().asInstanceOf[Add]
> val canonicalizedFullCond = fullCond.canonicalized
> // swap the operands of add
> val newAddExpr = Add(addExpr.right, addExpr.left)
> // build a new condition which is same as the previous one, but with operands 
> of //Add reversed 
> val builtCondnCanonicalized = GreaterThan(newAddExpr, 
> Literal(10)).canonicalized
> assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for the commutative expressions, the 
> precanonicalize should be overridden and  
> Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
> of at place of canonicalize. effectively for commutative operands ( add, or , 
> multiply , and etc) canonicalize and precanonicalize should be same.
> PR:
> [https://github.com/apache/spark/pull/37824]
>  
>  
> I am also trying a better fix, where by the idea is that for commutative 
> expressions the murmur hashCode are caluculated using unorderedHash so that 
> it is order  independent ( i.e symmetric).
> The above approach works fine , but in case of Least & Greatest, the 
> Product's element is  a Seq,  and that messes with consistency of hashCode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-07 Thread Asif (Jira)



[ https://issues.apache.org/jira/browse/SPARK-40362 ]


Asif deleted comment on SPARK-40362:
--

was (Author: ashahid7):
will be generating a PR tomorrow..

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code which is now in two stages, canonicalization 
> involving Commutative operators is broken, if they are subexpressions of 
> certain type of expressions which override precanonicalize, for example 
> BinaryComparison 
> Consider following expression:
> a + b > 10
>          GT
>             |
> a + b          10
> The BinaryComparison operator in the precanonicalize, first precanonicalizes  
> children & then   may swap operands based on left /right hashCode inequality..
> lets say  Add(a + b) .hashCode is >  10.hashCode as a result GT is converted 
> to LT
> But If the same tree is created 
>            GT
>             |
>  b + a      10
> The hashCode of Add(b, a) is not same as Add(a, b) , thus it is possible that 
> for this tree
>  Add(b + a) .hashCode is <  10.hashCode  in which case GT remains as is.
> Thus to similar trees result in different canonicalization , one having GT 
> other having LT 
>  
> The problem occurs because  for commutative expressions the canonicalization 
> normalizes the expression with consistent hashCode which is not the case with 
> precanonicalize as the hashCode of commutative expression 's precanonicalize 
> and post canonicalize are different.
>  
>  
> The test 
> {quote}test("bug X")
> Unknown macro: \{     val tr1 = LocalRelation('c.int, 'b.string, 'a.int)    
> val y = tr1.where('a.attr + 'c.attr > 10).analyze    val fullCond = 
> y.asInstanceOf[Filter].condition.clone()   val addExpr = (fullCond match 
> Unknown macro}
> ).clone().asInstanceOf[Add]
> val canonicalizedFullCond = fullCond.canonicalized
> // swap the operands of add
> val newAddExpr = Add(addExpr.right, addExpr.left)
> // build a new condition which is same as the previous one, but with operands 
> of //Add reversed 
> val builtCondnCanonicalized = GreaterThan(newAddExpr, 
> Literal(10)).canonicalized
> assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for the commutative expressions, the 
> precanonicalize should be overridden and  
> Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
> of at place of canonicalize. effectively for commutative operands ( add, or , 
> multiply , and etc) canonicalize and precanonicalize should be same.
> PR:
> [https://github.com/apache/spark/pull/37824]
>  
>  
> I am also trying a better fix, where by the idea is that for commutative 
> expressions the murmur hashCode are caluculated using unorderedHash so that 
> it is order  independent ( i.e symmetric).
> The above approach works fine , but in case of Least & Greatest, the 
> Product's element is  a Seq,  and that messes with consistency of hashCode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.9.3

2022-09-07 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40365.
--
Resolution: Fixed

Issue resolved by pull request 37814
[https://github.com/apache/spark/pull/37814]

> Bump ANTLR runtime version from 4.8 to 4.9.3
> 
>
> Key: SPARK-40365
> URL: https://issues.apache.org/jira/browse/SPARK-40365
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.9.3

2022-09-07 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40365:


Assignee: BingKun Pan

> Bump ANTLR runtime version from 4.8 to 4.9.3
> 
>
> Key: SPARK-40365
> URL: https://issues.apache.org/jira/browse/SPARK-40365
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-40378) What is React Native.

2022-09-07 Thread Yuming Wang (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-40378.
-
Resolution: Invalid

> What is React Native.
> -
>
> Key: SPARK-40378
> URL: https://issues.apache.org/jira/browse/SPARK-40378
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.3
>Reporter: Nikhil Sharma
>Priority: Major
>
> React Native is an open-source framework for building mobile apps. It was 
> created by Facebook and is designed for cross-platform capability. It can be 
> tough to choose between an excellent user experience, a beautiful user 
> interface, and fast processing, but [React Native online 
> course|https://www.igmguru.com/digital-marketing-programming/react-native-training/]
>  makes that decision an easy one with powerful native development. Jordan 
> Walke found a way to generate UI elements from a javascript thread and 
> applied it to iOS to build the first native application.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40378) What is React Native.

2022-09-07 Thread Yuming Wang (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40378:

Target Version/s:   (was: 3.1.2)

> What is React Native.
> -
>
> Key: SPARK-40378
> URL: https://issues.apache.org/jira/browse/SPARK-40378
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.3
>Reporter: Nikhil Sharma
>Priority: Major
>
> React Native is an open-source framework for building mobile apps. It was 
> created by Facebook and is designed for cross-platform capability. It can be 
> tough to choose between an excellent user experience, a beautiful user 
> interface, and fast processing, but [React Native online 
> course|https://www.igmguru.com/digital-marketing-programming/react-native-training/]
>  makes that decision an easy one with powerful native development. Jordan 
> Walke found a way to generate UI elements from a javascript thread and 
> applied it to iOS to build the first native application.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40378) What is React Native.

2022-09-07 Thread Yuming Wang (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40378:

Fix Version/s: (was: 3.1.3)

> What is React Native.
> -
>
> Key: SPARK-40378
> URL: https://issues.apache.org/jira/browse/SPARK-40378
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.3
>Reporter: Nikhil Sharma
>Priority: Major
>
> React Native is an open-source framework for building mobile apps. It was 
> created by Facebook and is designed for cross-platform capability. It can be 
> tough to choose between an excellent user experience, a beautiful user 
> interface, and fast processing, but [React Native online 
> course|https://www.igmguru.com/digital-marketing-programming/react-native-training/]
>  makes that decision an easy one with powerful native development. Jordan 
> Walke found a way to generate UI elements from a javascript thread and 
> applied it to iOS to build the first native application.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40382) Reduce projections in Expand when multiple distinct aggregations have semantically equivalent children

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601554#comment-17601554
 ] 

Apache Spark commented on SPARK-40382:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/37825

> Reduce projections in Expand when multiple distinct aggregations have 
> semantically equivalent children
> --
>
> Key: SPARK-40382
> URL: https://issues.apache.org/jira/browse/SPARK-40382
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> In RewriteDistinctAggregates, when grouping aggregate expressions by function 
> children, we should treat children that are semantically equivalent as the 
> same.
> This proposed change potentially reduces the number of projections in the 
> Expand operator added to a plan. In some cases, it may eliminate the need for 
> an Expand operator.
> Example: In the following query, the Expand operator creates 3*n rows (where 
> n is the number of incoming rows) because it has a projection for function 
> children b + 1, 1 + b and c.
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 2, 3.0),
> (1, 3, 4.0),
> (2, 4, 2.5),
> (2, 3, 1.0)
> v1(a, b, c);
> select
>   a,
>   count(distinct b + 1),
>   avg(distinct 1 + b) filter (where c > 0),
>   sum(c)
> from
>   v1
> group by a;
> {noformat}
> The Expand operator has three projections (each producing a row for each 
> incoming row):
> {noformat}
> [a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for 
> regular aggregation)
> [a#87, (b#88 + 1), null, 1, null, null],  <== projection #2 (for 
> distinct aggregation of b + 1)
> [a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for 
> distinct aggregation of 1 + b)
> {noformat}
> In reality, the Expand only needs one projection for 1 + b and b + 1, because 
> they are semantically equivalent.
> With the proposed change, the Expand operator's projections look like this:
> {noformat}
> [a#67, null, 0, null, UnscaledValue(c#69)],  <== projection #1 (for regular 
> aggregations)
> [a#67, (b#68 + 1), 1, (c#69 > 0.0), null]],  <== projection #2 (for distinct 
> aggregation on b + 1 and 1 + b)
> {noformat}
> With one less projection, Expand produces n*2 rows instead of n*3 rows, but 
> still produces the correct result.
> In the case where all distinct aggregates have semantically equivalent 
> children, the Expand operator is not needed at all.
> Assume this benchmark:
> {noformat}
> runBenchmark("distinct aggregates") {
>   val N = 20 << 22
>   val benchmark = new Benchmark("distinct aggregates", N, output = output)
>   spark.range(N).selectExpr("id % 100 as k", "id % 10 as id1")
> .createOrReplaceTempView("test")
>   def f1(): Unit = spark.sql(
> """
>   select
> k,
> sum(distinct id1 + 1),
> count(distinct 1 + id1),
> avg(distinct 1 + ID1)
>   from
> test
>   group by k""").noop()
>   benchmark.addCase("all semantically equivalent", numIters = 2) { _ =>
> f1()
>   }
>   def f2(): Unit = spark.sql(
> """
>   select
> k,
> sum(distinct id1 + 1),
> count(distinct 1 + id1),
> avg(distinct 2 + ID1)
>   from
> test
>   group by k""").noop()
>   benchmark.addCase("some semantically equivalent", numIters = 2) { _ =>
> f2()
>   }
>   def f3(): Unit = spark.sql(
> """
>   select
> k,
> sum(distinct id1 + 1),
> count(distinct 3 + id1),
> avg(distinct 2 + ID1)
>   from
> test
>   group by k""").noop()
>   benchmark.addCase("none semantically equivalent", numIters = 2) { _ =>
> f3()
>   }
>   benchmark.run()
> }
>  {noformat}
> Before the change:
> {noformat}
> [info] distinct aggregates:  Best Time(ms)   Avg Time(ms) 
>   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> [info] 
> 
> [info] all semantically equivalent   14721  14859 
> 195  5.7 175.5   1.0X
> [info] some semantically equivalent  14569  14572 
>   5  5.8 173.7   1.0X
> [info] none semantically equivalent  14408  14488 
> 113  5.8 171.8   1.0X
> {noformat}
> After the propose

[jira] [Assigned] (SPARK-40382) Reduce projections in Expand when multiple distinct aggregations have semantically equivalent children

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40382:


Assignee: (was: Apache Spark)

> Reduce projections in Expand when multiple distinct aggregations have 
> semantically equivalent children
> --
>
> Key: SPARK-40382
> URL: https://issues.apache.org/jira/browse/SPARK-40382
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> In RewriteDistinctAggregates, when grouping aggregate expressions by function 
> children, we should treat children that are semantically equivalent as the 
> same.
> This proposed change potentially reduces the number of projections in the 
> Expand operator added to a plan. In some cases, it may eliminate the need for 
> an Expand operator.
> Example: In the following query, the Expand operator creates 3*n rows (where 
> n is the number of incoming rows) because it has a projection for function 
> children b + 1, 1 + b and c.
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 2, 3.0),
> (1, 3, 4.0),
> (2, 4, 2.5),
> (2, 3, 1.0)
> v1(a, b, c);
> select
>   a,
>   count(distinct b + 1),
>   avg(distinct 1 + b) filter (where c > 0),
>   sum(c)
> from
>   v1
> group by a;
> {noformat}
> The Expand operator has three projections (each producing a row for each 
> incoming row):
> {noformat}
> [a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for 
> regular aggregation)
> [a#87, (b#88 + 1), null, 1, null, null],  <== projection #2 (for 
> distinct aggregation of b + 1)
> [a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for 
> distinct aggregation of 1 + b)
> {noformat}
> In reality, the Expand only needs one projection for 1 + b and b + 1, because 
> they are semantically equivalent.
> With the proposed change, the Expand operator's projections look like this:
> {noformat}
> [a#67, null, 0, null, UnscaledValue(c#69)],  <== projection #1 (for regular 
> aggregations)
> [a#67, (b#68 + 1), 1, (c#69 > 0.0), null]],  <== projection #2 (for distinct 
> aggregation on b + 1 and 1 + b)
> {noformat}
> With one less projection, Expand produces n*2 rows instead of n*3 rows, but 
> still produces the correct result.
> In the case where all distinct aggregates have semantically equivalent 
> children, the Expand operator is not needed at all.
> Assume this benchmark:
> {noformat}
> runBenchmark("distinct aggregates") {
>   val N = 20 << 22
>   val benchmark = new Benchmark("distinct aggregates", N, output = output)
>   spark.range(N).selectExpr("id % 100 as k", "id % 10 as id1")
> .createOrReplaceTempView("test")
>   def f1(): Unit = spark.sql(
> """
>   select
> k,
> sum(distinct id1 + 1),
> count(distinct 1 + id1),
> avg(distinct 1 + ID1)
>   from
> test
>   group by k""").noop()
>   benchmark.addCase("all semantically equivalent", numIters = 2) { _ =>
> f1()
>   }
>   def f2(): Unit = spark.sql(
> """
>   select
> k,
> sum(distinct id1 + 1),
> count(distinct 1 + id1),
> avg(distinct 2 + ID1)
>   from
> test
>   group by k""").noop()
>   benchmark.addCase("some semantically equivalent", numIters = 2) { _ =>
> f2()
>   }
>   def f3(): Unit = spark.sql(
> """
>   select
> k,
> sum(distinct id1 + 1),
> count(distinct 3 + id1),
> avg(distinct 2 + ID1)
>   from
> test
>   group by k""").noop()
>   benchmark.addCase("none semantically equivalent", numIters = 2) { _ =>
> f3()
>   }
>   benchmark.run()
> }
>  {noformat}
> Before the change:
> {noformat}
> [info] distinct aggregates:  Best Time(ms)   Avg Time(ms) 
>   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> [info] 
> 
> [info] all semantically equivalent   14721  14859 
> 195  5.7 175.5   1.0X
> [info] some semantically equivalent  14569  14572 
>   5  5.8 173.7   1.0X
> [info] none semantically equivalent  14408  14488 
> 113  5.8 171.8   1.0X
> {noformat}
> After the proposed change:
> {noformat}
> [info] distinct aggregates:  Best Time(ms)   Avg Time(ms) 
>   Stdev(ms)

[jira] [Assigned] (SPARK-40382) Reduce projections in Expand when multiple distinct aggregations have semantically equivalent children

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40382:


Assignee: Apache Spark

> Reduce projections in Expand when multiple distinct aggregations have 
> semantically equivalent children
> --
>
> Key: SPARK-40382
> URL: https://issues.apache.org/jira/browse/SPARK-40382
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> In RewriteDistinctAggregates, when grouping aggregate expressions by function 
> children, we should treat children that are semantically equivalent as the 
> same.
> This proposed change potentially reduces the number of projections in the 
> Expand operator added to a plan. In some cases, it may eliminate the need for 
> an Expand operator.
> Example: In the following query, the Expand operator creates 3*n rows (where 
> n is the number of incoming rows) because it has a projection for function 
> children b + 1, 1 + b and c.
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 2, 3.0),
> (1, 3, 4.0),
> (2, 4, 2.5),
> (2, 3, 1.0)
> v1(a, b, c);
> select
>   a,
>   count(distinct b + 1),
>   avg(distinct 1 + b) filter (where c > 0),
>   sum(c)
> from
>   v1
> group by a;
> {noformat}
> The Expand operator has three projections (each producing a row for each 
> incoming row):
> {noformat}
> [a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for 
> regular aggregation)
> [a#87, (b#88 + 1), null, 1, null, null],  <== projection #2 (for 
> distinct aggregation of b + 1)
> [a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for 
> distinct aggregation of 1 + b)
> {noformat}
> In reality, the Expand only needs one projection for 1 + b and b + 1, because 
> they are semantically equivalent.
> With the proposed change, the Expand operator's projections look like this:
> {noformat}
> [a#67, null, 0, null, UnscaledValue(c#69)],  <== projection #1 (for regular 
> aggregations)
> [a#67, (b#68 + 1), 1, (c#69 > 0.0), null]],  <== projection #2 (for distinct 
> aggregation on b + 1 and 1 + b)
> {noformat}
> With one less projection, Expand produces n*2 rows instead of n*3 rows, but 
> still produces the correct result.
> In the case where all distinct aggregates have semantically equivalent 
> children, the Expand operator is not needed at all.
> Assume this benchmark:
> {noformat}
> runBenchmark("distinct aggregates") {
>   val N = 20 << 22
>   val benchmark = new Benchmark("distinct aggregates", N, output = output)
>   spark.range(N).selectExpr("id % 100 as k", "id % 10 as id1")
> .createOrReplaceTempView("test")
>   def f1(): Unit = spark.sql(
> """
>   select
> k,
> sum(distinct id1 + 1),
> count(distinct 1 + id1),
> avg(distinct 1 + ID1)
>   from
> test
>   group by k""").noop()
>   benchmark.addCase("all semantically equivalent", numIters = 2) { _ =>
> f1()
>   }
>   def f2(): Unit = spark.sql(
> """
>   select
> k,
> sum(distinct id1 + 1),
> count(distinct 1 + id1),
> avg(distinct 2 + ID1)
>   from
> test
>   group by k""").noop()
>   benchmark.addCase("some semantically equivalent", numIters = 2) { _ =>
> f2()
>   }
>   def f3(): Unit = spark.sql(
> """
>   select
> k,
> sum(distinct id1 + 1),
> count(distinct 3 + id1),
> avg(distinct 2 + ID1)
>   from
> test
>   group by k""").noop()
>   benchmark.addCase("none semantically equivalent", numIters = 2) { _ =>
> f3()
>   }
>   benchmark.run()
> }
>  {noformat}
> Before the change:
> {noformat}
> [info] distinct aggregates:  Best Time(ms)   Avg Time(ms) 
>   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> [info] 
> 
> [info] all semantically equivalent   14721  14859 
> 195  5.7 175.5   1.0X
> [info] some semantically equivalent  14569  14572 
>   5  5.8 173.7   1.0X
> [info] none semantically equivalent  14408  14488 
> 113  5.8 171.8   1.0X
> {noformat}
> After the proposed change:
> {noformat}
> [info] distinct aggregates:  Best Time(ms)   Avg

[jira] [Created] (SPARK-40382) Reduce projections in Expand when multiple distinct aggregations have semantically equivalent children

2022-09-07 Thread Bruce Robbins (Jira)

Bruce Robbins created SPARK-40382:
-

 Summary: Reduce projections in Expand when multiple distinct 
aggregations have semantically equivalent children
 Key: SPARK-40382
 URL: https://issues.apache.org/jira/browse/SPARK-40382
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Bruce Robbins


In RewriteDistinctAggregates, when grouping aggregate expressions by function 
children, we should treat children that are semantically equivalent as the same.

This proposed change potentially reduces the number of projections in the 
Expand operator added to a plan. In some cases, it may eliminate the need for 
an Expand operator.

Example: In the following query, the Expand operator creates 3*n rows (where n 
is the number of incoming rows) because it has a projection for function 
children b + 1, 1 + b and c.

{noformat}
create or replace temp view v1 as
select * from values
(1, 2, 3.0),
(1, 3, 4.0),
(2, 4, 2.5),
(2, 3, 1.0)
v1(a, b, c);

select
  a,
  count(distinct b + 1),
  avg(distinct 1 + b) filter (where c > 0),
  sum(c)
from
  v1
group by a;
{noformat}
The Expand operator has three projections (each producing a row for each 
incoming row):
{noformat}
[a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for 
regular aggregation)
[a#87, (b#88 + 1), null, 1, null, null],  <== projection #2 (for 
distinct aggregation of b + 1)
[a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for 
distinct aggregation of 1 + b)
{noformat}
In reality, the Expand only needs one projection for 1 + b and b + 1, because 
they are semantically equivalent.

With the proposed change, the Expand operator's projections look like this:
{noformat}
[a#67, null, 0, null, UnscaledValue(c#69)],  <== projection #1 (for regular 
aggregations)
[a#67, (b#68 + 1), 1, (c#69 > 0.0), null]],  <== projection #2 (for distinct 
aggregation on b + 1 and 1 + b)
{noformat}
With one less projection, Expand produces n*2 rows instead of n*3 rows, but 
still produces the correct result.

In the case where all distinct aggregates have semantically equivalent 
children, the Expand operator is not needed at all.

Assume this benchmark:
{noformat}
runBenchmark("distinct aggregates") {
  val N = 20 << 22

  val benchmark = new Benchmark("distinct aggregates", N, output = output)
  spark.range(N).selectExpr("id % 100 as k", "id % 10 as id1")
.createOrReplaceTempView("test")

  def f1(): Unit = spark.sql(
"""
  select
k,
sum(distinct id1 + 1),
count(distinct 1 + id1),
avg(distinct 1 + ID1)
  from
test
  group by k""").noop()

  benchmark.addCase("all semantically equivalent", numIters = 2) { _ =>
f1()
  }

  def f2(): Unit = spark.sql(
"""
  select
k,
sum(distinct id1 + 1),
count(distinct 1 + id1),
avg(distinct 2 + ID1)
  from
test
  group by k""").noop()

  benchmark.addCase("some semantically equivalent", numIters = 2) { _ =>
f2()
  }

  def f3(): Unit = spark.sql(
"""
  select
k,
sum(distinct id1 + 1),
count(distinct 3 + id1),
avg(distinct 2 + ID1)
  from
test
  group by k""").noop()

  benchmark.addCase("none semantically equivalent", numIters = 2) { _ =>
f3()
  }
  benchmark.run()
}
 {noformat}
Before the change:
{noformat}
[info] distinct aggregates:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
[info] 

[info] all semantically equivalent   14721  14859   
  195  5.7 175.5   1.0X
[info] some semantically equivalent  14569  14572   
5  5.8 173.7   1.0X
[info] none semantically equivalent  14408  14488   
  113  5.8 171.8   1.0X
{noformat}
After the proposed change:
{noformat}
[info] distinct aggregates:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
[info] 

[info] all semantically equivalent3658   3692   
   49 22.9  43.6   1.0X
[info] some semantically equivalent   9124   9214   
  127  9.2 108.8   0.4X
[info] none semantically equivalent  14601  14777   
  2

[jira] [Comment Edited] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-09-07 Thread Mars (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601543#comment-17601543
 ] 

Mars edited comment on SPARK-40320 at 9/7/22 10:50 PM:
---

[~Ngone51] 
Shouldn't it bring up a new `receiveLoop()` to serve RPC messages?

Yes, my previous thinking was wrong. I remote debug on Executor and I found 
that it did catch the fatal error in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89]
 .
It will resubmit receiveLoop and in the second time it will block by 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L69]
This Executor did not initialize successfully in the first time , so it didn't 
send LaunchedExecutor to Driver (you can see 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L172]
 )

So the Executor can't launch task, related PR 
[https://github.com/apache/spark/pull/25964] .

Why SparkUncaughtExceptionHandler doesn't catch the fatal error?
See 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L284]
 plugins is private variable, so it was broken when initialize Executor at the 
beginning.


was (Author: JIRAUSER290821):
[~Ngone51] 
Shouldn't it bring up a new `receiveLoop()` to serve RPC messages?

Yes, my previous thinking was wrong. I remote debug on Executor and I found 
that it did catch the fatal error in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89]
 .
It will resubmit receiveLoop and in the second time it will block by 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L69]
This Executor did not initialize successfully in the first time and didn't send 
LaunchedExecutor to Driver (you can see 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L172]
 )

So the Executor can't launch task, related PR 
[https://github.com/apache/spark/pull/25964] .

Why SparkUncaughtExceptionHandler doesn't catch the fatal error?
See 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L284]
 plugins is private variable, so it was broken when initialize Executor at the 
beginning.

> When the Executor plugin fails to initialize, the Executor shows active but 
> does not accept tasks forever, just like being hung
> ---
>
> Key: SPARK-40320
> URL: https://issues.apache.org/jira/browse/SPARK-40320
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Mars
>Priority: Major
>
> *Reproduce step:*
> set `spark.plugins=ErrorSparkPlugin`
> `ErrorSparkPlugin` && `ErrorExecutorPlugin` class as below (I abbreviate the 
> code to make it clearer):
> {code:java}
> class ErrorSparkPlugin extends SparkPlugin {
>   /**
>*/
>   override def driverPlugin(): DriverPlugin =  new ErrorDriverPlugin()
>   /**
>*/
>   override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
> }{code}
> {code:java}
> class ErrorExecutorPlugin extends ExecutorPlugin {
>   private val checkingInterval: Long = 1
>   override def init(_ctx: PluginContext, extraConf: util.Map[String, 
> String]): Unit = {
> if (checkingInterval == 1) {
>   throw new UnsatisfiedLinkError("My Exception error")
> }
>   }
> } {code}
> The Executor is active when we check in spark-ui, however it was broken and 
> doesn't receive any task.
> *Root Cause:*
> I check the code and I find in `org.apache.spark.rpc.netty.Inbox#safelyCall` 
> it will throw fatal error (`UnsatisfiedLinkError` is fatal erro ) in method 
> `dealWithFatalError` . Actually the  `CoarseGrainedExecutorBackend` JVM 
> process  is active but the  communication thread is no longer working ( 
> please see  `MessageLoop#receiveLoopRunnable` , `receiveLoop()` was broken, 
> so executor doesn't receive any message)
> Some ideas:
> I think it is very hard to know what happened here unless we check in the 
> code. The Executor is active but it can't do anything. We will wonder if the 
> driver is broken or the Executor problem.  I think at least the Executor 
> status shouldn't be active here or the Executor can exitExecutor (kill itself)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h

[jira] [Updated] (SPARK-40311) Introduce withColumnsRenamed

2022-09-07 Thread Santosh Pingale (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santosh Pingale updated SPARK-40311:

Docs Text: Add withColumnsRenamed to scala and pyspark API

> Introduce withColumnsRenamed
> 
>
> Key: SPARK-40311
> URL: https://issues.apache.org/jira/browse/SPARK-40311
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Santosh Pingale
>Priority: Minor
>
> Add a scala, pyspark, R dataframe API that can rename multiple columns in a 
> single command. Issues are faced when users iteratively perform 
> `withColumnRenamed`.
>  * When it works, we see slower performace
>  * In some cases, StackOverflowError is raised due to logical plan being too 
> big
>  * In a few cases, driver died due to memory consumption
> Some reproducible benchmarks:
> {code:java}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> b = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> c = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> d = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> e = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> f = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(seconds=12, microseconds=480021) {code}
> {code:java}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> b = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> c = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> d = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> e = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> f = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(microseconds=632116) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40087) Support multiple Column drop in R

2022-09-07 Thread Santosh Pingale (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santosh Pingale updated SPARK-40087:

Docs Text: Support for SparkR to drop multiple "Column"

> Support multiple Column drop in R
> -
>
> Key: SPARK-40087
> URL: https://issues.apache.org/jira/browse/SPARK-40087
> Project: Spark
>  Issue Type: New Feature
>  Components: R
>Affects Versions: 3.3.0
>Reporter: Santosh Pingale
>Assignee: Santosh Pingale
>Priority: Minor
> Fix For: 3.4.0
>
>
> This is a followup on SPARK-39895. The PR previously attempted to adjust 
> implementation for R as well to match signatures but that part was removed 
> and we only focused on getting python implementation to behave correctly.
> *{{Change supports following operations:}}*
> {{df <- select(read.json(jsonPath), "name", "age")}}
> {{df$age2 <- df$age}}
> {{df1 <- drop(df, df$age, df$name)}}
> {{expect_equal(columns(df1), c("age2"))}}
> {{df1 <- drop(df, df$age, column("random"))}}
> {{expect_equal(columns(df1), c("name", "age2"))}}
> {{df1 <- drop(df, df$age, df$name)}}
> {{expect_equal(columns(df1), c("age2"))}}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-39895) pyspark drop doesn't accept *cols

2022-09-07 Thread Santosh Pingale (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-39895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santosh Pingale updated SPARK-39895:

Docs Text: Support for PySpark to drop multiple "Column" 

> pyspark drop doesn't accept *cols 
> --
>
> Key: SPARK-39895
> URL: https://issues.apache.org/jira/browse/SPARK-39895
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.3, 3.3.0, 3.2.2
>Reporter: Santosh Pingale
>Assignee: Santosh Pingale
>Priority: Minor
> Fix For: 3.4.0
>
>
> Pyspark dataframe drop has following signature:
> {color:#4c9aff}{{def drop(self, *cols: "ColumnOrName") -> 
> "DataFrame":}}{color}
> However when we try to pass multiple Column types to drop function it raises 
> TypeError
> {{each col in the param list should be a string}}
> *Minimal reproducible example:*
> {color:#4c9aff}values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), 
> ("id_1", 3, 3), ("id_2", 4, 3)]{color}
> {color:#4c9aff}df = spark.createDataFrame(values, "id string, point int, 
> count int"){color}
> |– id: string (nullable = true)|
> |– point: integer (nullable = true)|
> |– count: integer (nullable = true)|
> {color:#4c9aff}{{df.drop(df.point, df.count)}}{color}
> {quote}{color:#505f79}/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py 
> in drop(self, *cols){color}
> {color:#505f79}2537 for col in cols:{color}
> {color:#505f79}2538 if not isinstance(col, str):{color}
> {color:#505f79}-> 2539 raise TypeError("each col in the param list should be 
> a string"){color}
> {color:#505f79}2540 jdf = self._jdf.drop(self._jseq(cols)){color}
> {color:#505f79}2541{color}
> {color:#505f79}TypeError: each col in the param list should be a string{color}
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-09-07 Thread Mars (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601543#comment-17601543
 ] 

Mars commented on SPARK-40320:
--

[~Ngone51] 
Shouldn't it bring up a new `receiveLoop()` to serve RPC messages?
Yes, my previous thinking was wrong. I remote debug on Executor and I found 
that it did catch the fatal error in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89]
 .
It will resubmit receiveLoop and block in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L69]
But this Executor did not initialize successfully and didn't send 
LaunchedExecutor to Driver (you can see 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L172]
 )

So the Executor can't launch task, related PR 
[https://github.com/apache/spark/pull/25964] .

Why SparkUncaughtExceptionHandler doesn't catch the fatal error?
See 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L284]
 plugins is private variable, so it was broken when initialize Executor at the 
beginning.

> When the Executor plugin fails to initialize, the Executor shows active but 
> does not accept tasks forever, just like being hung
> ---
>
> Key: SPARK-40320
> URL: https://issues.apache.org/jira/browse/SPARK-40320
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Mars
>Priority: Major
>
> *Reproduce step:*
> set `spark.plugins=ErrorSparkPlugin`
> `ErrorSparkPlugin` && `ErrorExecutorPlugin` class as below (I abbreviate the 
> code to make it clearer):
> {code:java}
> class ErrorSparkPlugin extends SparkPlugin {
>   /**
>*/
>   override def driverPlugin(): DriverPlugin =  new ErrorDriverPlugin()
>   /**
>*/
>   override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
> }{code}
> {code:java}
> class ErrorExecutorPlugin extends ExecutorPlugin {
>   private val checkingInterval: Long = 1
>   override def init(_ctx: PluginContext, extraConf: util.Map[String, 
> String]): Unit = {
> if (checkingInterval == 1) {
>   throw new UnsatisfiedLinkError("My Exception error")
> }
>   }
> } {code}
> The Executor is active when we check in spark-ui, however it was broken and 
> doesn't receive any task.
> *Root Cause:*
> I check the code and I find in `org.apache.spark.rpc.netty.Inbox#safelyCall` 
> it will throw fatal error (`UnsatisfiedLinkError` is fatal erro ) in method 
> `dealWithFatalError` . Actually the  `CoarseGrainedExecutorBackend` JVM 
> process  is active but the  communication thread is no longer working ( 
> please see  `MessageLoop#receiveLoopRunnable` , `receiveLoop()` was broken, 
> so executor doesn't receive any message)
> Some ideas:
> I think it is very hard to know what happened here unless we check in the 
> code. The Executor is active but it can't do anything. We will wonder if the 
> driver is broken or the Executor problem.  I think at least the Executor 
> status shouldn't be active here or the Executor can exitExecutor (kill itself)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-07 Thread Asif (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-40362:
-
Description: 
In the canonicalization code which is now in two stages, canonicalization 
involving Commutative operators is broken, if they are subexpressions of 
certain type of expressions which override precanonicalize.

Consider following expression:

a + b > 10

This GreaterThan expression when canonicalized as a whole for first time,  will 
skip the call to Canonicalize.reorderCommutativeOperators for the Add 
expression as the GreaterThan's canonicalization used precanonicalize on 
children ( the Add expression).

 

so if create a new expression 

b + a > 10 and invoke canonicalize it, the canonicalized versions of these two 
expressions will not match.

The test 
{quote}test("bug X") {
    val tr1 = LocalRelation('c.int, 'b.string, 'a.int)

   val y = tr1.where('a.attr + 'c.attr > 10).analyze
   val fullCond = y.asInstanceOf[Filter].condition.clone()
  val addExpr = (fullCond match
Unknown macro: \{     case GreaterThan(x}
).clone().asInstanceOf[Add]
val canonicalizedFullCond = fullCond.canonicalized

// swap the operands of add
val newAddExpr = Add(addExpr.right, addExpr.left)

// build a new condition which is same as the previous one, but with operands 
of //Add reversed 

val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized

assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{quote}
This test fails.

The fix which I propose is that for the commutative expressions, the 
precanonicalize should be overridden and  
Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
of at place of canonicalize. effectively for commutative operands ( add, or , 
multiply , and etc) canonicalize and precanonicalize should be same.



PR:

https://github.com/apache/spark/pull/37824

  was:
In the canonicalization code which is now in two stages, canonicalization 
involving Commutative operators is broken, if they are subexpressions of 
certain type of expressions which override precanonicalize.

Consider following expression:

a + b > 10

This GreaterThan expression when canonicalized as a whole for first time,  will 
skip the call to Canonicalize.reorderCommutativeOperators for the Add 
expression as the GreaterThan's canonicalization used precanonicalize on 
children ( the Add expression).

 

so if create a new expression 

b + a > 10 and invoke canonicalize it, the canonicalized versions of these two 
expressions will not match.

The test 
{quote}
test("bug X") {
    val tr1 = LocalRelation('c.int, 'b.string, 'a.int)

   val y = tr1.where('a.attr + 'c.attr > 10).analyze
   val fullCond = y.asInstanceOf[Filter].condition.clone()
  val addExpr = (fullCond match {
    case GreaterThan(x: Add, _) => x
    case LessThan(_, x: Add) => x
  }).clone().asInstanceOf[Add]
val canonicalizedFullCond = fullCond.canonicalized

// swap the operands of add
val newAddExpr = Add(addExpr.right, addExpr.left)

// build a new condition which is same as the previous one, but with operands 
of //Add reversed 

val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized

assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{quote}

This test fails.

The fix which I propose is that for the commutative expressions,  the 
precanonicalize should be overridden and   
Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
of at place of canonicalize.  effectively for commutative operands ( add, or , 
multiply , and etc)  canonicalize and precanonicalize should be same.
I will file a PR for it


> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code which is now in two stages, canonicalization 
> involving Commutative operators is broken, if they are subexpressions of 
> certain type of expressions which override precanonicalize.
> Consider following expression:
> a + b > 10
> This GreaterThan expression when canonicalized as a whole for first time,  
> will skip the call to Canonicalize.reorderCommutativeOperators for the Add 
> expression as the GreaterThan's canonicalization used precanonicalize on 
> children ( the Add expression).
>  
> so if create a new expression 
> b + a > 10 and invoke canonicalize it, the canonicalized versions of these 
> two expressions will not match.
> The test 
> {quote}test("bug X") {
>     val tr1

[jira] [Updated] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-07 Thread Asif (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-40362:
-
Shepherd: Wenchen Fan  (was: Tanel Kiis)

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code which is now in two stages, canonicalization 
> involving Commutative operators is broken, if they are subexpressions of 
> certain type of expressions which override precanonicalize.
> Consider following expression:
> a + b > 10
> This GreaterThan expression when canonicalized as a whole for first time,  
> will skip the call to Canonicalize.reorderCommutativeOperators for the Add 
> expression as the GreaterThan's canonicalization used precanonicalize on 
> children ( the Add expression).
>  
> so if create a new expression 
> b + a > 10 and invoke canonicalize it, the canonicalized versions of these 
> two expressions will not match.
> The test 
> {quote}
> test("bug X") {
>     val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>    val y = tr1.where('a.attr + 'c.attr > 10).analyze
>    val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     case GreaterThan(x: Add, _) => x
>     case LessThan(_, x: Add) => x
>   }).clone().asInstanceOf[Add]
> val canonicalizedFullCond = fullCond.canonicalized
> // swap the operands of add
> val newAddExpr = Add(addExpr.right, addExpr.left)
> // build a new condition which is same as the previous one, but with operands 
> of //Add reversed 
> val builtCondnCanonicalized = GreaterThan(newAddExpr, 
> Literal(10)).canonicalized
> assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for the commutative expressions,  the 
> precanonicalize should be overridden and   
> Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
> of at place of canonicalize.  effectively for commutative operands ( add, or 
> , multiply , and etc)  canonicalize and precanonicalize should be same.
> I will file a PR for it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-40291) Improve the message for column not in group by clause error

2022-09-07 Thread Max Gekk (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40291.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37742
[https://github.com/apache/spark/pull/37742]

> Improve the message for column not in group by clause error
> ---
>
> Key: SPARK-40291
> URL: https://issues.apache.org/jira/browse/SPARK-40291
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
> Fix For: 3.4.0
>
>
> Improve the message for column not in group by clause error to use the new 
> error framework



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40291) Improve the message for column not in group by clause error

2022-09-07 Thread Max Gekk (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40291:


Assignee: Linhong Liu

> Improve the message for column not in group by clause error
> ---
>
> Key: SPARK-40291
> URL: https://issues.apache.org/jira/browse/SPARK-40291
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
>
> Improve the message for column not in group by clause error to use the new 
> error framework



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-40293) Make the V2 table error message more meaningful

2022-09-07 Thread Max Gekk (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40293.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37746
[https://github.com/apache/spark/pull/37746]

> Make the V2 table error message more meaningful
> ---
>
> Key: SPARK-40293
> URL: https://issues.apache.org/jira/browse/SPARK-40293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.4.0
>
>
> When V2 catalog is not configured, Spark fails to access/create a table using 
> the V2 API and silently falls back to attempting to do the same operation 
> using the V1 Api. This happens frequently among the users. We want to have a 
> better error message so that users can fix the configuration/usage issue by 
> themselves.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40293) Make the V2 table error message more meaningful

2022-09-07 Thread Max Gekk (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40293:


Assignee: Huaxin Gao

> Make the V2 table error message more meaningful
> ---
>
> Key: SPARK-40293
> URL: https://issues.apache.org/jira/browse/SPARK-40293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> When V2 catalog is not configured, Spark fails to access/create a table using 
> the V2 API and silently falls back to attempting to do the same operation 
> using the V1 Api. This happens frequently among the users. We want to have a 
> better error message so that users can fix the configuration/usage issue by 
> themselves.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40380) Constant-folding of InvokeLike should not result in non-serializable result

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40380:


Assignee: (was: Apache Spark)

> Constant-folding of InvokeLike should not result in non-serializable result
> ---
>
> Key: SPARK-40380
> URL: https://issues.apache.org/jira/browse/SPARK-40380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kris Mok
>Priority: Major
>
> SPARK-37907 added constant-folding support to the {{InvokeLike}} family of 
> expressions. Unfortunately it introduced a regression for cases when a 
> constant-folded {{InvokeLike}} expression returned a non-serializable result. 
> {{ExpressionEncoder}}s is an area where this problem may be exposed, e.g. 
> when using sparksql-scalapb on Spark 3.3.0+.
> Below is a minimal repro to demonstrate this issue:
> {code:scala}
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
> import org.apache.spark.sql.catalyst.expressions.Literal
> import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, 
> StaticInvoke}
> import org.apache.spark.sql.types.{LongType, ObjectType}
> class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = 
> longVal + other }
> case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): 
> NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) }
> val litExpr = Literal.fromObject(SerializableBoxedLong(42L), 
> ObjectType(classOf[SerializableBoxedLong]))
> val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", 
> ObjectType(classOf[NotSerializableBoxedLong]))
> val addExpr = Invoke(toNotSerializableExpr, "add", LongType, 
> Seq(UnresolvedAttribute.quotedString("id")))
> val df = spark.range(2).select(new Column(addExpr))
> df.collect
> {code}
> Before SPARK-37907, this example would run fine and result in {{[[42], 
> [43]]}}. But after SPARK-37907, it'd fail with:
> {code:none}
> ...
> Caused by: java.io.NotSerializableException: NotSerializableBoxedLong
> Serialization stack:
>   - object not serializable (class: NotSerializableBoxedLong, value: 
> NotSerializableBoxedLong@71231636)
>   - element of array (index: 1)
>   - array (class [Ljava.lang.Object;, size 2)
>   - element of array (index: 1)
>   - array (class [Ljava.lang.Object;, size 3)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class 
> org.apache.spark.sql.execution.WholeStageCodegenExec, 
> functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=3])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389@185db22c)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40380) Constant-folding of InvokeLike should not result in non-serializable result

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40380:


Assignee: Apache Spark

> Constant-folding of InvokeLike should not result in non-serializable result
> ---
>
> Key: SPARK-40380
> URL: https://issues.apache.org/jira/browse/SPARK-40380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-37907 added constant-folding support to the {{InvokeLike}} family of 
> expressions. Unfortunately it introduced a regression for cases when a 
> constant-folded {{InvokeLike}} expression returned a non-serializable result. 
> {{ExpressionEncoder}}s is an area where this problem may be exposed, e.g. 
> when using sparksql-scalapb on Spark 3.3.0+.
> Below is a minimal repro to demonstrate this issue:
> {code:scala}
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
> import org.apache.spark.sql.catalyst.expressions.Literal
> import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, 
> StaticInvoke}
> import org.apache.spark.sql.types.{LongType, ObjectType}
> class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = 
> longVal + other }
> case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): 
> NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) }
> val litExpr = Literal.fromObject(SerializableBoxedLong(42L), 
> ObjectType(classOf[SerializableBoxedLong]))
> val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", 
> ObjectType(classOf[NotSerializableBoxedLong]))
> val addExpr = Invoke(toNotSerializableExpr, "add", LongType, 
> Seq(UnresolvedAttribute.quotedString("id")))
> val df = spark.range(2).select(new Column(addExpr))
> df.collect
> {code}
> Before SPARK-37907, this example would run fine and result in {{[[42], 
> [43]]}}. But after SPARK-37907, it'd fail with:
> {code:none}
> ...
> Caused by: java.io.NotSerializableException: NotSerializableBoxedLong
> Serialization stack:
>   - object not serializable (class: NotSerializableBoxedLong, value: 
> NotSerializableBoxedLong@71231636)
>   - element of array (index: 1)
>   - array (class [Ljava.lang.Object;, size 2)
>   - element of array (index: 1)
>   - array (class [Ljava.lang.Object;, size 3)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class 
> org.apache.spark.sql.execution.WholeStageCodegenExec, 
> functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=3])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389@185db22c)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40380) Constant-folding of InvokeLike should not result in non-serializable result

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601472#comment-17601472
 ] 

Apache Spark commented on SPARK-40380:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/37823

> Constant-folding of InvokeLike should not result in non-serializable result
> ---
>
> Key: SPARK-40380
> URL: https://issues.apache.org/jira/browse/SPARK-40380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kris Mok
>Priority: Major
>
> SPARK-37907 added constant-folding support to the {{InvokeLike}} family of 
> expressions. Unfortunately it introduced a regression for cases when a 
> constant-folded {{InvokeLike}} expression returned a non-serializable result. 
> {{ExpressionEncoder}}s is an area where this problem may be exposed, e.g. 
> when using sparksql-scalapb on Spark 3.3.0+.
> Below is a minimal repro to demonstrate this issue:
> {code:scala}
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
> import org.apache.spark.sql.catalyst.expressions.Literal
> import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, 
> StaticInvoke}
> import org.apache.spark.sql.types.{LongType, ObjectType}
> class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = 
> longVal + other }
> case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): 
> NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) }
> val litExpr = Literal.fromObject(SerializableBoxedLong(42L), 
> ObjectType(classOf[SerializableBoxedLong]))
> val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", 
> ObjectType(classOf[NotSerializableBoxedLong]))
> val addExpr = Invoke(toNotSerializableExpr, "add", LongType, 
> Seq(UnresolvedAttribute.quotedString("id")))
> val df = spark.range(2).select(new Column(addExpr))
> df.collect
> {code}
> Before SPARK-37907, this example would run fine and result in {{[[42], 
> [43]]}}. But after SPARK-37907, it'd fail with:
> {code:none}
> ...
> Caused by: java.io.NotSerializableException: NotSerializableBoxedLong
> Serialization stack:
>   - object not serializable (class: NotSerializableBoxedLong, value: 
> NotSerializableBoxedLong@71231636)
>   - element of array (index: 1)
>   - array (class [Ljava.lang.Object;, size 2)
>   - element of array (index: 1)
>   - array (class [Ljava.lang.Object;, size 3)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class 
> org.apache.spark.sql.execution.WholeStageCodegenExec, 
> functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=3])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389@185db22c)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40381) Support standalone worker recommission

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40381:


Assignee: (was: Apache Spark)

> Support standalone worker recommission
> --
>
> Key: SPARK-40381
> URL: https://issues.apache.org/jira/browse/SPARK-40381
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Currently, spark standalone only support kill workers. We may want to 
> recommission some workers. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40381) Support standalone worker recommission

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601468#comment-17601468
 ] 

Apache Spark commented on SPARK-40381:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/37822

> Support standalone worker recommission
> --
>
> Key: SPARK-40381
> URL: https://issues.apache.org/jira/browse/SPARK-40381
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Currently, spark standalone only support kill workers. We may want to 
> recommission some workers. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40381) Support standalone worker recommission

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601469#comment-17601469
 ] 

Apache Spark commented on SPARK-40381:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/37822

> Support standalone worker recommission
> --
>
> Key: SPARK-40381
> URL: https://issues.apache.org/jira/browse/SPARK-40381
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Currently, spark standalone only support kill workers. We may want to 
> recommission some workers. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40381) Support standalone worker recommission

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40381:


Assignee: Apache Spark

> Support standalone worker recommission
> --
>
> Key: SPARK-40381
> URL: https://issues.apache.org/jira/browse/SPARK-40381
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, spark standalone only support kill workers. We may want to 
> recommission some workers. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40381) Support standalone worker recommission

2022-09-07 Thread Zhongwei Zhu (Jira)

Zhongwei Zhu created SPARK-40381:


 Summary: Support standalone worker recommission
 Key: SPARK-40381
 URL: https://issues.apache.org/jira/browse/SPARK-40381
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


Currently, spark standalone only support kill workers. We may want to 
recommission some workers. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40380) Constant-folding of InvokeLike should not result in non-serializable result

2022-09-07 Thread Kris Mok (Jira)

Kris Mok created SPARK-40380:


 Summary: Constant-folding of InvokeLike should not result in 
non-serializable result
 Key: SPARK-40380
 URL: https://issues.apache.org/jira/browse/SPARK-40380
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kris Mok


SPARK-37907 added constant-folding support to the {{InvokeLike}} family of 
expressions. Unfortunately it introduced a regression for cases when a 
constant-folded {{InvokeLike}} expression returned a non-serializable result. 
{{ExpressionEncoder}}s is an area where this problem may be exposed, e.g. when 
using sparksql-scalapb on Spark 3.3.0+.

Below is a minimal repro to demonstrate this issue:
{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke}
import org.apache.spark.sql.types.{LongType, ObjectType}
class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = 
longVal + other }
case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): 
NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) }
val litExpr = Literal.fromObject(SerializableBoxedLong(42L), 
ObjectType(classOf[SerializableBoxedLong]))
val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", 
ObjectType(classOf[NotSerializableBoxedLong]))
val addExpr = Invoke(toNotSerializableExpr, "add", LongType, 
Seq(UnresolvedAttribute.quotedString("id")))
val df = spark.range(2).select(new Column(addExpr))
df.collect
{code}

Before SPARK-37907, this example would run fine and result in {{[[42], [43]]}}. 
But after SPARK-37907, it'd fail with:
{code:none}
...
Caused by: java.io.NotSerializableException: NotSerializableBoxedLong
Serialization stack:
- object not serializable (class: NotSerializableBoxedLong, value: 
NotSerializableBoxedLong@71231636)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 2)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, 
SerializedLambda[capturingClass=class 
org.apache.spark.sql.execution.WholeStageCodegenExec, 
functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
 implementation=invokeStatic 
org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
 
instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
 numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class 
org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, 
org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389@185db22c)
  at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
  at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49)
  at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
  at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40379:


Assignee: Apache Spark  (was: Holden Karau)

> Propagate decommission executor loss reason during onDisconnect in K8s
> --
>
> Key: SPARK-40379
> URL: https://issues.apache.org/jira/browse/SPARK-40379
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Minor
>
> Currently if an executor has been sent a decommission message and then it 
> disconnects from the scheduler we only disable the executor depending on the 
> K8s status events to drive the rest of the state transitions. However, the 
> K8s status events can become overwhelmed on large clusters so we should check 
> if an executor is in a decommissioning state when it is disconnected and use 
> that reason instead of waiting on the K8s status events so we have more 
> accurate logging information.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601386#comment-17601386
 ] 

Apache Spark commented on SPARK-40379:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37821

> Propagate decommission executor loss reason during onDisconnect in K8s
> --
>
> Key: SPARK-40379
> URL: https://issues.apache.org/jira/browse/SPARK-40379
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
>
> Currently if an executor has been sent a decommission message and then it 
> disconnects from the scheduler we only disable the executor depending on the 
> K8s status events to drive the rest of the state transitions. However, the 
> K8s status events can become overwhelmed on large clusters so we should check 
> if an executor is in a decommissioning state when it is disconnected and use 
> that reason instead of waiting on the K8s status events so we have more 
> accurate logging information.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40379:


Assignee: Holden Karau  (was: Apache Spark)

> Propagate decommission executor loss reason during onDisconnect in K8s
> --
>
> Key: SPARK-40379
> URL: https://issues.apache.org/jira/browse/SPARK-40379
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
>
> Currently if an executor has been sent a decommission message and then it 
> disconnects from the scheduler we only disable the executor depending on the 
> K8s status events to drive the rest of the state transitions. However, the 
> K8s status events can become overwhelmed on large clusters so we should check 
> if an executor is in a decommissioning state when it is disconnected and use 
> that reason instead of waiting on the K8s status events so we have more 
> accurate logging information.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601385#comment-17601385
 ] 

Apache Spark commented on SPARK-40379:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37821

> Propagate decommission executor loss reason during onDisconnect in K8s
> --
>
> Key: SPARK-40379
> URL: https://issues.apache.org/jira/browse/SPARK-40379
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
>
> Currently if an executor has been sent a decommission message and then it 
> disconnects from the scheduler we only disable the executor depending on the 
> K8s status events to drive the rest of the state transitions. However, the 
> K8s status events can become overwhelmed on large clusters so we should check 
> if an executor is in a decommissioning state when it is disconnected and use 
> that reason instead of waiting on the K8s status events so we have more 
> accurate logging information.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-40295) Allow v2 functions with literal args in write distribution and ordering

2022-09-07 Thread Chao Sun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-40295.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37749
[https://github.com/apache/spark/pull/37749]

> Allow v2 functions with literal args in write distribution and ordering
> ---
>
> Key: SPARK-40295
> URL: https://issues.apache.org/jira/browse/SPARK-40295
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark should allow v2 function with literal args in write distribution and 
> ordering.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40295) Allow v2 functions with literal args in write distribution and ordering

2022-09-07 Thread Chao Sun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-40295:


Assignee: Anton Okolnychyi

> Allow v2 functions with literal args in write distribution and ordering
> ---
>
> Key: SPARK-40295
> URL: https://issues.apache.org/jira/browse/SPARK-40295
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
>
> Spark should allow v2 function with literal args in write distribution and 
> ordering.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40378) What is React Native.

2022-09-07 Thread Nikhil Sharma (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil Sharma updated SPARK-40378:
--
Summary: What is React Native.  (was: React Native is an open source 
framework for building mobile apps. It was created by Facebook and is designed 
for cross-platform capability.)

> What is React Native.
> -
>
> Key: SPARK-40378
> URL: https://issues.apache.org/jira/browse/SPARK-40378
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.3
>Reporter: Nikhil Sharma
>Priority: Major
> Fix For: 3.1.3
>
>
> React Native is an open-source framework for building mobile apps. It was 
> created by Facebook and is designed for cross-platform capability. It can be 
> tough to choose between an excellent user experience, a beautiful user 
> interface, and fast processing, but [React Native online 
> course|https://www.igmguru.com/digital-marketing-programming/react-native-training/]
>  makes that decision an easy one with powerful native development. Jordan 
> Walke found a way to generate UI elements from a javascript thread and 
> applied it to iOS to build the first native application.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s

2022-09-07 Thread Holden Karau (Jira)

Holden Karau created SPARK-40379:


 Summary: Propagate decommission executor loss reason during 
onDisconnect in K8s
 Key: SPARK-40379
 URL: https://issues.apache.org/jira/browse/SPARK-40379
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Spark Core
Affects Versions: 3.4.0
Reporter: Holden Karau
Assignee: Holden Karau


Currently if an executor has been sent a decommission message and then it 
disconnects from the scheduler we only disable the executor depending on the 
K8s status events to drive the rest of the state transitions. However, the K8s 
status events can become overwhelmed on large clusters so we should check if an 
executor is in a decommissioning state when it is disconnected and use that 
reason instead of waiting on the K8s status events so we have more accurate 
logging information.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40378) React Native is an open source framework for building mobile apps. It was created by Facebook and is designed for cross-platform capability.

2022-09-07 Thread Nikhil Sharma (Jira)

Nikhil Sharma created SPARK-40378:
-

 Summary: React Native is an open source framework for building 
mobile apps. It was created by Facebook and is designed for cross-platform 
capability.
 Key: SPARK-40378
 URL: https://issues.apache.org/jira/browse/SPARK-40378
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 3.0.3
Reporter: Nikhil Sharma
 Fix For: 3.1.3


React Native is an open-source framework for building mobile apps. It was 
created by Facebook and is designed for cross-platform capability. It can be 
tough to choose between an excellent user experience, a beautiful user 
interface, and fast processing, but [React Native online 
course|https://www.igmguru.com/digital-marketing-programming/react-native-training/]
 makes that decision an easy one with powerful native development. Jordan Walke 
found a way to generate UI elements from a javascript thread and applied it to 
iOS to build the first native application.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-07 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-40149:

Fix Version/s: 3.2.3

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
>
> When star expansion is used on left side of a join, the result will include 
> joining key, while on the right side of join it doesn't. I would expect the 
> behaviour to be symmetric (either include on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---++-++-+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---++-++-+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-38961) Enhance to automatically generate the pandas API support list

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-38961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601358#comment-17601358
 ] 

Apache Spark commented on SPARK-38961:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37820

> Enhance to automatically generate the pandas API support list
> -
>
> Key: SPARK-38961
> URL: https://issues.apache.org/jira/browse/SPARK-38961
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Hyunwoo Park
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, the supported pandas API list is manually maintained, so it would 
> be better to make the list automatically generated to reduce the maintenance 
> cost.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-38961) Enhance to automatically generate the pandas API support list

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-38961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601359#comment-17601359
 ] 

Apache Spark commented on SPARK-38961:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37820

> Enhance to automatically generate the pandas API support list
> -
>
> Key: SPARK-38961
> URL: https://issues.apache.org/jira/browse/SPARK-38961
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Hyunwoo Park
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, the supported pandas API list is manually maintained, so it would 
> be better to make the list automatically generated to reduce the maintenance 
> cost.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40376) `np.bool` will be deprecated

2022-09-07 Thread Elhoussine Talab (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601311#comment-17601311
 ] 

Elhoussine Talab commented on SPARK-40376:
--

Thanks [~srowen] 

> `np.bool` will be deprecated
> 
>
> Key: SPARK-40376
> URL: https://issues.apache.org/jira/browse/SPARK-40376
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Elhoussine Talab
>Priority: Trivial
>
> Using `np.bool` generates this warning: 
> {quote}
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
> 3070E                     `np.bool` is a deprecated alias for the builtin 
> `bool`. To silence this warning, use `bool` by itself. Doing this will not 
> modify any behavior and is safe. If you specifically wanted the numpy scalar 
> type, use `np.bool_` here.
> 3071E                   Deprecated in NumPy 1.20; for more details and 
> guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
> {quote}
>  
> See Numpy's deprecation statement here: 
> [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40376) `np.bool` will be deprecated

2022-09-07 Thread Sean R. Owen (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40376:
-
Issue Type: Improvement  (was: Bug)
  Priority: Trivial  (was: Major)

This is not a bug

> `np.bool` will be deprecated
> 
>
> Key: SPARK-40376
> URL: https://issues.apache.org/jira/browse/SPARK-40376
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Elhoussine Talab
>Priority: Trivial
>
> Using `np.bool` generates this warning: 
> {quote}
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
> 3070E                     `np.bool` is a deprecated alias for the builtin 
> `bool`. To silence this warning, use `bool` by itself. Doing this will not 
> modify any behavior and is safe. If you specifically wanted the numpy scalar 
> type, use `np.bool_` here.
> 3071E                   Deprecated in NumPy 1.20; for more details and 
> guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
> {quote}
>  
> See Numpy's deprecation statement here: 
> [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40377) Allow customize maxBroadcastTableBytes and maxBroadcastRows

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40377:


Assignee: (was: Apache Spark)

> Allow customize maxBroadcastTableBytes and maxBroadcastRows
> ---
>
> Key: SPARK-40377
> URL: https://issues.apache.org/jira/browse/SPARK-40377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: LLiu
>Priority: Major
> Attachments: 截屏2022-09-07 20.40.06.png, 截屏2022-09-07 20.40.16.png
>
>
> Recently, we encountered some driver OOM problems. Some tables with large 
> data volume were compressed using Snappy and then broadcast join was 
> performed, but the actual data volume was too large, which resulted in driver 
> OOM.
> The values of maxBroadcastTableBytes and maxBroadcastRows are hardcoded, 8GB 
> and 51200 respectively. Maybe we can allow customization of these values, 
> configure smaller values according to different scenarios, and prohibit 
> broadcast joins for tables with large data volumes to avoid driver OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40377) Allow customize maxBroadcastTableBytes and maxBroadcastRows

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40377:


Assignee: Apache Spark

> Allow customize maxBroadcastTableBytes and maxBroadcastRows
> ---
>
> Key: SPARK-40377
> URL: https://issues.apache.org/jira/browse/SPARK-40377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: LLiu
>Assignee: Apache Spark
>Priority: Major
> Attachments: 截屏2022-09-07 20.40.06.png, 截屏2022-09-07 20.40.16.png
>
>
> Recently, we encountered some driver OOM problems. Some tables with large 
> data volume were compressed using Snappy and then broadcast join was 
> performed, but the actual data volume was too large, which resulted in driver 
> OOM.
> The values of maxBroadcastTableBytes and maxBroadcastRows are hardcoded, 8GB 
> and 51200 respectively. Maybe we can allow customization of these values, 
> configure smaller values according to different scenarios, and prohibit 
> broadcast joins for tables with large data volumes to avoid driver OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40377) Allow customize maxBroadcastTableBytes and maxBroadcastRows

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601297#comment-17601297
 ] 

Apache Spark commented on SPARK-40377:
--

User 'LLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/37819

> Allow customize maxBroadcastTableBytes and maxBroadcastRows
> ---
>
> Key: SPARK-40377
> URL: https://issues.apache.org/jira/browse/SPARK-40377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: LLiu
>Priority: Major
> Attachments: 截屏2022-09-07 20.40.06.png, 截屏2022-09-07 20.40.16.png
>
>
> Recently, we encountered some driver OOM problems. Some tables with large 
> data volume were compressed using Snappy and then broadcast join was 
> performed, but the actual data volume was too large, which resulted in driver 
> OOM.
> The values of maxBroadcastTableBytes and maxBroadcastRows are hardcoded, 8GB 
> and 51200 respectively. Maybe we can allow customization of these values, 
> configure smaller values according to different scenarios, and prohibit 
> broadcast joins for tables with large data volumes to avoid driver OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-09-07 Thread Jira



[ 
https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601293#comment-17601293
 ] 

Karl Håkan Nordgren commented on SPARK-38218:
-

[~hyukjin.kwon] : Could I ask for an update? The ticket is marked as "Resolved" 
but the issue persists here https://spark.apache.org/downloads.html

> Looks like the wrong package is available on the spark downloads page. The 
> name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
> -
>
> Key: SPARK-38218
> URL: https://issues.apache.org/jira/browse/SPARK-38218
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Mehul Batra
>Priority: Major
> Attachments: Screenshot_20220214-013156.jpg, 
> image-2022-02-16-12-26-32-871.png
>
>
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> Does the tgz have Hadoop 3.3 but it was written wrong or it is 3.2 Hadoop 
> version only?
> if yes is hadoop comes with the S3 magic commitor support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40377) Allow customize maxBroadcastTableBytes and maxBroadcastRows

2022-09-07 Thread LeeeeLiu (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LLiu updated SPARK-40377:
-
Attachment: 截屏2022-09-07 20.40.06.png
截屏2022-09-07 20.40.16.png

> Allow customize maxBroadcastTableBytes and maxBroadcastRows
> ---
>
> Key: SPARK-40377
> URL: https://issues.apache.org/jira/browse/SPARK-40377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: LLiu
>Priority: Major
> Attachments: 截屏2022-09-07 20.40.06.png, 截屏2022-09-07 20.40.16.png
>
>
> Recently, we encountered some driver OOM problems. Some tables with large 
> data volume were compressed using Snappy and then broadcast join was 
> performed, but the actual data volume was too large, which resulted in driver 
> OOM.
> The values of maxBroadcastTableBytes and maxBroadcastRows are hardcoded, 8GB 
> and 51200 respectively. Maybe we can allow customization of these values, 
> configure smaller values according to different scenarios, and prohibit 
> broadcast joins for tables with large data volumes to avoid driver OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40377) Allow customize maxBroadcastTableBytes and maxBroadcastRows

2022-09-07 Thread LeeeeLiu (Jira)

LLiu created SPARK-40377:


 Summary: Allow customize maxBroadcastTableBytes and 
maxBroadcastRows
 Key: SPARK-40377
 URL: https://issues.apache.org/jira/browse/SPARK-40377
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: LLiu


Recently, we encountered some driver OOM problems. Some tables with large data 
volume were compressed using Snappy and then broadcast join was performed, but 
the actual data volume was too large, which resulted in driver OOM.

The values of maxBroadcastTableBytes and maxBroadcastRows are hardcoded, 8GB 
and 51200 respectively. Maybe we can allow customization of these values, 
configure smaller values according to different scenarios, and prohibit 
broadcast joins for tables with large data volumes to avoid driver OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40185) Remove column suggestion when the candidate list is empty for unresolved column/attribute/map key

2022-09-07 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40185:
---

Assignee: Vitalii Li

> Remove column suggestion when the candidate list is empty for unresolved 
> column/attribute/map key
> -
>
> Key: SPARK-40185
> URL: https://issues.apache.org/jira/browse/SPARK-40185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Assignee: Vitalii Li
>Priority: Major
>
> For unresolved column, attribute or map key an error message might contain 
> suggestions from the list. However, when the list is empty the error message 
> looks incomplete:
> `[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot 
> be resolved. Did you mean one of the following? []`
> This issue is to make final suggestion to show only if suggestion list is non 
> empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-40185) Remove column suggestion when the candidate list is empty for unresolved column/attribute/map key

2022-09-07 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40185.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37621
[https://github.com/apache/spark/pull/37621]

> Remove column suggestion when the candidate list is empty for unresolved 
> column/attribute/map key
> -
>
> Key: SPARK-40185
> URL: https://issues.apache.org/jira/browse/SPARK-40185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Assignee: Vitalii Li
>Priority: Major
> Fix For: 3.4.0
>
>
> For unresolved column, attribute or map key an error message might contain 
> suggestions from the list. However, when the list is empty the error message 
> looks incomplete:
> `[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot 
> be resolved. Did you mean one of the following? []`
> This issue is to make final suggestion to show only if suggestion list is non 
> empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601258#comment-17601258
 ] 

Apache Spark commented on SPARK-40149:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/37818

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
> Fix For: 3.4.0, 3.3.1
>
>
> When star expansion is used on left side of a join, the result will include 
> joining key, while on the right side of join it doesn't. I would expect the 
> behaviour to be symmetric (either include on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---++-++-+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---++-++-+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601257#comment-17601257
 ] 

Apache Spark commented on SPARK-40149:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/37818

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
> Fix For: 3.4.0, 3.3.1
>
>
> When star expansion is used on left side of a join, the result will include 
> joining key, while on the right side of join it doesn't. I would expect the 
> behaviour to be symmetric (either include on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---++-++-+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---++-++-+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40376) `np.bool` will be deprecated

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40376:


Assignee: Apache Spark

> `np.bool` will be deprecated
> 
>
> Key: SPARK-40376
> URL: https://issues.apache.org/jira/browse/SPARK-40376
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Elhoussine Talab
>Assignee: Apache Spark
>Priority: Major
>
> Using `np.bool` generates this warning: 
> ```
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
> 3070E                     `np.bool` is a deprecated alias for the builtin 
> `bool`. To silence this warning, use `bool` by itself. Doing this will not 
> modify any behavior and is safe. If you specifically wanted the numpy scalar 
> type, use `np.bool_` here.
> 3071E                   Deprecated in NumPy 1.20; for more details and 
> guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
> ```
>  
> See Numpy's deprecation statement here: 
> https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40376) `np.bool` will be deprecated

2022-09-07 Thread Elhoussine Talab (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elhoussine Talab updated SPARK-40376:
-
Description: 
Using `np.bool` generates this warning: 
{quote}
UserWarning: toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the 
error below and can not continue. Note that 
'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on 
failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin 
`bool`. To silence this warning, use `bool` by itself. Doing this will not 
modify any behavior and is safe. If you specifically wanted the numpy scalar 
type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and 
guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]


{quote}
 

See Numpy's deprecation statement here: 
[https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]

  was:
Using `np.bool` generates this warning: 

```
UserWarning: toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the 
error below and can not continue. Note that 
'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on 
failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin 
`bool`. To silence this warning, use `bool` by itself. Doing this will not 
modify any behavior and is safe. If you specifically wanted the numpy scalar 
type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and 
guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

 

See Numpy's deprecation statement here: 
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations


> `np.bool` will be deprecated
> 
>
> Key: SPARK-40376
> URL: https://issues.apache.org/jira/browse/SPARK-40376
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Elhoussine Talab
>Priority: Major
>
> Using `np.bool` generates this warning: 
> {quote}
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
> 3070E                     `np.bool` is a deprecated alias for the builtin 
> `bool`. To silence this warning, use `bool` by itself. Doing this will not 
> modify any behavior and is safe. If you specifically wanted the numpy scalar 
> type, use `np.bool_` here.
> 3071E                   Deprecated in NumPy 1.20; for more details and 
> guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
> {quote}
>  
> See Numpy's deprecation statement here: 
> [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40376) `np.bool` will be deprecated

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601256#comment-17601256
 ] 

Apache Spark commented on SPARK-40376:
--

User 'ELHoussineT' has created a pull request for this issue:
https://github.com/apache/spark/pull/37817

> `np.bool` will be deprecated
> 
>
> Key: SPARK-40376
> URL: https://issues.apache.org/jira/browse/SPARK-40376
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Elhoussine Talab
>Priority: Major
>
> Using `np.bool` generates this warning: 
> {quote}
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
> 3070E                     `np.bool` is a deprecated alias for the builtin 
> `bool`. To silence this warning, use `bool` by itself. Doing this will not 
> modify any behavior and is safe. If you specifically wanted the numpy scalar 
> type, use `np.bool_` here.
> 3071E                   Deprecated in NumPy 1.20; for more details and 
> guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
> {quote}
>  
> See Numpy's deprecation statement here: 
> [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40376) `np.bool` will be deprecated

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40376:


Assignee: (was: Apache Spark)

> `np.bool` will be deprecated
> 
>
> Key: SPARK-40376
> URL: https://issues.apache.org/jira/browse/SPARK-40376
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Elhoussine Talab
>Priority: Major
>
> Using `np.bool` generates this warning: 
> ```
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
> 3070E                     `np.bool` is a deprecated alias for the builtin 
> `bool`. To silence this warning, use `bool` by itself. Doing this will not 
> modify any behavior and is safe. If you specifically wanted the numpy scalar 
> type, use `np.bool_` here.
> 3071E                   Deprecated in NumPy 1.20; for more details and 
> guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
> ```
>  
> See Numpy's deprecation statement here: 
> https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40376) `np.bool` will be deprecated

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601255#comment-17601255
 ] 

Apache Spark commented on SPARK-40376:
--

User 'ELHoussineT' has created a pull request for this issue:
https://github.com/apache/spark/pull/37817

> `np.bool` will be deprecated
> 
>
> Key: SPARK-40376
> URL: https://issues.apache.org/jira/browse/SPARK-40376
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Elhoussine Talab
>Priority: Major
>
> Using `np.bool` generates this warning: 
> ```
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
> 3070E                     `np.bool` is a deprecated alias for the builtin 
> `bool`. To silence this warning, use `bool` by itself. Doing this will not 
> modify any behavior and is safe. If you specifically wanted the numpy scalar 
> type, use `np.bool_` here.
> 3071E                   Deprecated in NumPy 1.20; for more details and 
> guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
> ```
>  
> See Numpy's deprecation statement here: 
> https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40376) `np.bool` will be deprecated

2022-09-07 Thread Elhoussine Talab (Jira)

Elhoussine Talab created SPARK-40376:


 Summary: `np.bool` will be deprecated
 Key: SPARK-40376
 URL: https://issues.apache.org/jira/browse/SPARK-40376
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.3.0
Reporter: Elhoussine Talab


Using `np.bool` generates this warning: 

```
UserWarning: toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the 
error below and can not continue. Note that 
'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on 
failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin 
`bool`. To silence this warning, use `bool` by itself. Doing this will not 
modify any behavior and is safe. If you specifically wanted the numpy scalar 
type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and 
guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

 

See Numpy's deprecation statement here: 
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-07 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40149.
-
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37758
[https://github.com/apache/spark/pull/37758]

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
> Fix For: 3.3.1, 3.4.0
>
>
> When star expansion is used on left side of a join, the result will include 
> joining key, while on the right side of join it doesn't. I would expect the 
> behaviour to be symmetric (either include on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---++-++-+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---++-++-+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601228#comment-17601228
 ] 

Apache Spark commented on SPARK-40332:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37816

> Implement `GroupBy.quantile`.
> -
>
> Key: SPARK-40332
> URL: https://issues.apache.org/jira/browse/SPARK-40332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `GroupBy.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40332:


Assignee: Apache Spark

> Implement `GroupBy.quantile`.
> -
>
> Key: SPARK-40332
> URL: https://issues.apache.org/jira/browse/SPARK-40332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> We should implement `GroupBy.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-07 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601226#comment-17601226
 ] 

Apache Spark commented on SPARK-40332:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37816

> Implement `GroupBy.quantile`.
> -
>
> Key: SPARK-40332
> URL: https://issues.apache.org/jira/browse/SPARK-40332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `GroupBy.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-07 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40332:


Assignee: (was: Apache Spark)

> Implement `GroupBy.quantile`.
> -
>
> Key: SPARK-40332
> URL: https://issues.apache.org/jira/browse/SPARK-40332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `GroupBy.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40298) shuffle data recovery on the reused PVCs no effect

2022-09-07 Thread todd (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601225#comment-17601225
 ] 

todd commented on SPARK-40298:
--

[~dongjoon] 

I use aws spot cluster to terminate the instance where the pod is located 
during the shuffle read phase of the spark stage. The executed task will throw 
FetchFailed and ExecutorLostFailure exceptions. 

I hope that by reusing the PVC, only the current failed task will be 
recalculated, not the previous stage.

We use this feature to avoid spark task recalculation when the aws spot cluster 
recycles machines, thereby saving computing costs.

> shuffle data recovery on the reused PVCs  no effect
> ---
>
> Key: SPARK-40298
> URL: https://issues.apache.org/jira/browse/SPARK-40298
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.2
>Reporter: todd
>Priority: Major
> Attachments: 1662002808396.jpg, 1662002822097.jpg
>
>
> I use spark3.2.2 to test the [ Support shuffle data recovery on the reused 
> PVCs (SPARK-35593) ] feature.I found that when shuffle read fails, data is 
> still read from source.
> It can be confirmed that the pvc has been multiplexed by other pods, and the 
> Index and data data information has been sent
> *This is my spark configuration information：*
> --conf spark.driver.memory=5G 
> --conf spark.executor.memory=15G 
> --conf spark.executor.cores=1
> --conf spark.executor.instances=50
> --conf spark.sql.shuffle.partitions=50
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
> --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
> --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
> --conf 
> spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
> --conf spark.kubernetes.executor.missingPodDetectDelta=10s
> --conf spark.kubernetes.executor.apiPollingInterval=10s
> --conf spark.shuffle.io.retryWait=60s
> --conf spark.shuffle.io.maxRetries=5
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40373) Implement `ps.show_versions`

2022-09-07 Thread Haejoon Lee (Jira)

Haejoon Lee created SPARK-40373:
---

 Summary: Implement `ps.show_versions`
 Key: SPARK-40373
 URL: https://issues.apache.org/jira/browse/SPARK-40373
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We want to have `ps.show_versions` to reach the pandas parity.

 

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.show_versions.html.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-40374) Migrate type check failures of type creators onto error classes

2022-09-07 Thread Max Gekk (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-40374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40374:
-
Description: 
Replace TypeCheckFailure by DataTypeMismatch in type checks in the complex type 
creator expressions:
1. CreateMap(3): 
https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L205-L214
2. CreateNamedStruct(3): 
https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L445-L457
3. UpdateFields(2): 
https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L670-L673

  was:
Replace TypeCheckFailure by DataTypeMismatch in type checks in collection 
expressions:
1. SortArray (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
2. ArrayContains (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
3. ArrayPosition (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
4. ElementAt (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
5. Concat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
6. Flatten (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
7. Sequence (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
8. ArrayRemove (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
9. ArrayDistinct (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642


> Migrate type check failures of type creators onto error classes
> ---
>
> Key: SPARK-40374
> URL: https://issues.apache.org/jira/browse/SPARK-40374
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the complex 
> type creator expressions:
> 1. CreateMap(3): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L205-L214
> 2. CreateNamedStruct(3): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L445-L457
> 3. UpdateFields(2): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L670-L673



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40375) Implement `spark.show_versions`

2022-09-07 Thread Haejoon Lee (Jira)

Haejoon Lee created SPARK-40375:
---

 Summary: Implement `spark.show_versions`
 Key: SPARK-40375
 URL: https://issues.apache.org/jira/browse/SPARK-40375
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We might want to have `spark.show_versions` to provide useful environment 
informations similar to 
[https://pandas.pydata.org/docs/reference/api/pandas.show_versions.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-40374) Migrate type check failures of type creators onto error classes

2022-09-07 Thread Max Gekk (Jira)

Max Gekk created SPARK-40374:


 Summary: Migrate type check failures of type creators onto error 
classes
 Key: SPARK-40374
 URL: https://issues.apache.org/jira/browse/SPARK-40374
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk


Replace TypeCheckFailure by DataTypeMismatch in type checks in collection 
expressions:
1. SortArray (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
2. ArrayContains (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
3. ArrayPosition (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
4. ElementAt (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
5. Concat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
6. Flatten (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
7. Sequence (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
8. ArrayRemove (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
9. ArrayDistinct (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

1 2 >

1 - 100 of 129 matches

Mail list logo