[jira] [Updated] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40493:
--
Fix Version/s: 3.3.2
   (was: 3.3.1)

> Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
> -
>
> Key: SPARK-40493
> URL: https://issues.apache.org/jira/browse/SPARK-40493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.3, 3.3.2
>
>
> Please see https://github.com/apache/spark/pull/30865#issuecomment-755285940 
> for more details.






[jira] [Assigned] (SPARK-40494) Optimize the performance of `keys.zipWithIndex.toMap` code pattern

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40494:


Assignee: (was: Apache Spark)

> Optimize the performance of `keys.zipWithIndex.toMap` code pattern 
> ---
>
> Key: SPARK-40494
> URL: https://issues.apache.org/jira/browse/SPARK-40494
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Similar to SPARK-40175, we can use a manual {{while}} loop to optimize the 
> performance of the `keys.zipWithIndex.toMap` code pattern in Spark.






[jira] [Assigned] (SPARK-40494) Optimize the performance of `keys.zipWithIndex.toMap` code pattern

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40494:


Assignee: Apache Spark

> Optimize the performance of `keys.zipWithIndex.toMap` code pattern 
> ---
>
> Key: SPARK-40494
> URL: https://issues.apache.org/jira/browse/SPARK-40494
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Similar to SPARK-40175, we can use a manual {{while}} loop to optimize the 
> performance of the `keys.zipWithIndex.toMap` code pattern in Spark.






[jira] [Commented] (SPARK-40494) Optimize the performance of `keys.zipWithIndex.toMap` code pattern

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606915#comment-17606915
 ] 

Apache Spark commented on SPARK-40494:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37940

> Optimize the performance of `keys.zipWithIndex.toMap` code pattern 
> ---
>
> Key: SPARK-40494
> URL: https://issues.apache.org/jira/browse/SPARK-40494
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Similar to SPARK-40175, we can use a manual {{while}} loop to optimize the 
> performance of the `keys.zipWithIndex.toMap` code pattern in Spark.






[jira] [Updated] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40493:

Description: Please see 
https://github.com/apache/spark/pull/30865#issuecomment-755285940 for more 
details.

> Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
> -
>
> Key: SPARK-40493
> URL: https://issues.apache.org/jira/browse/SPARK-40493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.3.1, 3.2.3
>
>
> Please see https://github.com/apache/spark/pull/30865#issuecomment-755285940 
> for more details.






[jira] [Assigned] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-40493:
---

Assignee: Yuming Wang

> Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
> -
>
> Key: SPARK-40493
> URL: https://issues.apache.org/jira/browse/SPARK-40493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.3.1, 3.2.3
>
>







[jira] [Commented] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606907#comment-17606907
 ] 

Yuming Wang commented on SPARK-40493:
-

Issue resolved by pull request 37729
https://github.com/apache/spark/pull/37729

> Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
> -
>
> Key: SPARK-40493
> URL: https://issues.apache.org/jira/browse/SPARK-40493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.3.1, 3.2.3
>
>







[jira] [Created] (SPARK-40494) Optimize the performance of `keys.zipWithIndex.toMap` code pattern

2022-09-19 Thread Yang Jie (Jira)
Yang Jie created SPARK-40494:


 Summary: Optimize the performance of `keys.zipWithIndex.toMap` 
code pattern 
 Key: SPARK-40494
 URL: https://issues.apache.org/jira/browse/SPARK-40494
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Spark Core, SQL
Affects Versions: 3.4.0
Reporter: Yang Jie


Similar to SPARK-40175, we can use a manual {{while}} loop to optimize the 
performance of the `keys.zipWithIndex.toMap` code pattern in Spark.
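
For illustration, a minimal sketch of the kind of rewrite being proposed (the 
helper name is hypothetical, not an existing Spark API): an explicit while loop 
feeding a map builder does the work in a single pass, without materializing the 
intermediate collection of (key, index) tuples that `zipWithIndex` creates.

{code:scala}
// Hypothetical helper illustrating the proposed pattern; not an actual Spark API.
def indexMap[K](keys: Array[K]): Map[K, Int] = {
  val builder = Map.newBuilder[K, Int]
  builder.sizeHint(keys.length)
  var i = 0
  while (i < keys.length) {
    builder += keys(i) -> i   // same result as keys.zipWithIndex.toMap
    i += 1
  }
  builder.result()
}

// Usage: replaces `val keyToIndex = keys.zipWithIndex.toMap`
// val keyToIndex = indexMap(keys)
{code}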






[jira] [Resolved] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-40493.
-
Resolution: Fixed

> Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
> -
>
> Key: SPARK-40493
> URL: https://issues.apache.org/jira/browse/SPARK-40493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.3.1, 3.2.3
>
>







[jira] [Updated] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40493:

Fix Version/s: 3.2.3

> Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
> -
>
> Key: SPARK-40493
> URL: https://issues.apache.org/jira/browse/SPARK-40493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.3.1, 3.2.3
>
>







[jira] [Updated] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40493:

Fix Version/s: 3.3.1

> Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
> -
>
> Key: SPARK-40493
> URL: https://issues.apache.org/jira/browse/SPARK-40493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.3.1
>
>







[jira] [Updated] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40493:

Affects Version/s: 3.2.2
   3.2.1
   3.2.0

> Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
> -
>
> Key: SPARK-40493
> URL: https://issues.apache.org/jira/browse/SPARK-40493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-40493:
---

 Summary: Revert "[SPARK-33861][SQL] Simplify conditional in 
predicate"
 Key: SPARK-40493
 URL: https://issues.apache.org/jira/browse/SPARK-40493
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yuming Wang









[jira] [Updated] (SPARK-38803) Set minio cpu to 250m (0.25) in K8s IT

2022-09-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38803:
--
Fix Version/s: 3.3.2

> Set minio cpu to 250m (0.25) in K8s IT
> --
>
> Key: SPARK-38803
> URL: https://issues.apache.org/jira/browse/SPARK-38803
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
>







[jira] [Updated] (SPARK-38802) Support spark.kubernetes.test.(driver|executor)RequestCores

2022-09-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38802:
--
Fix Version/s: 3.3.2

> Support spark.kubernetes.test.(driver|executor)RequestCores
> ---
>
> Key: SPARK-38802
> URL: https://issues.apache.org/jira/browse/SPARK-38802
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
>
> [https://github.com/apache/spark/pull/35830#pullrequestreview-929597027]
>  
> Support spark.kubernetes.test.(driver|executor)RequestCores to allow devs to 
> set a specific cpu request for the driver/executor.






[jira] [Commented] (SPARK-40492) Perform maintenance of StateStore instances when they become inactive

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606885#comment-17606885
 ] 

Apache Spark commented on SPARK-40492:
--

User 'chaoqin-li1123' has created a pull request for this issue:
https://github.com/apache/spark/pull/37935

> Perform maintenance of StateStore instances when they become inactive
> -
>
> Key: SPARK-40492
> URL: https://issues.apache.org/jira/browse/SPARK-40492
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Chaoqin Li
>Priority: Major
>
> Currently the maintenance of a StateStore is performed by a periodic task in 
> the management thread. If a streaming query becomes inactive before the next 
> maintenance task fires, its StateStore will be unloaded before cleanup.
> There are two cases in which a StateStore is unloaded:
>  # The StateStoreProvider is no longer active in the system, for example when 
> a query ends or the Spark context terminates.
>  # Another StateStoreProvider has become active in the system, for example 
> when a partition is reassigned.
> In case 1, we should perform one last maintenance before unloading the instance.






[jira] [Commented] (SPARK-40492) Perform maintenance of StateStore instances when they become inactive

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606884#comment-17606884
 ] 

Apache Spark commented on SPARK-40492:
--

User 'chaoqin-li1123' has created a pull request for this issue:
https://github.com/apache/spark/pull/37935

> Perform maintenance of StateStore instances when they become inactive
> -
>
> Key: SPARK-40492
> URL: https://issues.apache.org/jira/browse/SPARK-40492
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Chaoqin Li
>Priority: Major
>
> Currently the maintenance of a StateStore is performed by a periodic task in 
> the management thread. If a streaming query becomes inactive before the next 
> maintenance task fires, its StateStore will be unloaded before cleanup.
> There are two cases in which a StateStore is unloaded:
>  # The StateStoreProvider is no longer active in the system, for example when 
> a query ends or the Spark context terminates.
>  # Another StateStoreProvider has become active in the system, for example 
> when a partition is reassigned.
> In case 1, we should perform one last maintenance before unloading the instance.






[jira] [Assigned] (SPARK-40492) Perform maintenance of StateStore instances when they become inactive

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40492:


Assignee: Apache Spark

> Perform maintenance of StateStore instances when they become inactive
> -
>
> Key: SPARK-40492
> URL: https://issues.apache.org/jira/browse/SPARK-40492
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Chaoqin Li
>Assignee: Apache Spark
>Priority: Major
>
> Currently the maintenance of a StateStore is performed by a periodic task in 
> the management thread. If a streaming query becomes inactive before the next 
> maintenance task fires, its StateStore will be unloaded before cleanup.
> There are two cases in which a StateStore is unloaded:
>  # The StateStoreProvider is no longer active in the system, for example when 
> a query ends or the Spark context terminates.
>  # Another StateStoreProvider has become active in the system, for example 
> when a partition is reassigned.
> In case 1, we should perform one last maintenance before unloading the instance.






[jira] [Assigned] (SPARK-40492) Perform maintenance of StateStore instances when they become inactive

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40492:


Assignee: (was: Apache Spark)

> Perform maintenance of StateStore instances when they become inactive
> -
>
> Key: SPARK-40492
> URL: https://issues.apache.org/jira/browse/SPARK-40492
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Chaoqin Li
>Priority: Major
>
> Currently the maintenance of a StateStore is performed by a periodic task in 
> the management thread. If a streaming query becomes inactive before the next 
> maintenance task fires, its StateStore will be unloaded before cleanup.
> There are two cases in which a StateStore is unloaded:
>  # The StateStoreProvider is no longer active in the system, for example when 
> a query ends or the Spark context terminates.
>  # Another StateStoreProvider has become active in the system, for example 
> when a partition is reassigned.
> In case 1, we should perform one last maintenance before unloading the instance.






[jira] [Commented] (SPARK-40472) Improve pyspark.sql.function example experience

2022-09-19 Thread deshanxiao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606879#comment-17606879
 ] 

deshanxiao commented on SPARK-40472:


[~hyukjin.kwon] OK, thanks~ 

> Improve pyspark.sql.function example experience
> ---
>
> Key: SPARK-40472
> URL: https://issues.apache.org/jira/browse/SPARK-40472
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Priority: Minor
>
> There are many examples in pyspark.sql.function:
> {code:java}
>     Examples
>     
>     >>> df = spark.range(1)
>     >>> df.select(lit(5).alias('height'), df.id).show()
>     +--+---+
>     |height| id|
>     +--+---+
>     |     5|  0|
>     +--+---+ {code}
> We can add import statements so that the user can directly run it.






[jira] [Resolved] (SPARK-40472) Improve pyspark.sql.function example experience

2022-09-19 Thread deshanxiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deshanxiao resolved SPARK-40472.

Resolution: Fixed

> Improve pyspark.sql.function example experience
> ---
>
> Key: SPARK-40472
> URL: https://issues.apache.org/jira/browse/SPARK-40472
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Priority: Minor
>
> There are many examples in pyspark.sql.function:
> {code:java}
>     Examples
>     
>     >>> df = spark.range(1)
>     >>> df.select(lit(5).alias('height'), df.id).show()
>     +--+---+
>     |height| id|
>     +--+---+
>     |     5|  0|
>     +--+---+ {code}
> We can add import statements so that the user can directly run it.






[jira] [Created] (SPARK-40492) Perform maintenance of StateStore instances when they become inactive

2022-09-19 Thread Chaoqin Li (Jira)
Chaoqin Li created SPARK-40492:
--

 Summary: Perform maintenance of StateStore instances when they 
become inactive
 Key: SPARK-40492
 URL: https://issues.apache.org/jira/browse/SPARK-40492
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Chaoqin Li


Currently the maintenance of a StateStore is performed by a periodic task in 
the management thread. If a streaming query becomes inactive before the next 
maintenance task fires, its StateStore will be unloaded before cleanup.
There are two cases in which a StateStore is unloaded:
 # The StateStoreProvider is no longer active in the system, for example when a 
query ends or the Spark context terminates.
 # Another StateStoreProvider has become active in the system, for example when 
a partition is reassigned.

In case 1, we should perform one last maintenance before unloading the instance.
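
A minimal sketch of the intended behavior (the trait and method names below are 
assumptions standing in for the internal state store APIs, not the real ones): 
when a provider is unloaded because it is no longer active anywhere (case 1), 
run one final maintenance pass before closing it.

{code:scala}
import scala.collection.mutable

// Sketch only: stand-in types for the internal state store machinery.
trait Provider {
  def doMaintenance(): Unit   // e.g. clean up old state versions and delta files
  def close(): Unit
}

object UnloadSketch {
  private val loadedProviders = mutable.HashMap.empty[String, Provider]

  /** Unload a provider; in case 1 (query ended / SparkContext stopping) run one
   *  last maintenance pass so nothing is left uncleaned. In case 2 the provider
   *  that took over the partition keeps doing maintenance, so nothing extra is
   *  needed here. */
  def unload(id: String, activeElsewhere: Boolean): Unit = {
    loadedProviders.remove(id).foreach { provider =>
      if (!activeElsewhere) {
        provider.doMaintenance()
      }
      provider.close()
    }
  }
}
{code}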






[jira] [Updated] (SPARK-37275) Support ANSI intervals in PySpark

2022-09-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37275:
-
Fix Version/s: 3.3.0

> Support ANSI intervals in PySpark
> -
>
> Key: SPARK-37275
> URL: https://issues.apache.org/jira/browse/SPARK-37275
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: release-notes
> Fix For: 3.3.0
>
>
> This JIRA aims to implement ANSI interval types in PySpark:
> - 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DayTimeIntervalType.scala
> - 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/YearMonthIntervalType.scala






[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-19 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606875#comment-17606875
 ] 

Jungtaek Lim commented on SPARK-40489:
--

It would be nice if you could help Spark retain the SLF4J 1 dependency while 
also working with SLF4J 2. If you mean to propose a PR to achieve this (instead 
of bumping the version), it would be really appreciated!
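
For reference, one possible direction (an illustrative sketch, not the fix 
proposed in this ticket): instead of touching {{StaticLoggerBinder}}, ask SLF4J 
for its bound logger factory via {{LoggerFactory.getILoggerFactory}}, which is 
public API in both SLF4J 1.x and 2.x, and check whether it is the Log4j 2 
binding.

{code:scala}
import org.slf4j.LoggerFactory

object LoggingDetectionSketch {
  // Sketch of an SLF4J-1.x/2.x-compatible alternative to the isLog4j2() check
  // quoted in the description below; it avoids org.slf4j.impl.StaticLoggerBinder,
  // which was removed in SLF4J 2.
  def isLog4j2: Boolean = {
    val factoryClass = LoggerFactory.getILoggerFactory.getClass.getName
    // The log4j 2 binding currently uses org.apache.logging.slf4j.Log4jLoggerFactory.
    factoryClass == "org.apache.logging.slf4j.Log4jLoggerFactory"
  }
}
{code}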

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Major
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.log(Logging.scala:53)
> at org.apache.spark.internal.Logging.log$(Logging.scala:51)
> at org.apache.spark.SparkContext.log(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
> at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
> at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
> at org.apache.spark.SparkContext.(SparkContext.scala:195)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
> at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
> at scala.Option.getOrElse(Option.scala:201)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
> {noformat}
> This is because Spark is playing low-level tricks to find out if the logging 
> platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.
> {code}
>   private def isLog4j2(): Boolean = {
> // This distinguishes the log4j 1.2 binding, currently
> // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, 
> currently
> // org.apache.logging.slf4j.Log4jLoggerFactory
> val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
> "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass)
>   }
> {code}
> Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark 
> should not be using {{StaticLoggerBinder}} to do that detection. There are 
> many other approaches. (The code itself suggests one approach: 
> {{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to 
> see if the root logger actually is a {{Log4jLogger}}. There may be even 
> better approaches.)
> The other big problem is relying on the Log4J classes themselves. By relying 
> on those classes, you force me to bring in Log4J as a dependency, which 

[jira] [Updated] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-19 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-40489:
-
Priority: Major  (was: Critical)

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Major
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.log(Logging.scala:53)
> at org.apache.spark.internal.Logging.log$(Logging.scala:51)
> at org.apache.spark.SparkContext.log(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
> at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
> at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
> at org.apache.spark.SparkContext.(SparkContext.scala:195)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
> at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
> at scala.Option.getOrElse(Option.scala:201)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
> {noformat}
> This is because Spark is playing low-level tricks to find out if the logging 
> platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.
> {code}
>   private def isLog4j2(): Boolean = {
> // This distinguishes the log4j 1.2 binding, currently
> // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, 
> currently
> // org.apache.logging.slf4j.Log4jLoggerFactory
> val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
> "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass)
>   }
> {code}
> Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark 
> should not be using {{StaticLoggerBinder}} to do that detection. There are 
> many other approaches. (The code itself suggests one approach: 
> {{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to 
> see if the root logger actually is a {{Log4jLogger}}. There may be even 
> better approaches.)
> The other big problem is relying on the Log4J classes themselves. By relying 
> on those classes, you force me to bring in Log4J as a dependency, which in 
> the latest versions will register themselves with the service loader 
> mechanism, causing conflicting SLF4J implementations.
> It is paramount that you:
> * Remove all reliance on {{StaticLoggerBinder}}. If you 

[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-19 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606872#comment-17606872
 ] 

Jungtaek Lim commented on SPARK-40489:
--

https://www.slf4j.org/news.html

2022-08-20 - Release of SLF4J 2.0.0
2022-09-14 - Release of SLF4J 2.0.1

It sounds like the new major version was released only a month ago, and we 
don't know much about its stability yet. It doesn't seem urgent enough to set 
the priority to Critical. (I'm going to lower the priority.) Also, we cannot 
easily move on before making clear that there is NO breakage or behavioral 
change when a Spark upgrade migrates from SLF4J 1 to SLF4J 2. We wouldn't be 
happy with breaking or behavioral changes brought in by a dependency, hence the 
concern about a major version upgrade of a dependency.

The comment about log4j 1 is moot, as recent versions of Spark use log4j 2.

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Critical
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.log(Logging.scala:53)
> at org.apache.spark.internal.Logging.log$(Logging.scala:51)
> at org.apache.spark.SparkContext.log(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
> at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
> at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
> at org.apache.spark.SparkContext.(SparkContext.scala:195)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
> at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
> at scala.Option.getOrElse(Option.scala:201)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
> {noformat}
> This is because Spark is playing low-level tricks to find out if the logging 
> platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.
> {code}
>   private def isLog4j2(): Boolean = {
> // This distinguishes the log4j 1.2 binding, currently
> // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, 
> currently
> // org.apache.logging.slf4j.Log4jLoggerFactory
> val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
> "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass)
>   }
> {code}
> Whatever the wisdom of Spark's relying on Log4J-specific 

[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata

2022-09-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40460:
-
Fix Version/s: 3.3.2

> Streaming metrics is zero when select _metadata
> ---
>
> Key: SPARK-40460
> URL: https://issues.apache.org/jira/browse/SPARK-40460
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
>
> Streaming metrics report all zeros (`processedRowsPerSecond`, etc.) when 
> selecting the `_metadata` column, because the logical plan from the batch and 
> the actual planned logical plan are mismatched: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]






[jira] [Comment Edited] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-19 Thread Garret Wilson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606852#comment-17606852
 ] 

Garret Wilson edited comment on SPARK-40489 at 9/20/22 3:15 AM:


# Dropping explicit Log4J 1.x support is certainly one of the things that needs 
to be done immediately. Not only is it full of vulnerabilities, it [reached end 
of life|https://logging.apache.org/log4j/1.2/] over five years ago!
# The Log4J implementation dependencies should be removed from Spark as well. 
See my question [Correctly fixing multiple `StaticLoggerBinder` bindings in 
Spark|https://stackoverflow.com/q/73615263] on Stack Overflow (which few people 
seem to have given any thought or care about, given the zero responses I have 
received so far).
# And of course {{StaticLoggerBinder}} references should be abandoned.

All this should have been done years ago. I mention this to give it some sense 
of urgency, in light of what I will say next.

I hesitate to even mention the following, because it might lower the priority 
of the ticket, but for those who might be in a pickle, I just released 
{{io.clogr:clogr-slf4j1-adapter:0.8.2}} to Maven Central, which is an 
[adapter|https://github.com/globalmentor/clogr/tree/master/clogr-slf4j1-adapter]
 (a shim, really) that will keep Spark from breaking in the face of SLF4J 2.x. 
Just include it as a dependency and Spark will stop breaking. *But this is a 
stop-gap measure! Please fix this bug!* :)


was (Author: garretwilson):
# Dropping explicit Log4J 1.x support is certainly one of the things that needs 
to be done immediately. Not only is it full of vulnerabilities, it [reached end 
of life|https://logging.apache.org/log4j/1.2/] over five years ago!
# The Log4J implementation dependencies should be removed from Spark as well. 
See my question [Correctly fixing multiple `StaticLoggerBinder` bindings in 
Spark|https://stackoverflow.com/q/73615263] on Stack Overflow (which few people 
seem to have given any thought or care about, given the zero responses I have 
received so far).
# And of course {{StaticLoggerBinder}} references should be abandoned.

All this should have been done years ago. I mention this to give it some sense 
of urgency, in light of what I will say next.

I hesitate to even mention the following, because it might lower the priority 
of the ticket, but for those who might be in a pickle, I just released 
{{io.clogr:clogr-slf4j1-adapter:0.8.2}} to Maven Central, which is an adapter 
(a shim, really) that will keep Spark from breaking in the face of SLF4J 2.x. 
Just include it as a dependency and Spark will stop breaking. *But this is a 
stop-gap measure! Please fix this bug!* :)

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Critical
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 

[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-19 Thread Garret Wilson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606852#comment-17606852
 ] 

Garret Wilson commented on SPARK-40489:
---

# Dropping explicit Log4J 1.x support is certainly one of the things that needs 
to be done immediately. Not only is it full of vulnerabilities, it [reached end 
of life|https://logging.apache.org/log4j/1.2/] over five years ago!
# The Log4J implementation dependencies should be removed from Spark as well. 
See my question [Correctly fixing multiple `StaticLoggerBinder` bindings in 
Spark|https://stackoverflow.com/q/73615263] on Stack Overflow (which few people 
seem to have given any thought or care about, given the zero responses I have 
received so far).
# And of course {{StaticLoggerBinder}} references should be abandoned.

All this should have been done years ago. I mention this to give it some sense 
of urgency, in light of what I will say next.

I hesitate to even mention the following, because it might lower the priority 
of the ticket, but for those who might be in a pickle, I just released 
{{io.clogr:clogr-slf4j1-adapter:0.8.2}} to Maven Central, which is an adapter 
(a shim, really) that will keep Spark from breaking in the face of SLF4J 2.x. 
Just include it as a dependency and Spark will stop breaking. *But this is a 
stop-gap measure! Please fix this bug!* :)

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Critical
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.log(Logging.scala:53)
> at org.apache.spark.internal.Logging.log$(Logging.scala:51)
> at org.apache.spark.SparkContext.log(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
> at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
> at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
> at org.apache.spark.SparkContext.(SparkContext.scala:195)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
> at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
> at scala.Option.getOrElse(Option.scala:201)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
> {noformat}
> This is because Spark is playing low-level tricks to find out if the logging 
> platform is Log4J, and relying on 

[jira] [Commented] (SPARK-40491) Expose a jdbcRDD function in SparkContext

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606850#comment-17606850
 ] 

Apache Spark commented on SPARK-40491:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37937

> Expose a jdbcRDD function in SparkContext
> -
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.






[jira] [Commented] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606851#comment-17606851
 ] 

Apache Spark commented on SPARK-40490:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37938

> `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile`  reload 
> after  SPARK-17321
> 
>
> Key: SPARK-40490
> URL: https://issues.apache.org/jira/browse/SPARK-40490
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> After SPARK-17321, YarnShuffleService persists data to the local shuffle 
> state db and reloads data from that db only when the Yarn NodeManager starts 
> with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. However, 
> `YarnShuffleIntegrationSuite` does not set this config, and its default value 
> is false, so the suite neither triggers data persistence to the db nor 
> verifies the reload of that data.
>  
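
For context, a hypothetical sketch of how a test could opt in to NodeManager 
recovery (the object and method are made up for illustration; only the 
YarnConfiguration constants are from the YARN API):

{code:scala}
import org.apache.hadoop.yarn.conf.YarnConfiguration

object YarnRecoveryConfSketch {
  // Enable NodeManager recovery so that YarnShuffleService persists registered
  // executors to the local shuffle state db and reloads them on restart.
  def recoveryEnabledConf(recoveryDir: String): YarnConfiguration = {
    val conf = new YarnConfiguration()
    conf.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true)
    conf.set(YarnConfiguration.NM_RECOVERY_DIR, recoveryDir) // required when recovery is on
    conf
  }
}
{code}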






[jira] [Commented] (SPARK-40491) Expose a jdbcRDD function in SparkContext

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606848#comment-17606848
 ] 

Apache Spark commented on SPARK-40491:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37937

> Expose a jdbcRDD function in SparkContext
> -
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.






[jira] [Assigned] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40490:


Assignee: Apache Spark

> `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile`  reload 
> after  SPARK-17321
> 
>
> Key: SPARK-40490
> URL: https://issues.apache.org/jira/browse/SPARK-40490
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-17321, YarnShuffleService persists data to the local shuffle 
> state db and reloads data from that db only when the Yarn NodeManager starts 
> with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. However, 
> `YarnShuffleIntegrationSuite` does not set this config, and its default value 
> is false, so the suite neither triggers data persistence to the db nor 
> verifies the reload of that data.
>  






[jira] [Assigned] (SPARK-40491) Expose a jdbcRDD function in SparkContext

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40491:


Assignee: Apache Spark

> Expose a jdbcRDD function in SparkContext
> -
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.






[jira] [Assigned] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40490:


Assignee: (was: Apache Spark)

> `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile`  reload 
> after  SPARK-17321
> 
>
> Key: SPARK-40490
> URL: https://issues.apache.org/jira/browse/SPARK-40490
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> After SPARK-17321, YarnShuffleService persists data to the local shuffle 
> state db and reloads data from that db only when the Yarn NodeManager starts 
> with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. However, 
> `YarnShuffleIntegrationSuite` does not set this config, and its default value 
> is false, so the suite neither triggers data persistence to the db nor 
> verifies the reload of that data.
>  






[jira] [Commented] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606849#comment-17606849
 ] 

Apache Spark commented on SPARK-40490:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37938

> `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile`  reload 
> after  SPARK-17321
> 
>
> Key: SPARK-40490
> URL: https://issues.apache.org/jira/browse/SPARK-40490
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> After SPARK-17321, YarnShuffleService persists data to the local shuffle 
> state db and reloads data from that db only when the Yarn NodeManager starts 
> with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. However, 
> `YarnShuffleIntegrationSuite` does not set this config, and its default value 
> is false, so the suite neither triggers data persistence to the db nor 
> verifies the reload of that data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40491) Expose a jdbcRDD function in SparkContext

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40491:


Assignee: (was: Apache Spark)

> Expose a jdbcRDD function in SparkContext
> -
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40491) Expose a jdbcRDD function in SparkContext

2022-09-19 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-40491:
---
Description: According to the legacy document of JdbcRDD, we need to expose 
a jdbcRDD function in SparkContext.  (was: According the legacy document of 
JdbcRDD, we need to expose a jdbcRDD function in SparkContext.)

> Expose a jdbcRDD function in SparkContext
> -
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40491) Expose a jdbcRDD function in SparkContext

2022-09-19 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-40491:
--

 Summary: Expose a jdbcRDD function in SparkContext
 Key: SPARK-40491
 URL: https://issues.apache.org/jira/browse/SPARK-40491
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


According to the legacy document of JdbcRDD, we need to expose a jdbcRDD function 
in SparkContext.
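
For context, a minimal sketch (not the proposed API) of what users currently do
with the legacy {{org.apache.spark.rdd.JdbcRDD}} class directly; `sc` is an
existing SparkContext and the JDBC URL, table, and bounds are placeholders. The
proposal would expose an equivalent convenience function on SparkContext itself.

{code:scala}
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Illustrative only: read a table in 3 partitions. The SQL must contain two
// '?' placeholders, which JdbcRDD fills with the per-partition bounds.
val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:h2:mem:testdb"),  // placeholder URL
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
  lowerBound = 1L, upperBound = 1000L, numPartitions = 3,
  mapRow = (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))
rdd.take(5).foreach(println)
{code}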



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321

2022-09-19 Thread Yang Jie (Jira)
Yang Jie created SPARK-40490:


 Summary: `YarnShuffleIntegrationSuite` no longer verifies 
`registeredExecFile`  reload after  SPARK-17321
 Key: SPARK-40490
 URL: https://issues.apache.org/jira/browse/SPARK-40490
 Project: Spark
  Issue Type: Improvement
  Components: Tests, YARN
Affects Versions: 3.4.0
Reporter: Yang Jie


After SPARK-17321, YarnShuffleService persists data to the local shuffle state
db and reloads data from it only when the Yarn NodeManager starts with
`YarnConfiguration#NM_RECOVERY_ENABLED = true`. However,
`YarnShuffleIntegrationSuite` does not set this config, and its default value is
false, so `YarnShuffleIntegrationSuite` neither triggers data persistence to the
db nor verifies the reload of data.
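
For illustration, a minimal sketch of the YARN configuration the suite would
need in order to exercise the recovery path; the constants come from Hadoop's
YarnConfiguration, but how the suite would wire this configuration in is only
assumed here.

{code:scala}
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Illustrative only: with recovery enabled, YarnShuffleService persists
// registered executors to the local shuffle state db and reloads them when
// the NodeManager restarts.
val yarnConf = new YarnConfiguration()
yarnConf.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true)
// A recovery directory is required once recovery is enabled.
yarnConf.set(YarnConfiguration.NM_RECOVERY_DIR, "/tmp/yarn-nm-recovery")
{code}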

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2022-09-19 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606817#comment-17606817
 ] 

Asif commented on SPARK-33152:
--

Added a test *CompareNewAndOldConstraintsSuite* in the PR which, when run on 
master, will highlight functionality issues with master as well as the perf issue.

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.1, 3.1.2
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing a new algorithm to create, store, and use constraints for removing 
> redundant filters and inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases (with certain use cases the compilation time can go into 
> hours), has the potential to cause OOM, may miss removing redundant filters in 
> different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, and does not push compound predicates in joins.
>  # If not fixed, this issue can cause OutOfMemory errors or unacceptable query 
> compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
>  # It is more effective in filter pruning as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite 
> where current code is not able to identify the redundant filter in some cases.
>  # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
>  # The current logic can miss a lot of possible cases of removing redundant 
> predicates, as it fails to take into account if same attribute or its aliases 
> are repeated multiple times in a complex expression.
>  # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove them on the basis of constraint data. In 
> some cases the rule works only by virtue of previous rules covering the 
> inaccuracy. That the ConstraintPropagation rule, and its function of removing 
> redundant filters and adding new inferred filters, depends on how other 
> unrelated earlier optimizer rules behave is indicative of issues.
>  # It does away with all the EqualNullSafe constraints as this logic does not 
> need those constraints to be created.
>  # There is at least one test in the existing ConstraintPropagationSuite which 
> is missing an IsNotNull constraint because the code incorrectly generated an 
> EqualsNullSafeConstraint instead of an EqualTo constraint when using the 
> existing Constraints code. With these changes, the test correctly creates an 
> EqualTo constraint, resulting in an inferred IsNotNull constraint.
>  # It does away with the current combinatorial logic of evaluating all the 
> constraints, which can cause compilation to run into hours or cause OOM. The 
> number of constraints stored is exactly the same as the number of filters 
> encountered.
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile time performance, but in some cases it can benefit 
> run time characteristics too, such as inferring an IsNotNull filter or pushing 
> down compound predicates on a join, which the present code may miss or not do 
> at all, respectively.
> h2. Q3. How is it done today, and what are the limits of current practice?
> The current ConstraintsPropagation code pessimistically tries to generate all 
> the possible combinations of constraints based on the aliases (even then it 
> may miss a lot of combinations if the expression is a complex one involving 
> the same attribute repeated multiple times and there are many aliases to that 
> column). There are query plans in our production env where the intermediate 
> number of constraints goes into hundreds of thousands, causing OOM or taking 
> time running into hours. Also, there are cases where it incorrectly generates 
> an EqualNullSafe constraint instead of an EqualTo constraint, thus missing a 
> possible IsNotNull constraint on the column. 
> Also, it only pushes single-column predicates to the other side of the join.
> The constraints generated , in 
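
A small illustrative example (table and column names are hypothetical) of the
redundant-filter and inferred-IsNotNull scenario described above:

{code:scala}
// Illustrative only. The inner filter c1 > 5 plus the alias a = c1 implies the
// outer filter a > 5, so constraint propagation should remove the redundant
// outer filter and infer IsNotNull(c1) for pushdown.
val base = spark.range(100).selectExpr("id AS c1", "id % 10 AS c2")
base.createOrReplaceTempView("t")

val q = spark.sql(
  """
    |SELECT a, c2
    |FROM (SELECT c1 AS a, c2 FROM t WHERE c1 > 5) x
    |WHERE a > 5
    |""".stripMargin)
q.explain(true)  // the optimized plan should contain a single c1 > 5 filter
{code}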

[jira] [Reopened] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided

2022-09-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reopened SPARK-39494:
--

> Support `createDataFrame` from a list of scalars when schema is not provided
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, DataFrame creation from a list of native Python scalars is 
> unsupported in PySpark, for example,
> {{>>> spark.createDataFrame([1, 2]).collect()}}
> {{Traceback (most recent call last):}}
> {{...}}
> {{TypeError: Can not infer schema for type: }}
> {{However, Spark DataFrame Scala API supports that:}}
> {{scala> Seq(1, 2).toDF().collect()}}
> {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. 
> See more 
> [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided

2022-09-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-39494.
--
Resolution: Won't Do

> Support `createDataFrame` from a list of scalars when schema is not provided
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, DataFrame creation from a list of native Python scalars is 
> unsupported in PySpark, for example,
> {{>>> spark.createDataFrame([1, 2]).collect()}}
> {{Traceback (most recent call last):}}
> {{...}}
> {{TypeError: Can not infer schema for type: }}
> {{However, Spark DataFrame Scala API supports that:}}
> {{scala> Seq(1, 2).toDF().collect()}}
> {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. 
> See more 
> [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40084) Upgrade Py4J from 0.10.9.5 to 0.10.9.7

2022-09-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-40084.
--
Resolution: Resolved

> Upgrade Py4J from 0.10.9.5 to 0.10.9.7
> --
>
> Key: SPARK-40084
> URL: https://issues.apache.org/jira/browse/SPARK-40084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> * Java side: Add support for Java 11/17
> Release note: https://www.py4j.org/changelog.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40084) Upgrade Py4J from 0.10.9.5 to 0.10.9.7

2022-09-19 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606813#comment-17606813
 ] 

Xinrong Meng commented on SPARK-40084:
--

Resolved by https://github.com/apache/spark/pull/37523.

> Upgrade Py4J from 0.10.9.5 to 0.10.9.7
> --
>
> Key: SPARK-40084
> URL: https://issues.apache.org/jira/browse/SPARK-40084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> * Java side: Add support for Java 11/17
> Release note: https://www.py4j.org/changelog.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40084) Upgrade Py4J from 0.10.9.5 to 0.10.9.7

2022-09-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-40084:


Assignee: BingKun Pan

> Upgrade Py4J from 0.10.9.5 to 0.10.9.7
> --
>
> Key: SPARK-40084
> URL: https://issues.apache.org/jira/browse/SPARK-40084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> * Java side: Add support for Java 11/17
> Release note: https://www.py4j.org/changelog.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39405) NumPy support in SQL

2022-09-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-39405.
--
Resolution: Resolved

> NumPy support in SQL
> 
>
> Key: SPARK-39405
> URL: https://issues.apache.org/jira/browse/SPARK-39405
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> NumPy is the fundamental package for scientific computing with Python. It is 
> very commonly used, especially in the data science world. For example, Pandas 
> is backed by NumPy, and tensors also support interchangeable conversion 
> from/to NumPy arrays. 
>  
> However, PySpark only supports Python built-in types, with the exception of 
> “SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”. 
>  
> This issue has been raised multiple times internally and externally; see also 
> SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857.
>  
> With NumPy support in SQL, we expect broader adoption from data scientists and 
> newcomers leveraging their existing background and codebase with NumPy.
>  
> See more 
> [https://docs.google.com/document/d/1WsBiHoQB3UWERP47C47n_frffxZ9YIoGRwXSwIeMank/edit#]
> .



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39405) NumPy support in SQL

2022-09-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-39405:


Assignee: Xinrong Meng

> NumPy support in SQL
> 
>
> Key: SPARK-39405
> URL: https://issues.apache.org/jira/browse/SPARK-39405
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> NumPy is the fundamental package for scientific computing with Python. It is 
> very commonly used, especially in the data science world. For example, Pandas 
> is backed by NumPy, and tensors also support interchangeable conversion 
> from/to NumPy arrays. 
>  
> However, PySpark only supports Python built-in types, with the exception of 
> “SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”. 
>  
> This issue has been raised multiple times internally and externally; see also 
> SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857.
>  
> With NumPy support in SQL, we expect broader adoption from data scientists and 
> newcomers leveraging their existing background and codebase with NumPy.
>  
> See more 
> [https://docs.google.com/document/d/1WsBiHoQB3UWERP47C47n_frffxZ9YIoGRwXSwIeMank/edit#]
> .



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39745) Accept a list that contains NumPy scalars in `createDataFrame`

2022-09-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-39745.
--
Resolution: Won't Do

> Accept a list that contains NumPy scalars in `createDataFrame`
> --
>
> Key: SPARK-39745
> URL: https://issues.apache.org/jira/browse/SPARK-39745
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, only lists of native Python scalars are accepted in 
> `createDataFrame`.
> We should support Numpy scalars as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40466) Improve the error message if the DSv2 source is disabled but DSv1 streaming source is not available

2022-09-19 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-40466:


Assignee: Huanli Wang

> Improve the error message if the DSv2 source is disabled but DSv1 streaming 
> source is not available
> ---
>
> Key: SPARK-40466
> URL: https://issues.apache.org/jira/browse/SPARK-40466
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Huanli Wang
>Assignee: Huanli Wang
>Priority: Minor
>
> If the V2 data source is disabled, the current behavior falls back to the V1 
> data source, but it throws an error when the DSv1 source is not available. 
> Update the error message to indicate which config variable 
> (spark.sql.streaming.disabledV2MicroBatchReaders) needs to be modified in 
> order to enable the V2 data source.
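
For reference, a minimal sketch of inspecting and clearing that config from a
session; whether a given reader falls back depends on the source, and the empty
value shown here simply re-enables all V2 micro-batch readers.

{code:scala}
// Illustrative only: list the currently disabled V2 micro-batch readers and
// re-enable them all by clearing the list.
val key = "spark.sql.streaming.disabledV2MicroBatchReaders"
println(spark.conf.get(key, ""))
spark.conf.set(key, "")  // an empty list disables nothing
{code}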



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40466) Improve the error message if the DSv2 source is disabled but DSv1 streaming source is not available

2022-09-19 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-40466.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37917
[https://github.com/apache/spark/pull/37917]

> Improve the error message if the DSv2 source is disabled but DSv1 streaming 
> source is not available
> ---
>
> Key: SPARK-40466
> URL: https://issues.apache.org/jira/browse/SPARK-40466
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Huanli Wang
>Assignee: Huanli Wang
>Priority: Minor
> Fix For: 3.4.0
>
>
> If the V2 data source is disabled, the current behavior falls back to the V1 
> data source, but it throws an error when the DSv1 source is not available. 
> Update the error message to indicate which config variable 
> (spark.sql.streaming.disabledV2MicroBatchReaders) needs to be modified in 
> order to enable the V2 data source.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40286) Load Data from S3 deletes data source file

2022-09-19 Thread Drew (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew reopened SPARK-40286:
--

> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using spark to [load 
> data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into 
> a hive table through PySpark, and when I load data from a path in Amazon S3, 
> the original file is getting wiped from the directory. The file is found, and 
> it populates the table with data. I also tried to add the `LOCAL` clause, but 
> that throws an error when looking for the file. The documentation doesn't 
> explicitly state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3

2022-09-19 Thread Drew (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew reopened SPARK-40287:
--

> Load Data using Spark by a single partition moves entire dataset under same 
> location in S3
> --
>
> Key: SPARK-40287
> URL: https://issues.apache.org/jira/browse/SPARK-40287
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello,
> I'm experiencing an issue in PySpark when creating a hive table and loading 
> data into it. I'm using an Amazon S3 bucket as the data location, creating the 
> table as parquet, and trying to load data into it by a single partition, and 
> I'm seeing some weird behavior. When selecting the S3 data location of a 
> parquet file to load into my table, all of the data is moved into the location 
> specified in my create table command, including the partitions I didn't 
> specify in the load data command. For example:
> {code:java}
> # create a data frame in pyspark with partitions
> df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], 
> ["c1", "c2", "p"])
> # save it to S3
> df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
> {code}
> In the current state S3 should have a new folder `data` with two folders 
> which contain a parquet file in each partition. 
>   
>  - s3://bucket/data/p=x/
>     - part-1.snappy.parquet
>  - s3://bucket/data/p=y/
>     - part-2.snappy.parquet
>     - part-3.snappy.parquet
>  
> {code:java}
> # create new table
> spark.sql("create table src (c1 string,c2 int) PARTITIONED BY (p string) 
> STORED AS parquet LOCATION 's3://bucket/new/'")
> # load the saved table data from s3 specifying single partition value x
> spark.sql("LOAD DATA INPATH 's3://bucket/data/'INTO TABLE src PARTITION 
> (p='x')")
> spark.sql("select * from src").show()
> # output: 
> # +---+---+---+
> # | c1| c2|  p|
> # +---+---+---+
> # +---+---+---+
> {code}
> After running the `load data` command and looking at the table, I'm left with 
> no data loaded in. When checking S3, the data source we saved earlier has been 
> moved under `s3://bucket/new/`; oddly enough, it also brought over the other 
> partitions along with it. The directory structure is listed below. 
> - s3://bucket/new/
>     - p=x/
>         - p=x/
>             - part-1.snappy.parquet
>         - p=y/
>             - part-2.snappy.parquet
>             - part-3.snappy.parquet
> Is this the intended behavior when loading the data in from a partitioned 
> parquet file? Is the previous file supposed to be moved/deleted from the 
> source directory? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39991) AQE should use available column statistics from completed query stages

2022-09-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39991:
-

Assignee: Andy Grove

> AQE should use available column statistics from completed query stages
> --
>
> Key: SPARK-39991
> URL: https://issues.apache.org/jira/browse/SPARK-39991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> In QueryStageExec.computeStats we copy partial statistics from materialized 
> query stages by calling QueryStageExec#getRuntimeStatistics, which in turn 
> calls ShuffleExchangeLike#runtimeStatistics or 
> BroadcastExchangeLike#runtimeStatistics.
> Only dataSize and numOutputRows are copied into the new Statistics object:
>  {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
>     val runtimeStats = getRuntimeStatistics
>     val dataSize = runtimeStats.sizeInBytes.max(0)
>     val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>     Some(Statistics(dataSize, numOutputRows, isRuntime = true))
>   } else {
>     None
>   }
> {code}
> I would like to also copy over the column statistics stored in 
> Statistics.attributeMap so that they can be fed back into the logical plan 
> optimization phase. This is a small change as shown below:
> {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
> val runtimeStats = getRuntimeStatistics
> val dataSize = runtimeStats.sizeInBytes.max(0)
> val numOutputRows = runtimeStats.rowCount.map(_.max(0))
> val attributeStats = runtimeStats.attributeStats
> Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = 
> true))
>   } else {
> None
>   }
> {code}
> The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
> not currently provide such column statistics, but other custom 
> implementations can.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39991) AQE should use available column statistics from completed query stages

2022-09-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-39991.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37424
[https://github.com/apache/spark/pull/37424]

> AQE should use available column statistics from completed query stages
> --
>
> Key: SPARK-39991
> URL: https://issues.apache.org/jira/browse/SPARK-39991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.4.0
>
>
> In QueryStageExec.computeStats we copy partial statistics from materialized 
> query stages by calling QueryStageExec#getRuntimeStatistics, which in turn 
> calls ShuffleExchangeLike#runtimeStatistics or 
> BroadcastExchangeLike#runtimeStatistics.
> Only dataSize and numOutputRows are copied into the new Statistics object:
>  {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
>     val runtimeStats = getRuntimeStatistics
>     val dataSize = runtimeStats.sizeInBytes.max(0)
>     val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>     Some(Statistics(dataSize, numOutputRows, isRuntime = true))
>   } else {
>     None
>   }
> {code}
> I would like to also copy over the column statistics stored in 
> Statistics.attributeMap so that they can be fed back into the logical plan 
> optimization phase. This is a small change as shown below:
> {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
> val runtimeStats = getRuntimeStatistics
> val dataSize = runtimeStats.sizeInBytes.max(0)
> val numOutputRows = runtimeStats.rowCount.map(_.max(0))
> val attributeStats = runtimeStats.attributeStats
> Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = 
> true))
>   } else {
> None
>   }
> {code}
> The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
> not currently provide such column statistics, but other custom 
> implementations can.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40477:


Assignee: (was: Apache Spark)

> Support `NullType` in `ColumnarBatchRow`
> 
>
> Key: SPARK-40477
> URL: https://issues.apache.org/jira/browse/SPARK-40477
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> `ColumnarBatchRow.get()` does not support `NullType` currently. Support 
> `NullType` in `ColumnarBatchRow` so that `NullType` can be a partition column 
> type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606744#comment-17606744
 ] 

Apache Spark commented on SPARK-40477:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/37934

> Support `NullType` in `ColumnarBatchRow`
> 
>
> Key: SPARK-40477
> URL: https://issues.apache.org/jira/browse/SPARK-40477
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> `ColumnarBatchRow.get()` does not support `NullType` currently. Support 
> `NullType` in `ColumnarBatchRow` so that `NullType` can be a partition column 
> type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606742#comment-17606742
 ] 

Apache Spark commented on SPARK-40477:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/37934

> Support `NullType` in `ColumnarBatchRow`
> 
>
> Key: SPARK-40477
> URL: https://issues.apache.org/jira/browse/SPARK-40477
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> `ColumnarBatchRow.get()` does not support `NullType` currently. Support 
> `NullType` in `ColumnarBatchRow` so that `NullType` can be a partition column 
> type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40477:


Assignee: Apache Spark

> Support `NullType` in `ColumnarBatchRow`
> 
>
> Key: SPARK-40477
> URL: https://issues.apache.org/jira/browse/SPARK-40477
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Apache Spark
>Priority: Minor
>
> `ColumnarBatchRow.get()` does not support `NullType` currently. Support 
> `NullType` in `ColumnarBatchRow` so that `NullType` can be a partition column 
> type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606684#comment-17606684
 ] 

Apache Spark commented on SPARK-40474:
--

User 'xiaonanyang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37933

> Infer columns with mixed date and timestamp as String in CSV schema inference
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
>
> In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we 
> introduced the support of date type in CSV schema inference. The schema 
> inference behavior on date time columns now is:
>  * For a column only containing dates, we will infer it as Date type
>  * For a column only containing timestamps, we will infer it as Timestamp type
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as Timestamp type
> However, we found that we were too ambitious about the last scenario: 
> supporting it introduced much complexity in the code and caused a lot of 
> performance concerns. Thus, we want to simplify the behavior of the last 
> scenario as follows:
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as String type
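
An illustrative sketch of the proposed behavior; the file path and contents are
hypothetical:

{code:scala}
// Illustrative only. Given a CSV column mixing dates and timestamps, e.g.
//   2022-09-19
//   2022-09-19 10:00:00
// the proposal is to infer StringType for that column rather than TimestampType.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/mixed_date_timestamp.csv")  // hypothetical path
df.printSchema()  // under this proposal the mixed column comes back as string
{code}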



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40474:


Assignee: (was: Apache Spark)

> Infer columns with mixed date and timestamp as String in CSV schema inference
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
>
> In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we 
> introduced the support of date type in CSV schema inference. The schema 
> inference behavior on date time columns now is:
>  * For a column only containing dates, we will infer it as Date type
>  * For a column only containing timestamps, we will infer it as Timestamp type
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as Timestamp type
> However, we found that we were too ambitious about the last scenario: 
> supporting it introduced much complexity in the code and caused a lot of 
> performance concerns. Thus, we want to simplify the behavior of the last 
> scenario as follows:
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as String type



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606683#comment-17606683
 ] 

Apache Spark commented on SPARK-40474:
--

User 'xiaonanyang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37933

> Infer columns with mixed date and timestamp as String in CSV schema inference
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
>
> In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we 
> introduced the support of date type in CSV schema inference. The schema 
> inference behavior on date time columns now is:
>  * For a column only containing dates, we will infer it as Date type
>  * For a column only containing timestamps, we will infer it as Timestamp type
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as Timestamp type
> However, we found that we were too ambitious about the last scenario: 
> supporting it introduced much complexity in the code and caused a lot of 
> performance concerns. Thus, we want to simplify the behavior of the last 
> scenario as follows:
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as String type



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40474:


Assignee: Apache Spark

> Infer columns with mixed date and timestamp as String in CSV schema inference
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Assignee: Apache Spark
>Priority: Major
>
> In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we 
> introduced the support of date type in CSV schema inference. The schema 
> inference behavior on date time columns now is:
>  * For a column only containing dates, we will infer it as Date type
>  * For a column only containing timestamps, we will infer it as Timestamp type
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as Timestamp type
> However, we found that we were too ambitious about the last scenario: 
> supporting it introduced much complexity in the code and caused a lot of 
> performance concerns. Thus, we want to simplify the behavior of the last 
> scenario as follows:
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as String type



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40460) Streaming metrics is zero when select _metadata

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606679#comment-17606679
 ] 

Apache Spark commented on SPARK-40460:
--

User 'Yaohua628' has created a pull request for this issue:
https://github.com/apache/spark/pull/37932

> Streaming metrics is zero when select _metadata
> ---
>
> Key: SPARK-40460
> URL: https://issues.apache.org/jira/browse/SPARK-40460
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
> Fix For: 3.4.0
>
>
> Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting 
> the `_metadata` column, because the logical plan from the batch and the actual 
> planned logical plan are mismatched: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]
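
A minimal reproduction sketch, assuming a file-based streaming source where the
hidden `_metadata` column is available; the input path and schema are
placeholders.

{code:scala}
// Illustrative only: select _metadata alongside the data columns; with this
// bug, lastProgress.processedRowsPerSecond stays at 0 even though rows flow.
val stream = spark.readStream
  .format("json")
  .schema("id LONG, value STRING")
  .load("/tmp/stream-input")              // placeholder input directory
  .select("*", "_metadata")

val query = stream.writeStream
  .format("memory")
  .queryName("metadata_repro")
  .start()
query.processAllAvailable()
println(query.lastProgress.processedRowsPerSecond)
query.stop()
{code}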



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40460) Streaming metrics is zero when select _metadata

2022-09-19 Thread Yaohua Zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606678#comment-17606678
 ] 

Yaohua Zhao commented on SPARK-40460:
-

[~kabhwan] You are right! Updated

> Streaming metrics is zero when select _metadata
> ---
>
> Key: SPARK-40460
> URL: https://issues.apache.org/jira/browse/SPARK-40460
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
> Fix For: 3.4.0
>
>
> Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting 
> the `_metadata` column, because the logical plan from the batch and the actual 
> planned logical plan are mismatched: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata

2022-09-19 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-40460:

Affects Version/s: 3.4.0

> Streaming metrics is zero when select _metadata
> ---
>
> Key: SPARK-40460
> URL: https://issues.apache.org/jira/browse/SPARK-40460
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
> Fix For: 3.4.0
>
>
> Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting 
> the `_metadata` column, because the logical plan from the batch and the actual 
> planned logical plan are mismatched: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata

2022-09-19 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-40460:

Affects Version/s: 3.3.1
   3.3.2
   (was: 3.2.0)
   (was: 3.2.1)
   (was: 3.2.2)

> Streaming metrics is zero when select _metadata
> ---
>
> Key: SPARK-40460
> URL: https://issues.apache.org/jira/browse/SPARK-40460
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
> Fix For: 3.4.0
>
>
> Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting 
> the `_metadata` column, because the logical plan from the batch and the actual 
> planned logical plan are mismatched: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40484) Upgrade log4j2 to 2.19.0

2022-09-19 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-40484.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37926
[https://github.com/apache/spark/pull/37926]

> Upgrade log4j2 to 2.19.0
> 
>
> Key: SPARK-40484
> URL: https://issues.apache.org/jira/browse/SPARK-40484
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40484) Upgrade log4j2 to 2.19.0

2022-09-19 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-40484:
---

Assignee: Yang Jie

> Upgrade log4j2 to 2.19.0
> 
>
> Key: SPARK-40484
> URL: https://issues.apache.org/jira/browse/SPARK-40484
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-19 Thread Piotr Karwasz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606671#comment-17606671
 ] 

Piotr Karwasz commented on SPARK-40489:
---

Since {{StaticLoggerBinder}} is not API, but {{LoggerFactory}} is, replacing 
the code with:
{code:java}
val binderClass = LoggerFactory.getILoggerFactory.getClass.getName
{code}
should work on every version of {{{}SLF4J{}}}.

Dropping Log4j 1.x support might be another solution.
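
A minimal sketch of what that replacement check in
{{org.apache.spark.internal.Logging}} could look like under this suggestion
(not the actual patch):

{code:scala}
import org.slf4j.LoggerFactory

private def isLog4j2(): Boolean = {
  // Works on both SLF4J 1.x and 2.x: inspect the class of the bound
  // ILoggerFactory instead of the StaticLoggerBinder singleton, which was
  // removed in SLF4J 2.x.
  val binderClass = LoggerFactory.getILoggerFactory.getClass.getName
  "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass)
}
{code}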

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Critical
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.log(Logging.scala:53)
> at org.apache.spark.internal.Logging.log$(Logging.scala:51)
> at org.apache.spark.SparkContext.log(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
> at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
> at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
> at org.apache.spark.SparkContext.(SparkContext.scala:195)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
> at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
> at scala.Option.getOrElse(Option.scala:201)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
> {noformat}
> This is because Spark is playing low-level tricks to find out if the logging 
> platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.
> {code}
>   private def isLog4j2(): Boolean = {
> // This distinguishes the log4j 1.2 binding, currently
> // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, 
> currently
> // org.apache.logging.slf4j.Log4jLoggerFactory
> val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
> "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass)
>   }
> {code}
> Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark 
> should not be using {{StaticLoggerBinder}} to do that detection. There are 
> many other approaches. (The code itself suggests one approach: 
> {{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to 
> see if the root logger actually is a {{Log4jLogger}}. There may be even 
> better approaches.)
> The other big problem is relying on the Log4J classes themselves. By relying 
> on 

[jira] [Updated] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-19 Thread Garret Wilson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Garret Wilson updated SPARK-40489:
--
Description: 
Spark breaks fundamentally with SLF4J 2.x because it uses 
{{StaticLoggerBinder}}.

SLF4J is the logging facade that is meant to shield the application from the 
implementation, whether it be Log4J or Logback or whatever. Historically SLF4J 
1.x used a bad approach to configuration: it used a {{StaticLoggerBinder}} (a 
global static singleton instance) rather than the Java {{ServiceLoader}} 
mechanism.

SLF4J 2.x, which has been in development for years, has been released. It 
finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
compatible; an application just needs to use the latest Log4J/Logback 
implementation which has the service loader.

*Above all the application must _not_ use the low-level {{StaticLoggerBinder}} 
method, because it has been removed!*

Unfortunately 
[{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
 uses {{StaticLoggerBinder}} and completely breaks any environment using SLF4J 
2.x. For example, in my application, I have pulled in the SLF4J 2.x API and 
pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark breaks 
completely just trying to get a Spark session:

{noformat}
Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
at 
org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
at 
org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
at 
org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
at 
org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
at org.apache.spark.internal.Logging.log(Logging.scala:53)
at org.apache.spark.internal.Logging.log$(Logging.scala:51)
at org.apache.spark.SparkContext.log(SparkContext.scala:84)
at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
at org.apache.spark.SparkContext.(SparkContext.scala:195)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
at scala.Option.getOrElse(Option.scala:201)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
{noformat}

This is because Spark is playing low-level tricks to find out if the logging 
platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.

{code}
  private def isLog4j2(): Boolean = {
// This distinguishes the log4j 1.2 binding, currently
// org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, currently
// org.apache.logging.slf4j.Log4jLoggerFactory
val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
"org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass)
  }
{code}

Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark 
should not be using {{StaticLoggerBinder}} to do that detection. There are many 
other approaches. (The code itself suggests one approach: 
{{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to see 
if the root logger actually is a {{Log4jLogger}}. There may be even better 
approaches.)

The other big problem is relying on the Log4J classes themselves. By relying on 
those classes, you force me to bring in Log4J as a dependency, which in the 
latest versions will register themselves with the service loader mechanism, 
causing conflicting SLF4J implementations.

It is paramount that you:

* Remove all reliance on {{StaticLoggerBinder}}. If you must must must use it, 
please check for it using reflection (a minimal sketch follows below)!
* Remove all static references to the Log4J classes. (In an ideal world you 
wouldn't even be doing Log4J-specific things anyway.) If you must must must do 
Log4J-specific things, access the classes via reflection; don't statically link 
them in the code.
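
A minimal reflection-based sketch of the first point, under the assumption that the 
class names below are only illustrative; the point is that neither 
{{StaticLoggerBinder}} nor any Log4J class is linked at compile time:

{code}
object ReflectiveLoggingProbe {
  private def classPresent(name: String): Boolean =
    try { Class.forName(name); true }
    catch { case _: ClassNotFoundException => false }

  // SLF4J 1.x static binding present?
  def hasStaticLoggerBinder: Boolean =
    classPresent("org.slf4j.impl.StaticLoggerBinder")

  // Log4J 2.x core present? (illustrative class name)
  def hasLog4j2Core: Boolean =
    classPresent("org.apache.logging.log4j.core.LoggerContext")
}
{code}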

The current situation absolutely (and unnecessarily) 100% breaks the use of 
SLF4J 2.x.

  was:
Spark breaks fundamentally with SLF4J 2.x because it uses 
{{StaticLoggerBinder}}.

SLF4J is the logging facade 

[jira] [Updated] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-19 Thread Garret Wilson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Garret Wilson updated SPARK-40489:
--
Summary: Spark 3.3.0 breaks with SFL4J 2.  (was: Spark 3.3.0 breaks SFL4J 
2.)

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Critical
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.log(Logging.scala:53)
> at org.apache.spark.internal.Logging.log$(Logging.scala:51)
> at org.apache.spark.SparkContext.log(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
> at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
> at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
> at org.apache.spark.SparkContext.(SparkContext.scala:195)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
> at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
> at scala.Option.getOrElse(Option.scala:201)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
> {noformat}
> This is because Spark is playing low-level tricks to find out if the logging 
> platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.
> Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark 
> should not be using {{StaticLoggerBinder}} to do that detection. There are 
> many other approaches. (The code itself suggests one approach: 
> {{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to 
> see if the root logger actually is a {{Log4jLogger}}. There may be even 
> better approaches.)
> The other big problem is relying on the Log4J classes themselves. By relying 
> on those classes, you force me to bring in Log4J as a dependency, which in 
> the latest versions will register themselves with the service loader 
> mechanism, causing conflicting SLF4J implementations.
> It is paramount that you:
> * Remove all reliance on {{StaticLoggerBinder}}. If you must must must use 
> it, please check for it using reflection!
> * Remove all static references to the Log4J classes. (In an ideal world you 
> wouldn't even be doing Log4J-specific things anyway.) If you must must must 
> do Log4J-specific things, access the classes via reflection; don't statically 
> link them in the code.
> The current situation absolutely (and 

[jira] [Updated] (SPARK-40489) Spark 3.3.0 breaks SFL4J 2.

2022-09-19 Thread Garret Wilson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Garret Wilson updated SPARK-40489:
--
Description: 
Spark breaks fundamentally with SLF4J 2.x because it uses 
{{StaticLoggerBinder}}.

SLF4J is the logging facade that is meant to shield the application from the 
implementation, whether it be Log4J or Logback or whatever. Historically SLF4J 
1.x used a bad approach to configuration: it used a {{StaticLoggerBinder}} (a 
global static singleton instance) rather than the Java {{ServiceLoader}} 
mechanism.

SLF4J 2.x, which has been in development for years, has been released. It 
finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
compatible; an application just needs to use the latest Log4J/Logback 
implementation which has the service loader.

**Above all the application must _not_ use the low-level {{StaticLoggerBinder}} 
method, because it has been removed!**

Unfortunately 
[{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
 uses {{StaticLoggerBinder}} and completely breaks any environment using SLF4J 
2.x. For example, in my application, I have pulled in the SLF4J 2.x API and 
pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark breaks 
completely just trying to get a Spark session:

{noformat}
Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
at 
org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
at 
org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
at 
org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
at 
org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
at org.apache.spark.internal.Logging.log(Logging.scala:53)
at org.apache.spark.internal.Logging.log$(Logging.scala:51)
at org.apache.spark.SparkContext.log(SparkContext.scala:84)
at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
at org.apache.spark.SparkContext.(SparkContext.scala:195)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
at scala.Option.getOrElse(Option.scala:201)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
{noformat}

This is because Spark is playing low-level tricks to find out if the logging 
platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.

Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark 
should not be using {{StaticLoggerBinder}} to do that detection. There are many 
other approaches. (The code itself suggests one approach: 
{{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to see 
if the root logger actually is a {{Log4jLogger}}. There may be even better 
approaches.)

The other big problem is relying on the Log4J classes themselves. By relying on 
those classes, you force me to bring in Log4J as a dependency, which in the 
latest versions will register themselves with the service loader mechanism, 
causing conflicting SLF4J implementations.

It is paramount that you:

* Remove all reliance on {{StaticLoggerBinder}}. If you must must must use it, 
please check for it using reflection!
* Remove all static references to the Log4J classes. (In an ideal world you 
wouldn't even be doing Log4J-specific things anyway.) If you must must must do 
Log4J-specific things, access the classes via reflection; don't statically link 
them in the code.

The current situation absolutely (and unnecessarily) 100% breaks the use of 
SLF4J 2.x.

  was:
Spark breaks fundamentally with SLF4J 2.x because it uses 
{{StaticLoggerBinder}}.

SLF4J is the logging facade that is meant to shield the application from the 
implementation, whether it be Log4J or Logback or whatever. Historically SLF4J 
1.x used a bad approach to configuration: it used a {{StaticLoggerBinder}} (a 
global static singleton instance) rather than the Java {{ServiceLoader}} 
mechanism.

SLF4J 2.x, which has been in development for years, has been released. It 
finally switches to use 

[jira] [Updated] (SPARK-40489) Spark 3.3.0 breaks SFL4J 2.

2022-09-19 Thread Garret Wilson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Garret Wilson updated SPARK-40489:
--
Description: 
Spark breaks fundamentally with SLF4J 2.x because it uses 
{{StaticLoggerBinder}}.

SLF4J is the logging facade that is meant to shield the application from the 
implementation, whether it be Log4J or Logback or whatever. Historically SLF4J 
1.x used a bad approach to configuration: it used a {{StaticLoggerBinder}} (a 
global static singleton instance) rather than the Java {{ServiceLoader}} 
mechanism.

SLF4J 2.x, which has been in development for years, has been released. It 
finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
compatible; an application just needs to use the latest Log4J/Logback 
implementation which has the service loader.

*Above all the application must _not_ use the low-level {{StaticLoggerBinder}} 
method, because it has been removed!*

Unfortunately 
[{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
 uses {{StaticLoggerBinder}} and completely breaks any environment using SLF4J 
2.x. For example, in my application, I have pulled in the SLF4J 2.x API and 
pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark breaks 
completely just trying to get a Spark session:

{noformat}
Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
at 
org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
at 
org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
at 
org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
at 
org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
at org.apache.spark.internal.Logging.log(Logging.scala:53)
at org.apache.spark.internal.Logging.log$(Logging.scala:51)
at org.apache.spark.SparkContext.log(SparkContext.scala:84)
at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
at org.apache.spark.SparkContext.(SparkContext.scala:195)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
at scala.Option.getOrElse(Option.scala:201)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
{noformat}

This is because Spark is playing low-level tricks to find out if the logging 
platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.

Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark 
should not be using {{StaticLoggerBinder}} to do that detection. There are many 
other approaches. (The code itself suggests one approach: 
{{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to see 
if the root logger actually is a {{Log4jLogger}}. There may be even better 
approaches.)

The other big problem is relying on the Log4J classes themselves. By relying on 
those classes, you force me to bring in Log4J as a dependency, which in the 
latest versions will register themselves with the service loader mechanism, 
causing conflicting SLF4J implementations.

It is paramount that you:

* Remove all reliance on {{StaticLoggerBinder}}. If you must must must use it, 
please check for it using reflection!
* Remove all static references to the Log4J classes. (In an ideal world you 
wouldn't even be doing Log4J-specific things anyway.) If you must must must do 
Log4J-specific things, access the classes via reflection; don't statically link 
them in the code.

The current situation absolutely (and unnecessarily) 100% breaks the use of 
SLF4J 2.x.

  was:
Spark breaks fundamentally with SLF4J 2.x because it uses 
{{StaticLoggerBinder}}.

SLF4J is the logging facade that is meant to shield the application from the 
implementation, whether it be Log4J or Logback or whatever. Historically SLF4J 
1.x used a bad approach to configuration: it used a {{StaticLoggerBinder}} (a 
global static singleton instance) rather than the Java {{ServiceLoader}} 
mechanism.

SLF4J 2.x, which has been in development for years, has been released. It 
finally switches to use the 

[jira] [Updated] (SPARK-40489) Spark 3.3.0 breaks SFL4J 2.

2022-09-19 Thread Garret Wilson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Garret Wilson updated SPARK-40489:
--
Description: 
Spark breaks fundamentally with SLF4J 2.x because it uses 
{{StaticLoggerBinder}}.

SLF4J is the logging facade that is meant to shield the application from the 
implementation, whether it be Log4J or Logback or whatever. Historically SLF4J 
1.x used a bad approach to configuration: it used a {{StaticLoggerBinder}} (a 
global static singleton instance) rather than the Java {{ServiceLoader}} 
mechanism.

SLF4J 2.x, which has been in development for years, has been released. It 
finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
compatible; an application just needs to use the latest Log4J/Logback 
implementation which has the service loader.

**Above all the application must _not_ use the low-level {{StaticLoggerBinder}} 
method, because it has been removed!**

Unfortunately 
[{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
 uses {{StaticLoggerBinder}} and completely breaks any environment using SLF4J 
2.x. For example, in my application, I have pulled in the SLF4J 2.x API and 
pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark breaks 
completely just trying to get a Spark session:

{noformat}
Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
at 
org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
at 
org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
at 
org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
at 
org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
at org.apache.spark.internal.Logging.log(Logging.scala:53)
at org.apache.spark.internal.Logging.log$(Logging.scala:51)
at org.apache.spark.SparkContext.log(SparkContext.scala:84)
at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
at org.apache.spark.SparkContext.(SparkContext.scala:195)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
at scala.Option.getOrElse(Option.scala:201)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
{noformat}

This is because Spark is playing low-level tricks to find out if the logging 
platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.

Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark 
should not be using {{StaticLoggerBinder}} to do that detection. There are many 
other approaches. (The code itself suggests one approach: 
{{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to see 
if the root logger actually is a {{Log4jLogger}}. There may be even better 
approaches.)

The other big problem is relying on the Log4J classes themselves. By relying on 
those classes, you force me to bring in Log4J as a dependency, which in the 
latest versions will register themselves with the service loader mechanism, 
causing conflicting SLF4J implementations.

It is paramount that you:

* Remove all reliance on {{StaticLoggerBinder}}. If you must must must use it, 
please check for it using reflection!
* Remove all static references to the Log4J classes. (In an ideal world you 
wouldn't even be doing Log4J-specific things anyway.) If you must must must do 
Log4J-specific things, access the classes via reflection; don't statically link 
them in the code.

The current situation absolutely (and unnecessarily) 100% breaks the use of 
SLF4J 2.x.

  was:
Spark breaks fundamentally with SLF4J 2.x because it uses 
{{StaticLoggerBinder}}.

SLF4J is the logging facade that is meant to shield the application from the 
implementation, whether it be Log4J or Logback or whatever. Historically SLF4J 
1.x used a bad approach to configuration: it used a {{StaticLoggerBinder}} (a 
global static singleton instance) rather than the Java {{ServiceLoader}} 
mechanism.

SLF4J 2.x, which has been in development for years, has been released. It 
finally switches to use 

[jira] [Created] (SPARK-40489) Spark 3.3.0 breaks SFL4J 2.

2022-09-19 Thread Garret Wilson (Jira)
Garret Wilson created SPARK-40489:
-

 Summary: Spark 3.3.0 breaks SFL4J 2.
 Key: SPARK-40489
 URL: https://issues.apache.org/jira/browse/SPARK-40489
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Garret Wilson


Spark breaks fundamentally with SLF4J 2.x because it uses 
{{StaticLoggerBinder}}.

SLF4J is the logging facade that is meant to shield the application from the 
implementation, whether it be Log4J or Logback or whatever. Historically SLF4J 
1.x used a bad approach to configuration: it used a {{StaticLoggerBinder}} (a 
global static singleton instance) rather than the Java {{ServiceLoader}} 
mechanism.

SLF4J 2.x, which has been in development for years, has been released. It 
finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
compatible; an application just needs to use the latest Log4J/Logback 
implementation which has the service loader.

**Above all the application must _not_ use the low-level {{StaticLoggerBinder}} 
method, because it has been removed!**

Unfortunately 
[{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
 uses {{StaticLoggerBinder}} and completely breaks any environment using SLF4J 
2.x. For example, in my application, I have pulled in the SLF4J 2.x API and 
pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark breaks 
completely just trying to get a Spark session:

{noformat}
Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
at 
org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
at 
org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
at 
org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
at 
org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
at org.apache.spark.internal.Logging.log(Logging.scala:53)
at org.apache.spark.internal.Logging.log$(Logging.scala:51)
at org.apache.spark.SparkContext.log(SparkContext.scala:84)
at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
at org.apache.spark.SparkContext.(SparkContext.scala:195)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
at scala.Option.getOrElse(Option.scala:201)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
{noformat}

This is because Spark is playing low-level tricks to find out if the logging 
platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.

Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark 
should not be using {{StaticLoggerBinder}} to do that detection. There are many 
other approaches. (The code itself suggests one approach: 
{{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to see 
if the root logger actually is a {{Log4jLogger}}. There may be even better 
approaches.)

The other big problem is relying on the Log4J classes themselves. By relying on 
those classes, you force me to bring in Log4J as a dependency, which in the 
latest versions will register themselves with the service loader mechanism, 
causing conflicting SLF4J implementations.

It is paramount that you:

* Remove all reliance on {{StaticLoggerBinder}}. If you must must must use it, 
please check for it using reflection!
* Remove all static references to the Log4J classes. (In an ideal world you 
wouldn't even be doing Log4J-specific things anyway.) If you must must must do 
Log4J-specific things, access the classes via reflection; don't statically link 
them in the code.

The current situation absolutely (and unnecessarily) 100% breaks the use of 
SLF4J 2.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40413) Column.isin produces non-boolean results

2022-09-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40413.
--
Resolution: Invalid

> Column.isin produces non-boolean results
> 
>
> Key: SPARK-40413
> URL: https://issues.apache.org/jira/browse/SPARK-40413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Andreas Franz
>Priority: Major
>
> I observed an inconsistent behaviour using the Column.isin function. The 
> [documentation|https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html#isin(list:Any*):org.apache.spark.sql.Column]
>  states that an "up-cast" takes place when different data types are 
> involved. When working with _null_ values the results are confusing to me.
> I prepared a small example demonstrating the issue
> {code:java}
> package example
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.apache.spark.sql.functions._
> object Test {
> def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder()
> .appName("App")
> .master("local[*]")
> .config("spark.driver.host", "localhost")
> .config("spark.ui.enabled", "false")
> .getOrCreate()
> val schema = StructType(
> Array(
> StructField("name", StringType, nullable = true)
> )
> )
> val data = Seq(
> Row("a"),
> Row("b"),
> Row("c"),
> Row(""),
> Row(null)
> ).toList
> val list1 = Array("a", "d", "")
> val list2 = Array("a", "d", "", null)
> val dataFrame = 
> spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> dataFrame
> .withColumn("name_is_in_list_1", col("name").isin(list1: _*))
> .show(10, truncate = false)
> /*
> +----+-----------------+
> |name|name_is_in_list_1|
> +----+-----------------+
> |a   |true             |
> |b   |false            |
> |c   |false            |
> |    |true             |
> |null|null             | // check value null is not contained in list1, why is null returned here? Expected result: false
> +----+-----------------+
>  */
> dataFrame
> .withColumn("name_is_in_list_2", col("name").isin(list2: _*))
> .show(10, truncate = false)
> /*
> +----+-----------------+
> |name|name_is_in_list_2|
> +----+-----------------+
> |a   |true             |
> |b   |null             | // check value "b" is not contained in list2, why is null returned here? Expected result: false
> |c   |null             | // check value "c" is not contained in list2, why is null returned here? Expected result: false
> |    |true             |
> |null|null             | // check value null is in list2, why is null returned here? Expected result: true
> +----+-----------------+
>  */
> val data2 = Seq(
> Row("a"),
> Row("b"),
> Row("c"),
> Row(""),
> ).toList
> val dataFrame2 = 
> spark.createDataFrame(spark.sparkContext.parallelize(data2), schema)
> dataFrame2
> .withColumn("name_is_in_list_2", col("name").isin(list2: _*))
> .show(10, truncate = false)
> 
> /*
> +----+-----------------+
> |name|name_is_in_list_2|
> +----+-----------------+
> |a   |true             |
> |b   |null             | // check value "b" is not contained in list2, why is null returned here? Expected result: false
> |c   |null             | // check value "c" is not contained in list2, why is null returned here? Expected result: false
> |    |true             |
> +----+-----------------+
>  */
> }
> }{code}
>  
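For context, the behaviour reported above matches standard SQL three-valued logic for 
IN, which is presumably why the issue was resolved as Invalid: a comparison against 
NULL yields NULL (unknown) rather than false, so {{x IN (..., null)}} is NULL whenever 
there is no positive match, and {{NULL IN (...)}} is always NULL. A minimal sketch, 
reusing the reporter's {{dataFrame}} and {{list2}}, of coercing the unknown result back 
to a plain boolean:

{code:java}
import org.apache.spark.sql.functions.{coalesce, col, lit}

// isin follows SQL IN semantics: NULL operands make the result NULL (unknown).
// coalesce maps that unknown back to false when a strict boolean is wanted;
// if a null name should itself count as a match, add an explicit isNull check.
val result = dataFrame.withColumn(
  "name_is_in_list_2",
  coalesce(col("name").isin(list2: _*), lit(false)))
{code}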



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40456) PartitionIterator.hasNext should be cheap to call repeatedly

2022-09-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40456.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

> PartitionIterator.hasNext should be cheap to call repeatedly
> 
>
> Key: SPARK-40456
> URL: https://issues.apache.org/jira/browse/SPARK-40456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Richard Chen
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40456) PartitionIterator.hasNext should be cheap to call repeatedly

2022-09-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40456:
---

Assignee: Richard Chen  (was: Wenchen Fan)

> PartitionIterator.hasNext should be cheap to call repeatedly
> 
>
> Key: SPARK-40456
> URL: https://issues.apache.org/jira/browse/SPARK-40456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Richard Chen
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40456) PartitionIterator.hasNext should be cheap to call repeatedly

2022-09-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40456:
-
Priority: Minor  (was: Major)

> PartitionIterator.hasNext should be cheap to call repeatedly
> 
>
> Key: SPARK-40456
> URL: https://issues.apache.org/jira/browse/SPARK-40456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40294) Repeat calls to `PartitionIterator.hasNext` can timeout

2022-09-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40294.
--
Resolution: Duplicate

> Repeat calls to `PartitionIterator.hasNext` can timeout
> ---
>
> Key: SPARK-40294
> URL: https://issues.apache.org/jira/browse/SPARK-40294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Richard Chen
>Priority: Major
>
> Repeat calls to {{PartitionIterator.hasNext}} where both calls return 
> {{false}} can result in timeouts. For example, 
> {{{}KafkaBatchPartitionReader.next(){}}}, which calls {{consumer.get}} (which 
> can potentially timeout with repeat calls), is called by 
> {{{}PartitionIterator.hasNext{}}}. Thus, repeat calls to 
> {{PartitionIterator.hasNext}} by its parent could timeout.
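
For illustration only, a minimal sketch (not the actual fix) of the usual way to make 
{{hasNext}} idempotent, by buffering the probed element until {{next()}} consumes it; 
{{fetchNext()}} is a hypothetical placeholder for the underlying read, e.g. a Kafka 
poll:

{code:java}
// Cache the result of the expensive probe so that repeated hasNext calls
// do not re-trigger the underlying (possibly timing-out) fetch.
abstract class CachingRowIterator[T] extends Iterator[T] {
  private var buffered: Option[T] = None

  protected def fetchNext(): Option[T]

  override def hasNext: Boolean = {
    if (buffered.isEmpty) buffered = fetchNext()
    buffered.isDefined
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("end of partition")
    val row = buffered.get
    buffered = None
    row
  }
}
{code}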



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40488) Do not wrap exceptions thrown in FileFormatWriter.write with SparkException

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606544#comment-17606544
 ] 

Apache Spark commented on SPARK-40488:
--

User 'bozhang2820' has created a pull request for this issue:
https://github.com/apache/spark/pull/37931

> Do not wrap exceptions thrown in FileFormatWriter.write with SparkException
> ---
>
> Key: SPARK-40488
> URL: https://issues.apache.org/jira/browse/SPARK-40488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bo Zhang
>Priority: Major
>
> Exceptions thrown in FileFormatWriter.write are wrapped with 
> SparkException("Job aborted.").
> This wrapping provides little extra information, but generates a long 
> stacktrace, which hinders debugging when an error happens.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40488) Do not wrap exceptions thrown in FileFormatWriter.write with SparkException

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40488:


Assignee: Apache Spark

> Do not wrap exceptions thrown in FileFormatWriter.write with SparkException
> ---
>
> Key: SPARK-40488
> URL: https://issues.apache.org/jira/browse/SPARK-40488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bo Zhang
>Assignee: Apache Spark
>Priority: Major
>
> Exceptions thrown in FileFormatWriter.write are wrapped with 
> SparkException("Job aborted.").
> This wrapping provides little extra information, but generates a long 
> stacktrace, which hinders debugging when an error happens.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40488) Do not wrap exceptions thrown in FileFormatWriter.write with SparkException

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40488:


Assignee: (was: Apache Spark)

> Do not wrap exceptions thrown in FileFormatWriter.write with SparkException
> ---
>
> Key: SPARK-40488
> URL: https://issues.apache.org/jira/browse/SPARK-40488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bo Zhang
>Priority: Major
>
> Exceptions thrown in FileFormatWriter.write are wrapped with 
> SparkException("Job aborted.").
> This wrapping provides little extra information, but generates a long 
> stacktrace, which hinders debugging when an error happens.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40419.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37873
[https://github.com/apache/spark/pull/37873]

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases 
> from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at 
> all.
> We should also leverage this to test pandas aggregate UDFs too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40419:


Assignee: Haejoon Lee

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases 
> from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at 
> all.
> We should also leverage this to test pandas aggregate UDFs too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40488) Do not wrap exceptions thrown in FileFormatWriter.write with SparkException

2022-09-19 Thread Bo Zhang (Jira)
Bo Zhang created SPARK-40488:


 Summary: Do not wrap exceptions thrown in FileFormatWriter.write 
with SparkException
 Key: SPARK-40488
 URL: https://issues.apache.org/jira/browse/SPARK-40488
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Bo Zhang


Exceptions thrown in FileFormatWriter.write are wrapped with 
SparkException("Job aborted.").

This wrapping provides little extra information, but generates a long 
stacktrace, which hinders debugging when an error happens.
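
For illustration only, a minimal sketch of the wrapping pattern described above; 
{{writeTask}} is a hypothetical placeholder, not the real FileFormatWriter code:

{code:java}
import org.apache.spark.SparkException

// Every failure inside the write path is re-thrown wrapped in
// SparkException("Job aborted."), which lengthens the stack trace
// without adding much information about the root cause.
def writeWithWrapping(writeTask: () => Unit): Unit = {
  try {
    writeTask()
  } catch {
    case cause: Throwable =>
      throw new SparkException("Job aborted.", cause)
  }
}
{code}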



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606537#comment-17606537
 ] 

Apache Spark commented on SPARK-40487:
--

User 'xingczhao' has created a pull request for this issue:
https://github.com/apache/spark/pull/37930

> Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
> ---
>
> Key: SPARK-40487
> URL: https://issues.apache.org/jira/browse/SPARK-40487
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xingchao, Zhang
>Priority: Major
>
> The 'Part 1' and 'Part 2' could run in parallel
> {code:java}
>   /**
>* The implementation for these joins:
>*
>*   LeftOuter with BuildLeft
>*   RightOuter with BuildRight
>*   FullOuter
>*/
>   private def defaultJoin(relation: Broadcast[Array[InternalRow]]): 
> RDD[InternalRow] = {
> val streamRdd = streamed.execute()
> // Part 1
> val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, 
> relation)
> val notMatchedBroadcastRows: Seq[InternalRow] = {
>   val nulls = new GenericInternalRow(streamed.output.size)
>   val buf: CompactBuffer[InternalRow] = new CompactBuffer()
>   val joinedRow = new JoinedRow
>   joinedRow.withLeft(nulls)
>   var i = 0
>   val buildRows = relation.value
>   while (i < buildRows.length) {
> if (!matchedBroadcastRows.get(i)) {
>   buf += joinedRow.withRight(buildRows(i)).copy()
> }
> i += 1
>   }
>   buf
> }
> // Part 2
> val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter =>
>   val buildRows = relation.value
>   val joinedRow = new JoinedRow
>   val nulls = new GenericInternalRow(broadcast.output.size)
>   streamedIter.flatMap { streamedRow =>
> var i = 0
> var foundMatch = false
> val matchedRows = new CompactBuffer[InternalRow]
> while (i < buildRows.length) {
>   if (boundCondition(joinedRow(streamedRow, buildRows(i)))) {
> matchedRows += joinedRow.copy()
> foundMatch = true
>   }
>   i += 1
> }
> if (!foundMatch && joinType == FullOuter) {
>   matchedRows += joinedRow(streamedRow, nulls).copy()
> }
> matchedRows.iterator
>   }
> }
> // Union
> sparkContext.union(
>   matchedStreamRows,
>   sparkContext.makeRDD(notMatchedBroadcastRows)
> )
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40487:


Assignee: Apache Spark

> Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
> ---
>
> Key: SPARK-40487
> URL: https://issues.apache.org/jira/browse/SPARK-40487
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xingchao, Zhang
>Assignee: Apache Spark
>Priority: Major
>
> The 'Part 1' and 'Part 2' could run in parallel
> {code:java}
>   /**
>* The implementation for these joins:
>*
>*   LeftOuter with BuildLeft
>*   RightOuter with BuildRight
>*   FullOuter
>*/
>   private def defaultJoin(relation: Broadcast[Array[InternalRow]]): 
> RDD[InternalRow] = {
> val streamRdd = streamed.execute()
> // Part 1
> val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, 
> relation)
> val notMatchedBroadcastRows: Seq[InternalRow] = {
>   val nulls = new GenericInternalRow(streamed.output.size)
>   val buf: CompactBuffer[InternalRow] = new CompactBuffer()
>   val joinedRow = new JoinedRow
>   joinedRow.withLeft(nulls)
>   var i = 0
>   val buildRows = relation.value
>   while (i < buildRows.length) {
> if (!matchedBroadcastRows.get(i)) {
>   buf += joinedRow.withRight(buildRows(i)).copy()
> }
> i += 1
>   }
>   buf
> }
> // Part 2
> val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter =>
>   val buildRows = relation.value
>   val joinedRow = new JoinedRow
>   val nulls = new GenericInternalRow(broadcast.output.size)
>   streamedIter.flatMap { streamedRow =>
> var i = 0
> var foundMatch = false
> val matchedRows = new CompactBuffer[InternalRow]
> while (i < buildRows.length) {
>   if (boundCondition(joinedRow(streamedRow, buildRows(i)))) {
> matchedRows += joinedRow.copy()
> foundMatch = true
>   }
>   i += 1
> }
> if (!foundMatch && joinType == FullOuter) {
>   matchedRows += joinedRow(streamedRow, nulls).copy()
> }
> matchedRows.iterator
>   }
> }
> // Union
> sparkContext.union(
>   matchedStreamRows,
>   sparkContext.makeRDD(notMatchedBroadcastRows)
> )
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606536#comment-17606536
 ] 

Apache Spark commented on SPARK-40487:
--

User 'xingczhao' has created a pull request for this issue:
https://github.com/apache/spark/pull/37930

> Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
> ---
>
> Key: SPARK-40487
> URL: https://issues.apache.org/jira/browse/SPARK-40487
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xingchao, Zhang
>Priority: Major
>
> The 'Part 1' and 'Part 2' could run in parallel
> {code:java}
>   /**
>* The implementation for these joins:
>*
>*   LeftOuter with BuildLeft
>*   RightOuter with BuildRight
>*   FullOuter
>*/
>   private def defaultJoin(relation: Broadcast[Array[InternalRow]]): 
> RDD[InternalRow] = {
> val streamRdd = streamed.execute()
> // Part 1
> val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, 
> relation)
> val notMatchedBroadcastRows: Seq[InternalRow] = {
>   val nulls = new GenericInternalRow(streamed.output.size)
>   val buf: CompactBuffer[InternalRow] = new CompactBuffer()
>   val joinedRow = new JoinedRow
>   joinedRow.withLeft(nulls)
>   var i = 0
>   val buildRows = relation.value
>   while (i < buildRows.length) {
> if (!matchedBroadcastRows.get(i)) {
>   buf += joinedRow.withRight(buildRows(i)).copy()
> }
> i += 1
>   }
>   buf
> }
> // Part 2
> val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter =>
>   val buildRows = relation.value
>   val joinedRow = new JoinedRow
>   val nulls = new GenericInternalRow(broadcast.output.size)
>   streamedIter.flatMap { streamedRow =>
> var i = 0
> var foundMatch = false
> val matchedRows = new CompactBuffer[InternalRow]
> while (i < buildRows.length) {
>   if (boundCondition(joinedRow(streamedRow, buildRows(i)))) {
> matchedRows += joinedRow.copy()
> foundMatch = true
>   }
>   i += 1
> }
> if (!foundMatch && joinType == FullOuter) {
>   matchedRows += joinedRow(streamedRow, nulls).copy()
> }
> matchedRows.iterator
>   }
> }
> // Union
> sparkContext.union(
>   matchedStreamRows,
>   sparkContext.makeRDD(notMatchedBroadcastRows)
> )
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40487:


Assignee: (was: Apache Spark)

> Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
> ---
>
> Key: SPARK-40487
> URL: https://issues.apache.org/jira/browse/SPARK-40487
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xingchao, Zhang
>Priority: Major
>
> The 'Part 1' and 'Part 2' could run in parallel
> {code:java}
>   /**
>* The implementation for these joins:
>*
>*   LeftOuter with BuildLeft
>*   RightOuter with BuildRight
>*   FullOuter
>*/
>   private def defaultJoin(relation: Broadcast[Array[InternalRow]]): 
> RDD[InternalRow] = {
> val streamRdd = streamed.execute()
> // Part 1
> val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, 
> relation)
> val notMatchedBroadcastRows: Seq[InternalRow] = {
>   val nulls = new GenericInternalRow(streamed.output.size)
>   val buf: CompactBuffer[InternalRow] = new CompactBuffer()
>   val joinedRow = new JoinedRow
>   joinedRow.withLeft(nulls)
>   var i = 0
>   val buildRows = relation.value
>   while (i < buildRows.length) {
> if (!matchedBroadcastRows.get(i)) {
>   buf += joinedRow.withRight(buildRows(i)).copy()
> }
> i += 1
>   }
>   buf
> }
> // Part 2
> val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter =>
>   val buildRows = relation.value
>   val joinedRow = new JoinedRow
>   val nulls = new GenericInternalRow(broadcast.output.size)
>   streamedIter.flatMap { streamedRow =>
> var i = 0
> var foundMatch = false
> val matchedRows = new CompactBuffer[InternalRow]
> while (i < buildRows.length) {
>   if (boundCondition(joinedRow(streamedRow, buildRows(i)))) {
> matchedRows += joinedRow.copy()
> foundMatch = true
>   }
>   i += 1
> }
> if (!foundMatch && joinType == FullOuter) {
>   matchedRows += joinedRow(streamedRow, nulls).copy()
> }
> matchedRows.iterator
>   }
> }
> // Union
> sparkContext.union(
>   matchedStreamRows,
>   sparkContext.makeRDD(notMatchedBroadcastRows)
> )
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel

2022-09-19 Thread Xingchao, Zhang (Jira)
Xingchao, Zhang created SPARK-40487:
---

 Summary: Make defaultJoin in BroadcastNestedLoopJoinExec running 
in parallel
 Key: SPARK-40487
 URL: https://issues.apache.org/jira/browse/SPARK-40487
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Xingchao, Zhang


The 'Part 1' and 'Part 2' could run in parallel
{code:java}
  /**
   * The implementation for these joins:
   *
   *   LeftOuter with BuildLeft
   *   RightOuter with BuildRight
   *   FullOuter
   */
  private def defaultJoin(relation: Broadcast[Array[InternalRow]]): 
RDD[InternalRow] = {
val streamRdd = streamed.execute()

// Part 1
val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, 
relation)
val notMatchedBroadcastRows: Seq[InternalRow] = {
  val nulls = new GenericInternalRow(streamed.output.size)
  val buf: CompactBuffer[InternalRow] = new CompactBuffer()
  val joinedRow = new JoinedRow
  joinedRow.withLeft(nulls)
  var i = 0
  val buildRows = relation.value
  while (i < buildRows.length) {
if (!matchedBroadcastRows.get(i)) {
  buf += joinedRow.withRight(buildRows(i)).copy()
}
i += 1
  }
  buf
}

// Part 2
val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter =>
  val buildRows = relation.value
  val joinedRow = new JoinedRow
  val nulls = new GenericInternalRow(broadcast.output.size)

  streamedIter.flatMap { streamedRow =>
var i = 0
var foundMatch = false
val matchedRows = new CompactBuffer[InternalRow]

while (i < buildRows.length) {
  if (boundCondition(joinedRow(streamedRow, buildRows(i)))) {
matchedRows += joinedRow.copy()
foundMatch = true
  }
  i += 1
}

if (!foundMatch && joinType == FullOuter) {
  matchedRows += joinedRow(streamedRow, nulls).copy()
}
matchedRows.iterator
  }
}

// Union
sparkContext.union(
  matchedStreamRows,
  sparkContext.makeRDD(notMatchedBroadcastRows)
)
  }{code}
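
As an illustration of what running the two parts in parallel could look like at the 
driver side (a generic sketch under assumptions, not the change in the linked PR), two 
independent computations can be overlapped with Scala Futures; {{jobA}} and {{jobB}} 
are hypothetical placeholders:

{code:java}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Generic driver-side pattern: kick off both computations, then wait for both.
// Spark jobs submitted from different threads can run concurrently if the
// cluster has free resources.
def runConcurrently[A, B](jobA: () => A, jobB: () => B): (A, B) = {
  val fa = Future(jobA())
  val fb = Future(jobB())
  (Await.result(fa, Duration.Inf), Await.result(fb, Duration.Inf))
}
{code}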



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606528#comment-17606528
 ] 

Apache Spark commented on SPARK-40486:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37929

> Implement `spearman` and `kendall` in `DataFrame.corrwith`
> --
>
> Key: SPARK-40486
> URL: https://issues.apache.org/jira/browse/SPARK-40486
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40486:


Assignee: (was: Apache Spark)

> Implement `spearman` and `kendall` in `DataFrame.corrwith`
> --
>
> Key: SPARK-40486
> URL: https://issues.apache.org/jira/browse/SPARK-40486
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`

2022-09-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40486:


Assignee: Apache Spark

> Implement `spearman` and `kendall` in `DataFrame.corrwith`
> --
>
> Key: SPARK-40486
> URL: https://issues.apache.org/jira/browse/SPARK-40486
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`

2022-09-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606527#comment-17606527
 ] 

Apache Spark commented on SPARK-40486:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37929

> Implement `spearman` and `kendall` in `DataFrame.corrwith`
> --
>
> Key: SPARK-40486
> URL: https://issues.apache.org/jira/browse/SPARK-40486
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`

2022-09-19 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-40486:
--
Parent: SPARK-40327
Issue Type: Sub-task  (was: Improvement)

> Implement `spearman` and `kendall` in `DataFrame.corrwith`
> --
>
> Key: SPARK-40486
> URL: https://issues.apache.org/jira/browse/SPARK-40486
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`

2022-09-19 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40486:
-

 Summary: Implement `spearman` and `kendall` in `DataFrame.corrwith`
 Key: SPARK-40486
 URL: https://issues.apache.org/jira/browse/SPARK-40486
 Project: Spark
  Issue Type: Improvement
  Components: ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40483) Add `CONNECT` label

2022-09-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40483.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37925
[https://github.com/apache/spark/pull/37925]

> Add `CONNECT` label
> ---
>
> Key: SPARK-40483
> URL: https://issues.apache.org/jira/browse/SPARK-40483
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40483) Add `CONNECT` label

2022-09-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40483:


Assignee: Hyukjin Kwon

> Add `CONNECT` label
> ---
>
> Key: SPARK-40483
> URL: https://issues.apache.org/jira/browse/SPARK-40483
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


