[jira] [Resolved] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7

2022-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38073.
---
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 35396
[https://github.com/apache/spark/pull/35396]

> NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
> ---
>
> Key: SPARK-38073
> URL: https://issues.apache.org/jira/browse/SPARK-38073
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with 
> Python >= 3.8, the function registered with atexit seems to be executed in a 
> different scope than in Python 3.7.
> It results in {{NameError: name 'sc' is not defined}} on exit:
> {code:python}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>       /_/
> Using Python version 3.8.12 (default, Oct 12 2021 21:57:06)
> Spark context Web UI available at http://192.168.0.198:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1643555855409).
> SparkSession available as 'spark'.
> In [1]:   
>   
> 
> Do you really want to exit ([y]/n)? y
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/path/to/spark/python/pyspark/shell.py", line 49, in 
> atexit.register(lambda: sc.stop())
> NameError: name 'sc' is not defined
> {code}
> This could be easily fixed by capturing the `sc` instance:
> {code:none}
> diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
> index f0c487877a..4164e3ab0c 100644
> --- a/python/pyspark/shell.py
> +++ b/python/pyspark/shell.py
> @@ -46,7 +46,7 @@ except Exception:
>  
>  sc = spark.sparkContext
>  sql = spark.sql
> -atexit.register(lambda: sc.stop())
> +atexit.register((lambda sc: lambda: sc.stop())(sc))
>  
>  # for compatibility
>  sqlContext = spark._wrapped
> {code}
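
As a standalone, hedged illustration of the scoping problem (this is not the
actual pyspark code path; FakeContext is a made-up stand-in), the snippet below
shows why a callback that looks up the name {{sc}} at exit time is fragile, and
how binding the object explicitly, as the patch above does, avoids the late name
lookup:

{code:python}
import atexit


class FakeContext:
    """Stand-in for SparkContext; only here to illustrate the scoping issue."""

    def stop(self):
        print("context stopped")


sc = FakeContext()

# Fragile pattern: the lambda resolves the *name* 'sc' only when it runs at
# interpreter exit.  If that name is no longer reachable in the scope where the
# lambda was defined (as reported above for IPython on Python >= 3.8), the
# callback raises NameError: name 'sc' is not defined.
# atexit.register(lambda: sc.stop())

# Robust pattern (the idea behind the patch): pass the current object into a
# factory so the inner callback closes over the argument, not the global name.
atexit.register((lambda sc: lambda: sc.stop())(sc))
{code}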



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487415#comment-17487415
 ] 

Apache Spark commented on SPARK-36665:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/35400

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification}} should be able to do more simplifications for Not 
> operators by applying the following rules (see the sketch after this list):
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}
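
These equivalences can be sanity-checked outside Spark. The following is a small
Python sketch (not Catalyst code; it merely models SQL's three-valued logic,
with None standing in for NULL) that verifies the three rules above:

{code:python}
from itertools import product


def sql_not(a):
    """Three-valued NOT: NOT NULL is NULL."""
    return None if a is None else (not a)


def sql_eq(a, b):
    """Three-valued equality: any comparison with NULL is NULL."""
    return None if a is None or b is None else (a == b)


def sql_neq(a, b):
    """Three-valued inequality: any comparison with NULL is NULL."""
    return None if a is None or b is None else (a != b)


vals = [True, False, None]

# Rule 1: Not(null) == null
assert sql_not(None) is None

# Rule 2: (Not(a) = b) == (a = Not(b))
assert all(sql_eq(sql_not(a), b) == sql_eq(a, sql_not(b))
           for a, b in product(vals, vals))

# Rule 3: (a != b) == (a = Not(b))
assert all(sql_neq(a, b) == sql_eq(a, sql_not(b))
           for a, b in product(vals, vals))

print("all three rewrite rules hold under three-valued logic")
{code}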



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487413#comment-17487413
 ] 

Apache Spark commented on SPARK-36665:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/35400

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification}} should be able to do more simplifications for Not 
> operators by applying the following rules:
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7

2022-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38073:
-

Assignee: Maciej Szymkiewicz

> NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
> ---
>
> Key: SPARK-38073
> URL: https://issues.apache.org/jira/browse/SPARK-38073
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with 
> Python >= 3.8, the function registered with atexit seems to be executed in a 
> different scope than in Python 3.7.
> It results in {{NameError: name 'sc' is not defined}} on exit:
> {code:python}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>       /_/
> Using Python version 3.8.12 (default, Oct 12 2021 21:57:06)
> Spark context Web UI available at http://192.168.0.198:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1643555855409).
> SparkSession available as 'spark'.
> In [1]:   
>   
> 
> Do you really want to exit ([y]/n)? y
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/path/to/spark/python/pyspark/shell.py", line 49, in 
> atexit.register(lambda: sc.stop())
> NameError: name 'sc' is not defined
> {code}
> This could be easily fixed by capturing the `sc` instance:
> {code:none}
> diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
> index f0c487877a..4164e3ab0c 100644
> --- a/python/pyspark/shell.py
> +++ b/python/pyspark/shell.py
> @@ -46,7 +46,7 @@ except Exception:
>  
>  sc = spark.sparkContext
>  sql = spark.sql
> -atexit.register(lambda: sc.stop())
> +atexit.register((lambda sc: lambda: sc.stop())(sc))
>  
>  # for compatibility
>  sqlContext = spark._wrapped
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7

2022-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38073:
--
Issue Type: Bug  (was: Improvement)

> NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
> ---
>
> Key: SPARK-38073
> URL: https://issues.apache.org/jira/browse/SPARK-38073
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with 
> Python >= 3.8, the function registered with atexit seems to be executed in a 
> different scope than in Python 3.7.
> It results in {{NameError: name 'sc' is not defined}} on exit:
> {code:python}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>       /_/
> Using Python version 3.8.12 (default, Oct 12 2021 21:57:06)
> Spark context Web UI available at http://192.168.0.198:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1643555855409).
> SparkSession available as 'spark'.
> In [1]:   
>   
> 
> Do you really want to exit ([y]/n)? y
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/path/to/spark/python/pyspark/shell.py", line 49, in 
> atexit.register(lambda: sc.stop())
> NameError: name 'sc' is not defined
> {code}
> This could be easily fixed by capturing the `sc` instance:
> {code:none}
> diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
> index f0c487877a..4164e3ab0c 100644
> --- a/python/pyspark/shell.py
> +++ b/python/pyspark/shell.py
> @@ -46,7 +46,7 @@ except Exception:
>  
>  sc = spark.sparkContext
>  sql = spark.sql
> -atexit.register(lambda: sc.stop())
> +atexit.register((lambda sc: lambda: sc.stop())(sc))
>  
>  # for compatibility
>  sqlContext = spark._wrapped
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38082) Update minimum numpy version to 1.15

2022-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38082:
--
Summary: Update minimum numpy version to 1.15  (was: Update minimum numpy 
version)

> Update minimum numpy version to 1.15
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}.
> However, 1.7 was released almost 9 years ago; since then, some methods that 
> we use have been deprecated in favor of new additions, and a new API 
> ({{numpy.typing}}), which is of some interest to us, has been added.
> We should update the minimum version requirement to one of the following:
> - {{>=1.9.0}} ‒ the minimum reasonable bound, which would allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ a reasonable bound that matches our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ required to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.
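
For illustration, a minimal sketch of what bumping the floor to 1.15.0 could
look like in a setup.py-style {{extras_require}} block (package name and extras
grouping here are hypothetical, not copied from Spark's actual setup.py):

{code:python}
from setuptools import setup

# Hypothetical packaging stub: only the numpy lower bound is the point here.
setup(
    name="example-package",
    version="0.1.0",
    packages=[],
    extras_require={
        # Raising the lower bound from >=1.7 to >=1.15.0, per the proposal.
        "ml": ["numpy>=1.15.0"],
        "mllib": ["numpy>=1.15.0"],
    },
)
{code}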



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38082) Update minimum numpy version

2022-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38082:
-

Assignee: Maciej Szymkiewicz

> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}.
> However, 1.7 was released almost 9 years ago; since then, some methods that 
> we use have been deprecated in favor of new additions, and a new API 
> ({{numpy.typing}}), which is of some interest to us, has been added.
> We should update the minimum version requirement to one of the following:
> - {{>=1.9.0}} ‒ the minimum reasonable bound, which would allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ a reasonable bound that matches our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ required to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38082) Update minimum numpy version

2022-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38082.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35398
[https://github.com/apache/spark/pull/35398]

> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}.
> However, 1.7 was released almost 9 years ago; since then, some methods that 
> we use have been deprecated in favor of new additions, and a new API 
> ({{numpy.typing}}), which is of some interest to us, has been added.
> We should update the minimum version requirement to one of the following:
> - {{>=1.9.0}} ‒ the minimum reasonable bound, which would allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ a reasonable bound that matches our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ required to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36837) Upgrade Kafka to 3.1.0

2022-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36837.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/34089

> Upgrade Kafka to 3.1.0
> --
>
> Key: SPARK-36837
> URL: https://issues.apache.org/jira/browse/SPARK-36837
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>
> Kafka 3.1.0 has official Java 17 support. We had better align with it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37408) Inline type hints for python/pyspark/ml/image.py

2022-02-04 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37408:
--

Assignee: Maciej Szymkiewicz  (was: Apache Spark)

> Inline type hints for python/pyspark/ml/image.py
> 
>
> Key: SPARK-37408
> URL: https://issues.apache.org/jira/browse/SPARK-37408
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/ml/image.pyi to 
> python/pyspark/ml/image.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2022-02-04 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487376#comment-17487376
 ] 

Sean R. Owen commented on SPARK-6305:
-

Spark doesn't use JDBCAppender, and doesn't use Chainsaw, so I don't believe 
either of those apply.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Tal Sliwowicz
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0
>
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37416) Inline type hints for python/pyspark/ml/wrapper.py

2022-02-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487359#comment-17487359
 ] 

Apache Spark commented on SPARK-37416:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35399

> Inline type hints for python/pyspark/ml/wrapper.py
> --
>
> Key: SPARK-37416
> URL: https://issues.apache.org/jira/browse/SPARK-37416
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/wrapper.pyi to 
> python/pyspark/ml/wrapper.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37416) Inline type hints for python/pyspark/ml/wrapper.py

2022-02-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487358#comment-17487358
 ] 

Apache Spark commented on SPARK-37416:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35399

> Inline type hints for python/pyspark/ml/wrapper.py
> --
>
> Key: SPARK-37416
> URL: https://issues.apache.org/jira/browse/SPARK-37416
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/wrapper.pyi to 
> python/pyspark/ml/wrapper.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37416) Inline type hints for python/pyspark/ml/wrapper.py

2022-02-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37416:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/ml/wrapper.py
> --
>
> Key: SPARK-37416
> URL: https://issues.apache.org/jira/browse/SPARK-37416
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/wrapper.pyi to 
> python/pyspark/ml/wrapper.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37416) Inline type hints for python/pyspark/ml/wrapper.py

2022-02-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37416:


Assignee: Apache Spark

> Inline type hints for python/pyspark/ml/wrapper.py
> --
>
> Key: SPARK-37416
> URL: https://issues.apache.org/jira/browse/SPARK-37416
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/ml/wrapper.pyi to 
> python/pyspark/ml/wrapper.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36986) Improving external schema management flexibility

2022-02-04 Thread Rodrigo Boavida (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rodrigo Boavida updated SPARK-36986:

Docs Text: 
Schema management improvements 
1 - Retrieving a field name and type from a schema based on its index

  was:
Schema management improvements 
1 - Retrieving a field name and type from a schema based on its index
2 - Allowing external DataSet schemas to be provided, as well as their 
externally generated rows.


> Improving external schema management flexibility
> 
>
> Key: SPARK-36986
> URL: https://issues.apache.org/jira/browse/SPARK-36986
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rodrigo Boavida
>Priority: Major
>
> Our Spark usage requires us to build an external schema and pass it on while 
> creating a DataSet.
> While working through this, I found a couple of optimizations that would 
> greatly improve Spark's flexibility to handle external schema management.
> Scope: the ability to retrieve a field's index and structure by name in a 
> single call, returning a tuple of (index, field).
> This means extending the StructType class to support an additional method.
> This is what the function would look like:
> /**
>  * Returns the index and field structure by name.
>  * If it doesn't find it, returns None.
>  * Avoids two client calls/loops to obtain consolidated field info.
>  */
> def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = {
>   val field = nameToField.get(name)
>   if (field.isDefined) {
>     Some((fieldIndex(name), field.get))
>   } else {
>     None
>   }
> }
> This is particularly useful from an efficiency perspective when we're 
> parsing a JSON structure and want to check, for every field, the name and 
> field type already defined in the schema.
> I will create a corresponding branch for PR review, assuming that there are 
> no concerns with the above proposal.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36986) Improving external schema management flexibility

2022-02-04 Thread Rodrigo Boavida (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rodrigo Boavida updated SPARK-36986:

Description: 
Our Spark usage requires us to build an external schema and pass it on while 
creating a DataSet.

While working through this, I found a couple of optimizations that would 
greatly improve Spark's flexibility to handle external schema management.

Scope: the ability to retrieve a field's index and structure by name in a 
single call, returning a tuple of (index, field).

This means extending the StructType class to support an additional method.

This is what the function would look like:

/**
 * Returns the index and field structure by name.
 * If it doesn't find it, returns None.
 * Avoids two client calls/loops to obtain consolidated field info.
 */
def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = {
  val field = nameToField.get(name)
  if (field.isDefined) {
    Some((fieldIndex(name), field.get))
  } else {
    None
  }
}

This is particularly useful from an efficiency perspective when we're parsing 
a JSON structure and want to check, for every field, the name and field type 
already defined in the schema.

I will create a corresponding branch for PR review, assuming that there are no 
concerns with the above proposal.

 

  was:
Our Spark usage requires us to build an external schema and pass it on while 
creating a DataSet.

While working through this, I found a couple of optimizations that would 
greatly improve Spark's flexibility to handle external schema management.

1 - The ability to retrieve a field's index and structure by name in a single 
call, returning a tuple of (index, field).

This means extending the StructType class to support an additional method.

This is what the function would look like:

/**
 * Returns the index and field structure by name.
 * If it doesn't find it, returns None.
 * Avoids two client calls/loops to obtain consolidated field info.
 */
def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = {
  val field = nameToField.get(name)
  if (field.isDefined) {
    Some((fieldIndex(name), field.get))
  } else {
    None
  }
}

This is particularly useful from an efficiency perspective when we're parsing 
a JSON structure and want to check, for every field, the name and field type 
already defined in the schema.

2 - Allowing a dataset to be created from a schema, passing the corresponding 
internal rows whose internal types map to the schema already defined 
externally. This allows Spark fields to be created from any data structure, 
without depending on Spark's internal conversions (in particular for JSON 
parsing), and improves performance by skipping the Catalyst converters' job of 
converting native Java types into Spark types.

This is what the function would look like:

/**
 * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This 
 * method allows the caller to create the InternalRow set externally, as well 
 * as define the schema externally.
 *
 * @since 3.3.0
 */
def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = {
  val attributes = schema.toAttributes
  val plan = LogicalRDD(attributes, data)(self)
  val qe = sessionState.executePlan(plan)
  qe.assertAnalyzed()
  new Dataset[Row](this, plan, RowEncoder(schema))
}

This is similar to this function:

def createDataFrame(rows: java.util.List[Row], schema: StructType): DataFrame

But it doesn't depend on Spark internally creating the RDD by inferring it, 
for example, from a JSON structure, which is not useful if we're managing the 
schema externally.

It also skips the Catalyst conversions and corresponding object overhead, 
making the internal row generation much more efficient by doing it explicitly 
in the caller.

I will create a corresponding branch for PR review, assuming that there are no 
concerns with the above proposals.

 


> Improving external schema management flexibility
> 
>
> Key: SPARK-36986
> URL: https://issues.apache.org/jira/browse/SPARK-36986
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rodrigo Boavida
>Priority: Major
>
> Our Spark usage requires us to build an external schema and pass it on while 
> creating a DataSet.
> While working through this, I found a couple of optimizations that would 
> greatly improve Spark's flexibility to handle external schema management.
> Scope: the ability to retrieve a field's index and structure by name in a 
> single call, returning a tuple of (index, field).
> This means extending the StructType class to support an additional method.
> This is what the function would look like:
> /**
>  * Returns the index and field structure by name.
>  * If it doesn't find it, returns None.
> 

[jira] [Updated] (SPARK-36986) Improving external schema management flexibility

2022-02-04 Thread Rodrigo Boavida (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rodrigo Boavida updated SPARK-36986:

Priority: Major  (was: Minor)

> Improving external schema management flexibility
> 
>
> Key: SPARK-36986
> URL: https://issues.apache.org/jira/browse/SPARK-36986
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rodrigo Boavida
>Priority: Major
>
> Our Spark usage requires us to build an external schema and pass it on while 
> creating a DataSet.
> While working through this, I found a couple of optimizations that would 
> greatly improve Spark's flexibility to handle external schema management.
> 1 - The ability to retrieve a field's index and structure by name in a 
> single call, returning a tuple of (index, field).
> This means extending the StructType class to support an additional method.
> This is what the function would look like:
> /**
>  * Returns the index and field structure by name.
>  * If it doesn't find it, returns None.
>  * Avoids two client calls/loops to obtain consolidated field info.
>  */
> def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = {
>   val field = nameToField.get(name)
>   if (field.isDefined) {
>     Some((fieldIndex(name), field.get))
>   } else {
>     None
>   }
> }
> This is particularly useful from an efficiency perspective when we're 
> parsing a JSON structure and want to check, for every field, the name and 
> field type already defined in the schema.
> 2 - Allowing a dataset to be created from a schema, passing the 
> corresponding internal rows whose internal types map to the schema already 
> defined externally. This allows Spark fields to be created from any data 
> structure, without depending on Spark's internal conversions (in particular 
> for JSON parsing), and improves performance by skipping the Catalyst 
> converters' job of converting native Java types into Spark types.
> This is what the function would look like:
> /**
>  * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This 
>  * method allows the caller to create the InternalRow set externally, as 
>  * well as define the schema externally.
>  *
>  * @since 3.3.0
>  */
> def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = {
>   val attributes = schema.toAttributes
>   val plan = LogicalRDD(attributes, data)(self)
>   val qe = sessionState.executePlan(plan)
>   qe.assertAnalyzed()
>   new Dataset[Row](this, plan, RowEncoder(schema))
> }
> This is similar to this function:
> def createDataFrame(rows: java.util.List[Row], schema: StructType): DataFrame
> But it doesn't depend on Spark internally creating the RDD by inferring it, 
> for example, from a JSON structure, which is not useful if we're managing 
> the schema externally.
> It also skips the Catalyst conversions and corresponding object overhead, 
> making the internal row generation much more efficient by doing it 
> explicitly in the caller.
>  
> I will create a corresponding branch for PR review, assuming that there are 
> no concerns with the above proposals.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-04 Thread kk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kk updated SPARK-38115:
---
Description: No default spark conf or param to control the '_temporary' 
path when writing to filesystem.  (was: There is default spark conf or param to 
control the '_temporary' path when writing to filesystem.)

> No spark conf to control the path of _temporary when writing to target 
> filesystem
> -
>
> Key: SPARK-38115
> URL: https://issues.apache.org/jira/browse/SPARK-38115
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.4.8, 3.2.1
>Reporter: kk
>Priority: Major
>  Labels: spark, spark-conf, spark-sql, spark-submit
>
> No default spark conf or param to control the '_temporary' path when writing 
> to filesystem.
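
For context, a hedged sketch of the behavior in question (the output path is
illustrative): when writing through the Hadoop FileOutputCommitter, task output
is staged under a _temporary directory inside the destination path, and there
is currently no Spark configuration that relocates that staging directory.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temporary-dir-demo").getOrCreate()

df = spark.range(10)

# While this write is running, task attempts are staged by the Hadoop output
# committer under a directory derived from the destination itself, e.g.
#   /tmp/demo-output/_temporary/0/_temporary/attempt_.../
# There is no Spark conf setting that points this staging area elsewhere.
df.write.mode("overwrite").parquet("/tmp/demo-output")

spark.stop()
{code}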



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-04 Thread Karthik (Jira)
Karthik created SPARK-38115:
---

 Summary: No spark conf to control the path of _temporary when 
writing to target filesystem
 Key: SPARK-38115
 URL: https://issues.apache.org/jira/browse/SPARK-38115
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core, Spark Shell, Spark Submit
Affects Versions: 3.2.1, 2.4.8
Reporter: Karthik


There is no default spark conf or param to control the '_temporary' path when 
writing to filesystem.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38114) Spark build fails in Windows

2022-02-04 Thread SOUVIK PAUL (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SOUVIK PAUL updated SPARK-38114:

Description: 
java.lang.NoSuchMethodError: 
org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream;
jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57)
jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27)

 

A similar issue is being faced by the quarkus project with the latest Maven. 

[https://github.com/quarkusio/quarkus/issues/19491]

 

Upgrading the scala-maven-plugin seems to resolve the issue, but this ticket 
can be a blocker:

https://issues.apache.org/jira/browse/SPARK-36547

  was:
java.lang.NoSuchMethodError: 
org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream;
jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57)
jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27)

 

A similar issue is being faced by the quarkus project with the latest Maven. 

[https://github.com/quarkusio/quarkus/issues/19491]

 

Upgrading the scala-maven-plugin seems to resolve the issue

https://issues.apache.org/jira/browse/SPARK-36547


> Spark build fails in Windows
> 
>
> Key: SPARK-38114
> URL: https://issues.apache.org/jira/browse/SPARK-38114
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3
>Reporter: SOUVIK PAUL
>Priority: Major
>
> java.lang.NoSuchMethodError: 
> org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream;
> jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57)
> jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27)
>  
> A similar issue is being faced by the quarkus project with the latest Maven. 
> [https://github.com/quarkusio/quarkus/issues/19491]
>  
> Upgrading the scala-maven-plugin seems to resolve the issue, but this ticket 
> can be a blocker:
> https://issues.apache.org/jira/browse/SPARK-36547



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38114) Spark build fails in Windows

2022-02-04 Thread SOUVIK PAUL (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SOUVIK PAUL updated SPARK-38114:

Description: 
java.lang.NoSuchMethodError: 
org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream;
jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57)
jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27)

 

A similar issue is being faced by the quarkus project with the latest Maven. 

[https://github.com/quarkusio/quarkus/issues/19491]

 

Upgrading the scala-maven-plugin seems to resolve the issue

https://issues.apache.org/jira/browse/SPARK-36547

  was:
java.lang.NoSuchMethodError: 
org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream;
jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57)
jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27)

 

A similar issue is being faced by the quarkus project with the latest Maven. 

https://github.com/quarkusio/quarkus/issues/19491


> Spark build fails in Windows
> 
>
> Key: SPARK-38114
> URL: https://issues.apache.org/jira/browse/SPARK-38114
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3
>Reporter: SOUVIK PAUL
>Priority: Major
>
> java.lang.NoSuchMethodError: 
> org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream;
> jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57)
> jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27)
>  
> A similar issue is being faced by the quarkus project with the latest Maven. 
> [https://github.com/quarkusio/quarkus/issues/19491]
>  
> Upgrading the scala-maven-plugin seems to resolve the issue
> https://issues.apache.org/jira/browse/SPARK-36547



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-02-04 Thread John Crowe (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487287#comment-17487287
 ] 

John Crowe commented on SPARK-37814:


Did you use some sort of title in your message so that they know you're also a 
dev and have customers of your own?

Regards;
John Crowe



> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2022-02-04 Thread James Inlow (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487286#comment-17487286
 ] 

James Inlow commented on SPARK-6305:


As we wait for Spark to be released with log4j v2, how can we know whether 
Spark is affected by any other more recent CVEs impacting log4j 1.x?

Specifically:
 *  [https://nvd.nist.gov/vuln/detail/CVE-2022-23307]
 * [https://nvd.nist.gov/vuln/detail/CVE-2022-23305]

I'm not sure this is the correct platform to ask these questions.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Tal Sliwowicz
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0
>
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-02-04 Thread Stephen L. De Rudder (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487285#comment-17487285
 ] 

Stephen L. De Rudder commented on SPARK-37814:
--

With the log4j 1.x line having several CVEs reported against it too, please 
consider doing one (or both) of the following:
 * Consider porting this to the 3.2 line and releasing a Spark 3.2.2 to address 
the log4j CVEs sooner
 * Consider expediting the 3.3.0 release to address the log4j CVEs

Log4j 1.x CVEs info:
[logging-log4j1/README.md at main · apache/logging-log4j1 · 
GitHub|https://github.com/apache/logging-log4j1/blob/main/README.md#unfixed-vulnerabilities]

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit

2022-02-04 Thread James Inlow (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487279#comment-17487279
 ] 

James Inlow commented on SPARK-37630:
-

[~pj.fanning] Thanks, I have seen that Spark has switched to log4j v2, but since 
the release won't be out for a few months, I am looking for a way to identify 
whether the current release of Spark is OK as new CVEs relating to log4j v1 are 
released in the meantime.

> Security issue from Log4j 1.X exploit
> -
>
> Key: SPARK-37630
> URL: https://issues.apache.org/jira/browse/SPARK-37630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Ismail H
>Priority: Major
>  Labels: security
>
> log4j is being used in version [1.2.17|#L122]]
>  
> This version has been deprecated and since [then has a known issue that 
> hasn't been addressed in 1.X 
> versions|https://www.cvedetails.com/cve/CVE-2019-17571/].
>  
> *Solution:*
>  * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last 
> known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38098) Add support for ArrayType of nested StructType to arrow-based conversion

2022-02-04 Thread Luca Canali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-38098:

Summary: Add support for ArrayType of nested StructType to arrow-based 
conversion  (was: Support Array of Struct for Pandas UDFs)

> Add support for ArrayType of nested StructType to arrow-based conversion
> 
>
> Key: SPARK-38098
> URL: https://issues.apache.org/jira/browse/SPARK-38098
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Luca Canali
>Priority: Minor
>
> This is to allow Pandas UDFs (and mapInArrow UDFs) to operate on columns of 
> type Array of Struct via arrow serialization.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38098) Add support for ArrayType of nested StructType to arrow-based conversion

2022-02-04 Thread Luca Canali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-38098:

Description: 
This proposes to add support for ArrayType of nested StructType to arrow-based 
conversion.
This allows Pandas UDFs, mapInArrow UDFs, and toPandas to operate on columns of 
type Array of Struct, via arrow serialization.

  was:This is to allow Pandas UDFs (and mapInArrow UDFs) to operate on columns 
of type Array of Struct via arrow serialization.


> Add support for ArrayType of nested StructType to arrow-based conversion
> 
>
> Key: SPARK-38098
> URL: https://issues.apache.org/jira/browse/SPARK-38098
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to add support for ArrayType of nested StructType to 
> arrow-based conversion.
> This allows Pandas UDFs, mapInArrow UDFs, and toPandas to operate on columns 
> of type Array of Struct, via arrow serialization.
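
To make the intent concrete, here is a hedged sketch of the kind of code this
change is meant to enable (column name, schema, and data are hypothetical;
without the proposed support, the arrow-based conversion of the array-of-struct
column is rejected or falls back to the non-Arrow path):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("array-of-struct-demo").getOrCreate()

# A column of type array<struct<x: bigint, y: bigint>>.
df = spark.createDataFrame(
    [([{"x": 1, "y": 2}, {"x": 3, "y": 4}],), ([{"x": 5, "y": 6}],)],
    "points array<struct<x: bigint, y: bigint>>",
)

# Series-to-series pandas UDF: once array-of-struct is supported by the
# arrow-based conversion, each value arrives as a list-like of struct entries.
@pandas_udf("bigint")
def num_points(points: pd.Series) -> pd.Series:
    return points.apply(len)

df.select(num_points("points").alias("n")).show()

spark.stop()
{code}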



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit

2022-02-04 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487275#comment-17487275
 ] 

PJ Fanning commented on SPARK-37630:


[~jinlow] there is little point commenting on this closed issue - please look 
at https://issues.apache.org/jira/browse/SPARK-6305 - this issue is marked as a 
duplicate of that and progress has been made on the switch to log4jv2

> Security issue from Log4j 1.X exploit
> -
>
> Key: SPARK-37630
> URL: https://issues.apache.org/jira/browse/SPARK-37630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Ismail H
>Priority: Major
>  Labels: security
>
> log4j is being used in version [1.2.17|#L122]]
>  
> This version has been deprecated and since [then has a known issue that 
> hasn't been addressed in 1.X 
> versions|https://www.cvedetails.com/cve/CVE-2019-17571/].
>  
> *Solution:*
>  * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last 
> known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit

2022-02-04 Thread James Inlow (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487271#comment-17487271
 ] 

James Inlow commented on SPARK-37630:
-

How can we know whether Spark is impacted by any other more recent CVEs 
affecting log4j 1.x?

Specifically:
 *  [https://nvd.nist.gov/vuln/detail/CVE-2022-23307]
 * [https://nvd.nist.gov/vuln/detail/CVE-2022-23305]

I'm not sure this is the correct platform to ask these questions.

 

> Security issue from Log4j 1.X exploit
> -
>
> Key: SPARK-37630
> URL: https://issues.apache.org/jira/browse/SPARK-37630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Ismail H
>Priority: Major
>  Labels: security
>
> log4j is being used in version [1.2.17|#L122]]
>  
> This version has been deprecated and since [then has a known issue that 
> hasn't been addressed in 1.X 
> versions|https://www.cvedetails.com/cve/CVE-2019-17571/].
>  
> *Solution:*
>  * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last 
> known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38114) Spark build fails in Windows

2022-02-04 Thread SOUVIK PAUL (Jira)
SOUVIK PAUL created SPARK-38114:
---

 Summary: Spark build fails in Windows
 Key: SPARK-38114
 URL: https://issues.apache.org/jira/browse/SPARK-38114
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.3
Reporter: SOUVIK PAUL


java.lang.NoSuchMethodError: 
org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream;
jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57)
jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27)

 

A similar issue is being faced by the quarkus project with the latest Maven. 

https://github.com/quarkusio/quarkus/issues/19491



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38101) MetadataFetchFailedException due to decommission block migrations

2022-02-04 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487258#comment-17487258
 ] 

L. C. Hsieh commented on SPARK-38101:
-

Thanks for reporting this issue, [~eejbyfeldt].



> MetadataFetchFailedException due to decommission block migrations
> -
>
> Key: SPARK-38101
> URL: https://issues.apache.org/jira/browse/SPARK-38101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.1.3, 3.2.1, 3.3.0, 3.2.2
>Reporter: Emil Ejbyfeldt
>Priority: Major
>
> As noted in SPARK-34939, there is a race when using broadcast for map output 
> status. Explanation from SPARK-34939:
> > After map statuses are broadcasted and the executors obtain serialized 
> > broadcasted map statuses. If any fetch failure happens after, Spark 
> > scheduler invalidates cached map statuses and destroy broadcasted value of 
> > the map statuses. Then any executor trying to deserialize serialized 
> > broadcasted map statuses and access broadcasted value, IOException will be 
> > thrown. Currently we don't catch it in MapOutputTrackerWorker and above 
> > exception will fail the application.
> But if running with `spark.decommission.enabled=true` and 
> `spark.storage.decommission.shuffleBlocks.enabled=true` there is another way 
> to hit this race, when a node is decommissioning and the shuffle blocks are 
> migrated. After a block has been migrated an update will be sent to the 
> driver for each block and the map output caches will be invalidated.
> Here is a driver log from when we hit the race condition, running with Spark 3.2.0:
> {code:java}
> 2022-01-28 03:20:12,409 INFO memory.MemoryStore: Block broadcast_27 stored as 
> values in memory (estimated size 5.5 MiB, free 11.0 GiB)
> 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for 
> 192108 to BlockManagerId(760, ip-10-231-63-204.ec2.internal, 34707, None)
> 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for 
> 179529 to BlockManagerId(743, ip-10-231-34-160.ec2.internal, 44225, None)
> 2022-01-28 03:20:12,414 INFO spark.ShuffleStatus: Updating map output for 
> 187194 to BlockManagerId(761, ip-10-231-43-219.ec2.internal, 39943, None)
> 2022-01-28 03:20:12,415 INFO spark.ShuffleStatus: Updating map output for 
> 190303 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None)
> 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for 
> 192220 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None)
> 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for 
> 182306 to BlockManagerId(688, ip-10-231-43-41.ec2.internal, 35967, None)
> 2022-01-28 03:20:12,417 INFO spark.ShuffleStatus: Updating map output for 
> 190387 to BlockManagerId(772, ip-10-231-55-173.ec2.internal, 35523, None)
> 2022-01-28 03:20:12,417 INFO memory.MemoryStore: Block broadcast_27_piece0 
> stored as bytes in memory (estimated size 4.0 MiB, free 10.9 GiB)
> 2022-01-28 03:20:12,417 INFO storage.BlockManagerInfo: Added 
> broadcast_27_piece0 in memory on ip-10-231-63-1.ec2.internal:34761 (size: 4.0 
> MiB, free: 11.0 GiB)
> 2022-01-28 03:20:12,418 INFO memory.MemoryStore: Block broadcast_27_piece1 
> stored as bytes in memory (estimated size 1520.4 KiB, free 10.9 GiB)
> 2022-01-28 03:20:12,418 INFO storage.BlockManagerInfo: Added 
> broadcast_27_piece1 in memory on ip-10-231-63-1.ec2.internal:34761 (size: 
> 1520.4 KiB, free: 11.0 GiB)
> 2022-01-28 03:20:12,418 INFO spark.MapOutputTracker: Broadcast outputstatuses 
> size = 416, actual size = 5747443
> 2022-01-28 03:20:12,419 INFO spark.ShuffleStatus: Updating map output for 
> 153389 to BlockManagerId(154, ip-10-231-42-104.ec2.internal, 44717, None)
> 2022-01-28 03:20:12,419 INFO broadcast.TorrentBroadcast: Destroying 
> Broadcast(27) (from updateMapOutput at BlockManagerMasterEndpoint.scala:594)
> 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Added rdd_65_20310 on 
> disk on ip-10-231-32-25.ec2.internal:40657 (size: 77.6 MiB)
> 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Removed 
> broadcast_27_piece0 on ip-10-231-63-1.ec2.internal:34761 in memory (size: 4.0 
> MiB, free: 11.0 GiB)
> {code}
> While the Broadcast is being constructed we have updates coming in and the 
> broadcast is destroyed almost immediately. On this particular job we ended up 
> hitting the race condition a lot of times and it caused ~18 task failures and 
> stage retries within 20 seconds causing us to hit our stage retry limit and 
> the job to fail.
> As far as I understand, this was the expected behavior for handling this case 
> after SPARK-34939. But it seems that, when combined with decommissioning, 
> hitting the race is a bit too common.
> We have observed this behavior running 3.2.0 and 3.2.1, 

[jira] [Updated] (SPARK-38101) MetadataFetchFailedException due to decommission block migrations

2022-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38101:
--
Affects Version/s: 3.3.0
   (was: 3.3)

> MetadataFetchFailedException due to decommission block migrations
> -
>
> Key: SPARK-38101
> URL: https://issues.apache.org/jira/browse/SPARK-38101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.1.3, 3.2.1, 3.3.0, 3.2.2
>Reporter: Emil Ejbyfeldt
>Priority: Major
>
> As noted in SPARK-34939, there is a race when using broadcast for map output 
> status. Explanation from SPARK-34939:
> > After map statuses are broadcasted and the executors obtain serialized 
> > broadcasted map statuses. If any fetch failure happens after, Spark 
> > scheduler invalidates cached map statuses and destroy broadcasted value of 
> > the map statuses. Then any executor trying to deserialize serialized 
> > broadcasted map statuses and access broadcasted value, IOException will be 
> > thrown. Currently we don't catch it in MapOutputTrackerWorker and above 
> > exception will fail the application.
> But if running with `spark.decommission.enabled=true` and 
> `spark.storage.decommission.shuffleBlocks.enabled=true` there is another way 
> to hit this race, when a node is decommissioning and the shuffle blocks are 
> migrated. After a block has been migrated an update will be sent to the 
> driver for each block and the map output caches will be invalidated.
> Here is a driver log from when we hit the race condition, running with Spark 3.2.0:
> {code:java}
> 2022-01-28 03:20:12,409 INFO memory.MemoryStore: Block broadcast_27 stored as 
> values in memory (estimated size 5.5 MiB, free 11.0 GiB)
> 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for 
> 192108 to BlockManagerId(760, ip-10-231-63-204.ec2.internal, 34707, None)
> 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for 
> 179529 to BlockManagerId(743, ip-10-231-34-160.ec2.internal, 44225, None)
> 2022-01-28 03:20:12,414 INFO spark.ShuffleStatus: Updating map output for 
> 187194 to BlockManagerId(761, ip-10-231-43-219.ec2.internal, 39943, None)
> 2022-01-28 03:20:12,415 INFO spark.ShuffleStatus: Updating map output for 
> 190303 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None)
> 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for 
> 192220 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None)
> 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for 
> 182306 to BlockManagerId(688, ip-10-231-43-41.ec2.internal, 35967, None)
> 2022-01-28 03:20:12,417 INFO spark.ShuffleStatus: Updating map output for 
> 190387 to BlockManagerId(772, ip-10-231-55-173.ec2.internal, 35523, None)
> 2022-01-28 03:20:12,417 INFO memory.MemoryStore: Block broadcast_27_piece0 
> stored as bytes in memory (estimated size 4.0 MiB, free 10.9 GiB)
> 2022-01-28 03:20:12,417 INFO storage.BlockManagerInfo: Added 
> broadcast_27_piece0 in memory on ip-10-231-63-1.ec2.internal:34761 (size: 4.0 
> MiB, free: 11.0 GiB)
> 2022-01-28 03:20:12,418 INFO memory.MemoryStore: Block broadcast_27_piece1 
> stored as bytes in memory (estimated size 1520.4 KiB, free 10.9 GiB)
> 2022-01-28 03:20:12,418 INFO storage.BlockManagerInfo: Added 
> broadcast_27_piece1 in memory on ip-10-231-63-1.ec2.internal:34761 (size: 
> 1520.4 KiB, free: 11.0 GiB)
> 2022-01-28 03:20:12,418 INFO spark.MapOutputTracker: Broadcast outputstatuses 
> size = 416, actual size = 5747443
> 2022-01-28 03:20:12,419 INFO spark.ShuffleStatus: Updating map output for 
> 153389 to BlockManagerId(154, ip-10-231-42-104.ec2.internal, 44717, None)
> 2022-01-28 03:20:12,419 INFO broadcast.TorrentBroadcast: Destroying 
> Broadcast(27) (from updateMapOutput at BlockManagerMasterEndpoint.scala:594)
> 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Added rdd_65_20310 on 
> disk on ip-10-231-32-25.ec2.internal:40657 (size: 77.6 MiB)
> 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Removed 
> broadcast_27_piece0 on ip-10-231-63-1.ec2.internal:34761 in memory (size: 4.0 
> MiB, free: 11.0 GiB)
> {code}
> While the Broadcast is being constructed, we have updates coming in and the 
> broadcast is destroyed almost immediately. On this particular job we hit the 
> race condition many times, causing ~18 task failures and stage retries within 
> 20 seconds; that pushed us over our stage retry limit and failed the job.
> As far as I understand, this was the expected behavior for handling this case 
> after SPARK-34939. But it seems that, combined with decommissioning, hitting 
> the race is a bit too common.
> We have observed this behavior running 3.2.0 and 3.2.1, but I think other 
> 
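
As context for the settings named above, here is a minimal PySpark sketch that enables the decommissioning features involved in this race. Only the two configs quoted in the report are set; the app name is an assumption for illustration, and this is not a reproduction of the failure itself:

{code:python}
from pyspark.sql import SparkSession

# Sketch only: enables the decommissioning features referenced in the report.
# "decommission-race-context" is an assumed app name.
spark = (
    SparkSession.builder
    .appName("decommission-race-context")
    .config("spark.decommission.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .getOrCreate()
)
{code}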

[jira] [Commented] (SPARK-38101) MetadataFetchFailedException due to decommission block migrations

2022-02-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487246#comment-17487246
 ] 

Dongjoon Hyun commented on SPARK-38101:
---

Thank you for filing a JIRA, [~eejbyfeldt].

> MetadataFetchFailedException due to decommission block migrations
> -
>
> Key: SPARK-38101
> URL: https://issues.apache.org/jira/browse/SPARK-38101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.1.3, 3.2.1, 3.2.2, 3.3
>Reporter: Emil Ejbyfeldt
>Priority: Major
>
> As noted in SPARK-34939, there is a race when using broadcast for map output 
> status. Explanation from SPARK-34939:
> > After map statuses are broadcasted and the executors obtain serialized 
> > broadcasted map statuses. If any fetch failure happens after, Spark 
> > scheduler invalidates cached map statuses and destroy broadcasted value of 
> > the map statuses. Then any executor trying to deserialize serialized 
> > broadcasted map statuses and access broadcasted value, IOException will be 
> > thrown. Currently we don't catch it in MapOutputTrackerWorker and above 
> > exception will fail the application.
> But when running with `spark.decommission.enabled=true` and 
> `spark.storage.decommission.shuffleBlocks.enabled=true` there is another way 
> to hit this race: when a node is decommissioning and its shuffle blocks are 
> migrated. After a block has been migrated, an update is sent to the 
> driver for each block and the map output caches are invalidated.
> Here are the driver logs from when we hit the race condition, running with Spark 3.2.0:
> {code:java}
> 2022-01-28 03:20:12,409 INFO memory.MemoryStore: Block broadcast_27 stored as 
> values in memory (estimated size 5.5 MiB, free 11.0 GiB)
> 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for 
> 192108 to BlockManagerId(760, ip-10-231-63-204.ec2.internal, 34707, None)
> 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for 
> 179529 to BlockManagerId(743, ip-10-231-34-160.ec2.internal, 44225, None)
> 2022-01-28 03:20:12,414 INFO spark.ShuffleStatus: Updating map output for 
> 187194 to BlockManagerId(761, ip-10-231-43-219.ec2.internal, 39943, None)
> 2022-01-28 03:20:12,415 INFO spark.ShuffleStatus: Updating map output for 
> 190303 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None)
> 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for 
> 192220 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None)
> 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for 
> 182306 to BlockManagerId(688, ip-10-231-43-41.ec2.internal, 35967, None)
> 2022-01-28 03:20:12,417 INFO spark.ShuffleStatus: Updating map output for 
> 190387 to BlockManagerId(772, ip-10-231-55-173.ec2.internal, 35523, None)
> 2022-01-28 03:20:12,417 INFO memory.MemoryStore: Block broadcast_27_piece0 
> stored as bytes in memory (estimated size 4.0 MiB, free 10.9 GiB)
> 2022-01-28 03:20:12,417 INFO storage.BlockManagerInfo: Added 
> broadcast_27_piece0 in memory on ip-10-231-63-1.ec2.internal:34761 (size: 4.0 
> MiB, free: 11.0 GiB)
> 2022-01-28 03:20:12,418 INFO memory.MemoryStore: Block broadcast_27_piece1 
> stored as bytes in memory (estimated size 1520.4 KiB, free 10.9 GiB)
> 2022-01-28 03:20:12,418 INFO storage.BlockManagerInfo: Added 
> broadcast_27_piece1 in memory on ip-10-231-63-1.ec2.internal:34761 (size: 
> 1520.4 KiB, free: 11.0 GiB)
> 2022-01-28 03:20:12,418 INFO spark.MapOutputTracker: Broadcast outputstatuses 
> size = 416, actual size = 5747443
> 2022-01-28 03:20:12,419 INFO spark.ShuffleStatus: Updating map output for 
> 153389 to BlockManagerId(154, ip-10-231-42-104.ec2.internal, 44717, None)
> 2022-01-28 03:20:12,419 INFO broadcast.TorrentBroadcast: Destroying 
> Broadcast(27) (from updateMapOutput at BlockManagerMasterEndpoint.scala:594)
> 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Added rdd_65_20310 on 
> disk on ip-10-231-32-25.ec2.internal:40657 (size: 77.6 MiB)
> 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Removed 
> broadcast_27_piece0 on ip-10-231-63-1.ec2.internal:34761 in memory (size: 4.0 
> MiB, free: 11.0 GiB)
> {code}
> While the Broadcast is being constructed, we have updates coming in and the 
> broadcast is destroyed almost immediately. On this particular job we hit the 
> race condition many times, causing ~18 task failures and stage retries within 
> 20 seconds; that pushed us over our stage retry limit and failed the job.
> As far as I understand, this was the expected behavior for handling this case 
> after SPARK-34939. But it seems that, combined with decommissioning, hitting 
> the race is a bit too common.
> We have observed this behavior running 3.2.0 and 3.2.1, but I 

[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2022-02-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487244#comment-17487244
 ] 

Dongjoon Hyun commented on SPARK-6305:
--

This is part of Apache Spark 3.3, and we will start voting on the release 
candidate in April. Please see the community release plan:
- https://spark.apache.org/versioning-policy.html

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Tal Sliwowicz
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0
>
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-04 Thread Anton Okolnychyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487235#comment-17487235
 ] 

Anton Okolnychyi commented on SPARK-36665:
--

Thanks for the prompt reply, [~kazuyukitanimura]!

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification should be able to do more simplifications for Not 
> operators by applying the following rules}}
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}
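
To make the three rewrites above concrete, here is a small, self-contained Python check of the equivalences under SQL-style three-valued logic, with None standing in for NULL. This is an illustrative sketch of the algebra only, not the Catalyst implementation:

{code:python}
from itertools import product

def sql_not(a):
    # SQL NOT: NOT NULL is NULL
    return None if a is None else (not a)

def sql_eq(a, b):
    # SQL '=': comparing anything with NULL yields NULL
    return None if a is None or b is None else (a == b)

def sql_neq(a, b):
    # SQL '!=' is NOT of '='
    return sql_not(sql_eq(a, b))

vals = [True, False, None]

# Rule 1: Not(null) == null, so IsNull(Not(a)) is equivalent to IsNull(a)
assert all((sql_not(a) is None) == (a is None) for a in vals)

# Rule 2: (Not(a) = b) == (a = Not(b))
assert all(sql_eq(sql_not(a), b) == sql_eq(a, sql_not(b)) for a, b in product(vals, vals))

# Rule 3: (a != b) == (a = Not(b))
assert all(sql_neq(a, b) == sql_eq(a, sql_not(b)) for a, b in product(vals, vals))

print("All three rewrite rules hold under three-valued logic")
{code}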



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38113) Use error classes in the execution errors of pivoting

2022-02-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-38113:


 Summary: Use error classes in the execution errors of pivoting
 Key: SPARK-38113
 URL: https://issues.apache.org/jira/browse/SPARK-38113
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Migrate the following errors in QueryExecutionErrors:
* repeatedPivotsUnsupportedError
* pivotNotAfterGroupByUnsupportedError

to use error classes. Throw an implementation of SparkThrowable. Also write a 
test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling

2022-02-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-38112:


 Summary: Use error classes in the execution errors of 
date/timestamp handling
 Key: SPARK-38112
 URL: https://issues.apache.org/jira/browse/SPARK-38112
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Migrate the following errors in QueryExecutionErrors:
* sparkUpgradeInReadingDatesError
* sparkUpgradeInWritingDatesError
* timeZoneIdNotSpecifiedForTimestampTypeError
* cannotConvertOrcTimestampToTimestampNTZError

to use error classes. Throw an implementation of SparkThrowable. Also write a 
test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-04 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487222#comment-17487222
 ] 

Kazuyuki Tanimura commented on SPARK-36665:
---

Understood, thank you [~aokolnychyi] 

I am preparing a fix. I am sorry for the inconvenience.

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification should be able to do more simplifications for Not 
> operators by applying the following rules}}
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36665) Add more Not operator optimizations

2022-02-04 Thread Anton Okolnychyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487200#comment-17487200
 ] 

Anton Okolnychyi edited comment on SPARK-36665 at 2/4/22, 5:47 PM:
---

[~kazuyukitanimura] {{RewritePredicateSubquery}} still rewrites the predicate 
subquery, but it is an IN subquery instead of NOT IN. In SQL, NOT IN subqueries 
have to be treated in a special way. As a result, we are getting wrong query 
results right now.


was (Author: aokolnychyi):
[~kazuyukitanimura] {{RewritePredicateSubquery}} still rewrites the predicate 
subquery but it is IN subquery instead of NOT IN. In SQL, NOT IN subquery have 
to be treated in a special way. As a result, we are getting wrong query results 
right now.

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification should be able to do more simplifications for Not 
> operators by applying the following rules}}
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-04 Thread Anton Okolnychyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487200#comment-17487200
 ] 

Anton Okolnychyi commented on SPARK-36665:
--

[~kazuyukitanimura] {{RewritePredicateSubquery}} still rewrites the predicate 
subquery, but it is an IN subquery instead of NOT IN. In SQL, NOT IN subqueries have 
to be treated in a special way. As a result, we are getting wrong query results 
right now.

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification should be able to do more simplifications for Not 
> operators by applying the following rules}}
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38111) Retrieve a Spark dataframe as Arrow batches

2022-02-04 Thread Fabien (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabien updated SPARK-38111:
---
Labels: arrow  (was: )

> Retrieve a Spark dataframe as Arrow batches
> ---
>
> Key: SPARK-38111
> URL: https://issues.apache.org/jira/browse/SPARK-38111
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 3.2.0
> Environment: Java 11
> Spark 3
>Reporter: Fabien
>Priority: Minor
>  Labels: arrow
>
> Using the Java API, is there a way to efficiently retrieve a dataframe as 
> Arrow batches?
> I have a pretty large dataset on my cluster, so I cannot collect it using 
> [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--],
> which downloads everything at once and saturates my JVM memory.
> Seeing that Arrow is becoming a standard for transferring large datasets and 
> that Spark uses Arrow a lot, is there a way to transfer my Spark dataframe as 
> Arrow batches?
> This would be ideal to process the data batch by batch and avoid saturating 
> the memory.
>  
> I am looking for an API like this (in Java)
>  
> {code:java}
> var stream = dataframe.collectAsArrowStream()
> while (stream.hasNextBatch()) {
> var batch = stream.getNextBatch()
> // do some stuff with the arrow batch
> }
> {code}
> It would be even better if I could split the dataframe into several streams so 
> I can download and process them in parallel.
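
The Java API does not currently expose a collectAsArrowStream-style method. As a partially related sketch only (PySpark rather than Java, and relying on the DataFrame.mapInArrow API added in Spark 3.3.0), the data can at least be processed as pyarrow record batches, partition by partition, without collecting everything at once:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).toDF("id")  # stand-in for the large dataframe

def process_batches(batches):
    # 'batches' is an iterator of pyarrow.RecordBatch for one partition;
    # do some stuff with each batch and yield it (possibly transformed).
    for batch in batches:
        yield batch

# Batches are handled on the executors, batch by batch, so neither the driver
# nor any single executor has to hold the whole dataset in memory at once.
processed = df.mapInArrow(process_batches, schema=df.schema)
processed.write.mode("overwrite").parquet("/tmp/processed")  # assumed output path
{code}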



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38111) Retrieve a Spark dataframe as Arrow batches

2022-02-04 Thread Fabien (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabien updated SPARK-38111:
---
Description: 
Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow 
batches?

I have a pretty large dataset on my cluster, so I cannot collect it using 
[collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--],
which downloads everything at once and saturates my JVM memory.

Seeing that Arrow is becoming a standard for transferring large datasets and that 
Spark uses Arrow a lot, is there a way to transfer my Spark dataframe as 
Arrow batches?

This would be ideal to process the data batch by batch and avoid saturating 
the memory.
 

I am looking for an API like this (in Java)

 
{code:java}
var stream = dataframe.collectAsArrowStream()
while (stream.hasNextBatch()) {
var batch = stream.getNextBatch()
// do some stuff with the arrow batch
}
{code}

It would be even better if I could split the dataframe into several streams so I 
can download and process them in parallel.

  was:
Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow 
batches ?

I have a pretty large dataset on my cluster so I cannot collect it using 
[collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--]
 which download every thing at once and saturate the my JVM memory

Seeing that Arrow is becoming a standard to transfer large datasets and that 
Spark uses a lot Arrow, is there a way to transfer my Spark dataframe with 
Arrow batches ?

This would be ideal to process the data batch per batch and avoid saturating 
the memory.
 

I am looking for an API like this (in Java)

 
{code:java}
var stream = dataframe.collectAsArrowStream()
while (stream.hasNextBatch()) {
var batch = stream.getNextBatch()
// do some stuff with the arrow batch
}
{code}

It would be even better if I can split the dataframe into several streams so I 
can download and process it in parallel


> Retrieve a Spark dataframe as Arrow batches
> ---
>
> Key: SPARK-38111
> URL: https://issues.apache.org/jira/browse/SPARK-38111
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 3.2.0
> Environment: Java 11
> Spark 3
>Reporter: Fabien
>Priority: Minor
>
> Using the Java API, is there a way to efficiently retrieve a dataframe as 
> Arrow batches?
> I have a pretty large dataset on my cluster, so I cannot collect it using 
> [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--],
> which downloads everything at once and saturates my JVM memory.
> Seeing that Arrow is becoming a standard for transferring large datasets and 
> that Spark uses Arrow a lot, is there a way to transfer my Spark dataframe as 
> Arrow batches?
> This would be ideal to process the data batch by batch and avoid saturating 
> the memory.
>  
> I am looking for an API like this (in Java)
>  
> {code:java}
> var stream = dataframe.collectAsArrowStream()
> while (stream.hasNextBatch()) {
> var batch = stream.getNextBatch()
> // do some stuff with the arrow batch
> }
> {code}
> It would be even better if I could split the dataframe into several streams so 
> I can download and process them in parallel.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38111) Retrieve a Spark dataframe as Arrow batches

2022-02-04 Thread Fabien (Jira)
Fabien created SPARK-38111:
--

 Summary: Retrieve a Spark dataframe as Arrow batches
 Key: SPARK-38111
 URL: https://issues.apache.org/jira/browse/SPARK-38111
 Project: Spark
  Issue Type: Question
  Components: Java API
Affects Versions: 3.2.0
 Environment: Java 11

Spark 3
Reporter: Fabien


Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow 
batches?

I have a pretty large dataset on my cluster, so I cannot collect it using 
[collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--],
which downloads everything at once and saturates my JVM memory.

Seeing that Arrow is becoming a standard for transferring large datasets and that 
Spark uses Arrow a lot, is there a way to transfer my Spark dataframe as 
Arrow batches?

This would be ideal to process the data batch by batch and avoid saturating 
the memory.
 

I am looking for an API like this (in Java)

 
{code:java}
var stream = dataframe.collectAsArrowStream()
while (stream.hasNextBatch()) {
var batch = stream.getNextBatch()
// do some stuff with the arrow batch
}
{code}

It would be even better if I could split the dataframe into several streams so I 
can download and process them in parallel.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38082) Update minimum numpy version

2022-02-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38082:


Assignee: (was: Apache Spark)

> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we set the numpy version in {{extras_require}} to be {{>=1.7}}.
> However, 1.7 was released almost 9 years ago; since then, some methods that 
> we use have been deprecated in favor of new additions, and a new API 
> ({{numpy.typing}}) that is of some interest to us has been added.
> We should update the minimum version requirement to one of the following:
> - {{>=1.9.0}} ‒ the minimum reasonable bound, which will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ a reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.
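
As a small illustration of what the bump would allow (a sketch only; the function name is made up, and numpy.typing is assumed to be available, i.e. numpy >= 1.20):

{code:python}
import numpy as np
import numpy.typing as npt  # the numpy.typing module mentioned above (numpy >= 1.20)

def to_wire(arr: npt.ArrayLike) -> bytes:
    # tobytes() (available since numpy 1.9) replaces the deprecated tostring() alias.
    return np.asarray(arr).tobytes()

print(len(to_wire([1.0, 2.0, 3.0])))  # 24 bytes for three float64 values
{code}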



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38082) Update minimum numpy version

2022-02-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38082:


Assignee: Apache Spark

> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Currently, we set the numpy version in {{extras_require}} to be {{>=1.7}}.
> However, 1.7 was released almost 9 years ago; since then, some methods that 
> we use have been deprecated in favor of new additions, and a new API 
> ({{numpy.typing}}) that is of some interest to us has been added.
> We should update the minimum version requirement to one of the following:
> - {{>=1.9.0}} ‒ the minimum reasonable bound, which will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ a reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38082) Update minimum numpy version

2022-02-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487180#comment-17487180
 ] 

Apache Spark commented on SPARK-38082:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35398

> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we set the numpy version in {{extras_require}} to be {{>=1.7}}.
> However, 1.7 was released almost 9 years ago; since then, some methods that 
> we use have been deprecated in favor of new additions, and a new API 
> ({{numpy.typing}}) that is of some interest to us has been added.
> We should update the minimum version requirement to one of the following:
> - {{>=1.9.0}} ‒ the minimum reasonable bound, which will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ a reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38110) Use error classes in the compilation errors of windows

2022-02-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-38110:
-
Description: 
Migrate the following errors in QueryCompilationErrors:
* windowSpecificationNotDefinedError
* windowAggregateFunctionWithFilterNotSupportedError
* windowFunctionInsideAggregateFunctionNotAllowedError
* expressionWithoutWindowExpressionError
* expressionWithMultiWindowExpressionsError
* windowFunctionNotAllowedError
* cannotSpecifyWindowFrameError
* windowFrameNotMatchRequiredFrameError
* windowFunctionWithWindowFrameNotOrderedError
* multiTimeWindowExpressionsNotSupportedError
* sessionWindowGapDurationDataTypeError
* invalidLiteralForWindowDurationError
* emptyWindowExpressionError
* foundDifferentWindowFunctionTypeError

to use error classes. Throw an implementation of SparkThrowable. Also write a 
test for every error in QueryCompilationErrorsSuite.

*Feel free to split this into sub-tasks.*

  was:
Migrate the following errors in QueryCompilationErrors:
* windowSpecificationNotDefinedError
* windowAggregateFunctionWithFilterNotSupportedError
* windowFunctionInsideAggregateFunctionNotAllowedError
* expressionWithoutWindowExpressionError
* expressionWithMultiWindowExpressionsError
* windowFunctionNotAllowedError
* cannotSpecifyWindowFrameError
* windowFrameNotMatchRequiredFrameError
* windowFunctionWithWindowFrameNotOrderedError
* multiTimeWindowExpressionsNotSupportedError
* sessionWindowGapDurationDataTypeError
* invalidLiteralForWindowDurationError
* emptyWindowExpressionError
* foundDifferentWindowFunctionTypeError

onto use error classes. Throw an implementation of SparkThrowable. Also write a 
test per every error in QueryCompilationErrorsSuite.


> Use error classes in the compilation errors of windows
> --
>
> Key: SPARK-38110
> URL: https://issues.apache.org/jira/browse/SPARK-38110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * windowSpecificationNotDefinedError
> * windowAggregateFunctionWithFilterNotSupportedError
> * windowFunctionInsideAggregateFunctionNotAllowedError
> * expressionWithoutWindowExpressionError
> * expressionWithMultiWindowExpressionsError
> * windowFunctionNotAllowedError
> * cannotSpecifyWindowFrameError
> * windowFrameNotMatchRequiredFrameError
> * windowFunctionWithWindowFrameNotOrderedError
> * multiTimeWindowExpressionsNotSupportedError
> * sessionWindowGapDurationDataTypeError
> * invalidLiteralForWindowDurationError
> * emptyWindowExpressionError
> * foundDifferentWindowFunctionTypeError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.
> *Feel free to split this into sub-tasks.*



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38110) Use error classes in the compilation errors of windows

2022-02-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-38110:


 Summary: Use error classes in the compilation errors of windows
 Key: SPARK-38110
 URL: https://issues.apache.org/jira/browse/SPARK-38110
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Migrate the following errors in QueryCompilationErrors:
* windowSpecificationNotDefinedError
* windowAggregateFunctionWithFilterNotSupportedError
* windowFunctionInsideAggregateFunctionNotAllowedError
* expressionWithoutWindowExpressionError
* expressionWithMultiWindowExpressionsError
* windowFunctionNotAllowedError
* cannotSpecifyWindowFrameError
* windowFrameNotMatchRequiredFrameError
* windowFunctionWithWindowFrameNotOrderedError
* multiTimeWindowExpressionsNotSupportedError
* sessionWindowGapDurationDataTypeError
* invalidLiteralForWindowDurationError
* emptyWindowExpressionError
* foundDifferentWindowFunctionTypeError

to use error classes. Throw an implementation of SparkThrowable. Also write a 
test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38109) pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1

2022-02-04 Thread ss (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ss updated SPARK-38109:
---
Description: 
The `subset` argument for `DataFrame.replace()` accepts one or more column 
names. In pyspark 3.2 the case of the column names must match the column names 
in the schema exactly or the replacements will not take place. In earlier 
versions (3.1.2 was tested) the argument is case insensitive.

Minimal example:

{{replace_dict = \{'wrong': 'right'}}}

{{df = spark.createDataFrame(}}
{{  [['wrong', 'wrong']], }}
{{  schema=['case_matched', 'case_unmatched']}}
{{)}}
{{df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])}}

 

In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) 
the result is:
|case_matched|case_unmatched|
|right|wrong|

While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) 
the result is:
|case_matched|case_unmatched|
|right|right|

I believe the expected behaviour is that shown in pyspark 3.1 as in all other 
situations column names are accepted in a case insensitive manner.

  was:
The `subset` argument for `DataFrame.replace()` accepts one or more column 
names. In pyspark 3.2 the case of the column names must match the column names 
in the schema exactly or the replacements will not take place. In earlier 
versions (3.1.2 was tested) the argument is case insensitive.

Minimal example:

{{
replace_dict = {'wrong': 'right'}
df = spark.createDataFrame(
  [['wrong', 'wrong']], 
  schema=['case_matched', 'case_unmatched']
)
df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
}}
In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) 
the result is:

|case_matched|case_unmatched|
|right|wrong|

While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) 
the result is:

|case_matched|case_unmatched|
|right|right|

I believe the expected behaviour is that shown in pyspark 3.1 as in all other 
situations column names are accepted in a case insensitive manner. 


> pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 
> but not in 3.1
> --
>
> Key: SPARK-38109
> URL: https://issues.apache.org/jira/browse/SPARK-38109
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1
>Reporter: ss
>Priority: Minor
>
> The `subset` argument for `DataFrame.replace()` accepts one or more column 
> names. In pyspark 3.2 the case of the column names must match the column 
> names in the schema exactly or the replacements will not take place. In 
> earlier versions (3.1.2 was tested) the argument is case insensitive.
> Minimal example:
> {{replace_dict = \{'wrong': 'right'}}}
> {{df = spark.createDataFrame(}}
> {{  [['wrong', 'wrong']], }}
> {{  schema=['case_matched', 'case_unmatched']}}
> {{)}}
> {{df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])}}
>  
> In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on 
> Databricks) the result is:
> |case_matched|case_unmatched|
> |right|wrong|
> While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on 
> Databricks) the result is:
> |case_matched|case_unmatched|
> |right|right|
> I believe the expected behaviour is that shown in pyspark 3.1 as in all other 
> situations column names are accepted in a case insensitive manner.
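
Until this is resolved one way or the other, a workaround sketch is to resolve the subset names against df.columns case-insensitively before calling replace. Here, replace_ci is a made-up helper name, and only the public PySpark API is used; df is the dataframe from the minimal example above:

{code:python}
def replace_ci(df, to_replace, subset):
    # Map each requested name onto the actual column name, ignoring case,
    # so replace() behaves the same way on Spark 3.1 and 3.2.
    by_lower = {c.lower(): c for c in df.columns}
    resolved = [by_lower.get(name.lower(), name) for name in subset]
    return df.replace(to_replace, subset=resolved)

df2 = replace_ci(df, {'wrong': 'right'}, ['case_matched', 'Case_Unmatched'])
{code}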



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38109) pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1

2022-02-04 Thread ss (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ss updated SPARK-38109:
---
Description: 
The `subset` argument for `DataFrame.replace()` accepts one or more column 
names. In pyspark 3.2 the case of the column names must match the column names 
in the schema exactly or the replacements will not take place. In earlier 
versions (3.1.2 was tested) the argument is case insensitive.

Minimal example:

{{
replace_dict = {'wrong': 'right'}
df = spark.createDataFrame(
  [['wrong', 'wrong']], 
  schema=['case_matched', 'case_unmatched']
)
df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
}}
In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) 
the result is:

|case_matched|case_unmatched|
|right|wrong|

While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) 
the result is:

|case_matched|case_unmatched|
|right|right|

I believe the expected behaviour is that shown in pyspark 3.1 as in all other 
situations column names are accepted in a case insensitive manner. 

  was:
The `subset` argument for `DataFrame.replace()` accepts one or more column 
names. In pyspark 3.2 the case of the column names must match the column names 
in the schema exactly or the replacements will not take place. In earlier 
versions (3.1.2 was tested) the argument is case insensitive.

Minimal example:

```python
replace_dict = {'wrong': 'right'}
df = spark.createDataFrame(
  [['wrong', 'wrong']], 
  schema=['case_matched', 'case_unmatched']
)
df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
```
In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) 
the result is:

|case_matched|case_unmatched|
|right|wrong|

While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) 
the result is:

|case_matched|case_unmatched|
|right|right|

I believe the expected behaviour is that shown in pyspark 3.1 as in all other 
situations column names are accepted in a case insensitive manner. 


> pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 
> but not in 3.1
> --
>
> Key: SPARK-38109
> URL: https://issues.apache.org/jira/browse/SPARK-38109
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1
>Reporter: ss
>Priority: Minor
>
> The `subset` argument for `DataFrame.replace()` accepts one or more column 
> names. In pyspark 3.2 the case of the column names must match the column 
> names in the schema exactly or the replacements will not take place. In 
> earlier versions (3.1.2 was tested) the argument is case insensitive.
> Minimal example:
> {{
> replace_dict = {'wrong': 'right'}
> df = spark.createDataFrame(
>   [['wrong', 'wrong']], 
>   schema=['case_matched', 'case_unmatched']
> )
> df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
> }}
> In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on 
> Databricks) the result is:
> |case_matched|case_unmatched|
> |right|wrong|
> While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on 
> Databricks) the result is:
> |case_matched|case_unmatched|
> |right|right|
> I believe the expected behaviour is that shown in pyspark 3.1 as in all other 
> situations column names are accepted in a case insensitive manner. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38109) pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1

2022-02-04 Thread ss (Jira)
ss created SPARK-38109:
--

 Summary: pyspark DataFrame.replace() is sensitive to column name 
case in pyspark 3.2 but not in 3.1
 Key: SPARK-38109
 URL: https://issues.apache.org/jira/browse/SPARK-38109
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.1, 3.2.0
Reporter: ss


The `subset` argument for `DataFrame.replace()` accepts one or more column 
names. In pyspark 3.2 the case of the column names must match the column names 
in the schema exactly or the replacements will not take place. In earlier 
versions (3.1.2 was tested) the argument is case insensitive.

Minimal example:

```python
replace_dict = {'wrong': 'right'}
df = spark.createDataFrame(
  [['wrong', 'wrong']], 
  schema=['case_matched', 'case_unmatched']
)
df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
```
In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) 
the result is:

|case_matched|case_unmatched|
|-|-|
|right|wrong|

While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) 
the result is:

|case_matched|case_unmatched|
|-|-|
|right|right|

I believe the expected behaviour is that shown in pyspark 3.1 as in all other 
situations column names are accepted in a case insensitive manner. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38109) pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1

2022-02-04 Thread ss (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ss updated SPARK-38109:
---
Description: 
The `subset` argument for `DataFrame.replace()` accepts one or more column 
names. In pyspark 3.2 the case of the column names must match the column names 
in the schema exactly or the replacements will not take place. In earlier 
versions (3.1.2 was tested) the argument is case insensitive.

Minimal example:

```python
replace_dict = {'wrong': 'right'}
df = spark.createDataFrame(
  [['wrong', 'wrong']], 
  schema=['case_matched', 'case_unmatched']
)
df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
```
In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) 
the result is:

|case_matched|case_unmatched|
|right|wrong|

While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) 
the result is:

|case_matched|case_unmatched|
|right|right|

I believe the expected behaviour is that shown in pyspark 3.1 as in all other 
situations column names are accepted in a case insensitive manner. 

  was:
The `subset` argument for `DataFrame.replace()` accepts one or more column 
names. In pyspark 3.2 the case of the column names must match the column names 
in the schema exactly or the replacements will not take place. In earlier 
versions (3.1.2 was tested) the argument is case insensitive.

Minimal example:

```python
replace_dict = {'wrong': 'right'}
df = spark.createDataFrame(
  [['wrong', 'wrong']], 
  schema=['case_matched', 'case_unmatched']
)
df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
```
In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) 
the result is:

|case_matched|case_unmatched|
|-|-|
|right|wrong|

While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) 
the result is:

|case_matched|case_unmatched|
|-|-|
|right|right|

I believe the expected behaviour is that shown in pyspark 3.1 as in all other 
situations column names are accepted in a case insensitive manner. 


> pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 
> but not in 3.1
> --
>
> Key: SPARK-38109
> URL: https://issues.apache.org/jira/browse/SPARK-38109
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1
>Reporter: ss
>Priority: Minor
>
> The `subset` argument for `DataFrame.replace()` accepts one or more column 
> names. In pyspark 3.2 the case of the column names must match the column 
> names in the schema exactly or the replacements will not take place. In 
> earlier versions (3.1.2 was tested) the argument is case insensitive.
> Minimal example:
> ```python
> replace_dict = {'wrong': 'right'}
> df = spark.createDataFrame(
>   [['wrong', 'wrong']], 
>   schema=['case_matched', 'case_unmatched']
> )
> df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
> ```
> In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on 
> Databricks) the result is:
> |case_matched|case_unmatched|
> |right|wrong|
> While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on 
> Databricks) the result is:
> |case_matched|case_unmatched|
> |right|right|
> I believe the expected behaviour is that shown in pyspark 3.1 as in all other 
> situations column names are accepted in a case insensitive manner. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs

2022-02-04 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487170#comment-17487170
 ] 

Max Gekk commented on SPARK-38107:
--

[~hyukjin.kwon] Do you know someone who could be interested in implementing 
this?

> Use error classes in the compilation errors of python/pandas UDFs
> -
>
> Key: SPARK-38107
> URL: https://issues.apache.org/jira/browse/SPARK-38107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * pandasUDFAggregateNotSupportedInPivotError
> * groupAggPandasUDFUnsupportedByStreamingAggError
> * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError
> * usePythonUDFInJoinConditionUnsupportedError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs

2022-02-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-38107:
-
Summary: Use error classes in the compilation errors of python/pandas UDFs  
(was: Use error classes in the compilation errors of pandas UDFs)

> Use error classes in the compilation errors of python/pandas UDFs
> -
>
> Key: SPARK-38107
> URL: https://issues.apache.org/jira/browse/SPARK-38107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * pandasUDFAggregateNotSupportedInPivotError
> * groupAggPandasUDFUnsupportedByStreamingAggError
> * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs

2022-02-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-38107:
-
Description: 
Migrate the following errors in QueryCompilationErrors:
* pandasUDFAggregateNotSupportedInPivotError
* groupAggPandasUDFUnsupportedByStreamingAggError
* cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError
* usePythonUDFInJoinConditionUnsupportedError

to use error classes. Throw an implementation of SparkThrowable. Also write a 
test for every error in QueryCompilationErrorsSuite.

  was:
Migrate the following errors in QueryCompilationErrors:
* pandasUDFAggregateNotSupportedInPivotError
* groupAggPandasUDFUnsupportedByStreamingAggError
* cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError

onto use error classes. Throw an implementation of SparkThrowable. Also write a 
test per every error in QueryCompilationErrorsSuite.


> Use error classes in the compilation errors of python/pandas UDFs
> -
>
> Key: SPARK-38107
> URL: https://issues.apache.org/jira/browse/SPARK-38107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * pandasUDFAggregateNotSupportedInPivotError
> * groupAggPandasUDFUnsupportedByStreamingAggError
> * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError
> * usePythonUDFInJoinConditionUnsupportedError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38108) Use error classes in the compilation errors of UDF/UDAF

2022-02-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-38108:


 Summary: Use error classes in the compilation errors of UDF/UDAF
 Key: SPARK-38108
 URL: https://issues.apache.org/jira/browse/SPARK-38108
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Migrate the following errors in QueryCompilationErrors:
* noHandlerForUDAFError
* unexpectedEvalTypesForUDFsError
* usingUntypedScalaUDFError
* udfClassDoesNotImplementAnyUDFInterfaceError
* udfClassNotAllowedToImplementMultiUDFInterfacesError
* udfClassWithTooManyTypeArgumentsError

to use error classes. Throw an implementation of SparkThrowable. Also write a 
test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38107) Use error classes in the compilation errors of pandas UDFs

2022-02-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-38107:


 Summary: Use error classes in the compilation errors of pandas UDFs
 Key: SPARK-38107
 URL: https://issues.apache.org/jira/browse/SPARK-38107
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Migrate the following errors in QueryCompilationErrors:
* pandasUDFAggregateNotSupportedInPivotError
* groupAggPandasUDFUnsupportedByStreamingAggError
* cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError

to use error classes. Throw an implementation of SparkThrowable. Also write a 
test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset

2022-02-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487154#comment-17487154
 ] 

Apache Spark commented on SPARK-38102:
--

User 'ocworld' has created a pull request for this issue:
https://github.com/apache/spark/pull/35397

> Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
> --
>
> Key: SPARK-38102
> URL: https://issues.apache.org/jira/browse/SPARK-38102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Keunhyun Oh
>Priority: Major
>
> There is no way to apply spark-hadoop-cloud's commitProtocolClass, which exists 
> to avoid object storage commit problems, when using saveAsNewAPIHadoopDataset.
> [https://spark.apache.org/docs/latest/cloud-integration.html]
>  
> A custom commitProtocolClass should be supported for saveAsNewAPIHadoopDataset 
> via an option. For example,
> {code:java}
> spark.hadoop.mapreduce.sources.commitProtocolClass 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code}
>  
>  
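
For illustration, this is how the proposed option could be set from PySpark if it were supported. The option name is the proposal in this ticket, not an existing setting honored by saveAsNewAPIHadoopDataset today, and PathOutputCommitProtocol comes from the spark-hadoop-cloud module:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("custom-commit-protocol-sketch")  # assumed app name
    # Proposed option from this ticket; currently ignored by saveAsNewAPIHadoopDataset.
    .config(
        "spark.hadoop.mapreduce.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    )
    .getOrCreate()
)
{code}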



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset

2022-02-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487153#comment-17487153
 ] 

Apache Spark commented on SPARK-38102:
--

User 'ocworld' has created a pull request for this issue:
https://github.com/apache/spark/pull/35397

> Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
> --
>
> Key: SPARK-38102
> URL: https://issues.apache.org/jira/browse/SPARK-38102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Keunhyun Oh
>Priority: Major
>
> There is no way to apply spark-hadoop-cloud's commitProtocolClass, which exists 
> to avoid object storage commit problems, when using saveAsNewAPIHadoopDataset.
> [https://spark.apache.org/docs/latest/cloud-integration.html]
>  
> A custom commitProtocolClass should be supported for saveAsNewAPIHadoopDataset 
> via an option. For example,
> {code:java}
> spark.hadoop.mapreduce.sources.commitProtocolClass 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset

2022-02-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38102:


Assignee: (was: Apache Spark)

> Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
> --
>
> Key: SPARK-38102
> URL: https://issues.apache.org/jira/browse/SPARK-38102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Keunhyun Oh
>Priority: Major
>
> There is no way to apply spark-hadoop-cloud's commitProtocolClass, which exists 
> to avoid object storage commit problems, when using saveAsNewAPIHadoopDataset.
> [https://spark.apache.org/docs/latest/cloud-integration.html]
>  
> A custom commitProtocolClass should be supported for saveAsNewAPIHadoopDataset 
> via an option. For example,
> {code:java}
> spark.hadoop.mapreduce.sources.commitProtocolClass 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset

2022-02-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38102:


Assignee: Apache Spark

> Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
> --
>
> Key: SPARK-38102
> URL: https://issues.apache.org/jira/browse/SPARK-38102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Keunhyun Oh
>Assignee: Apache Spark
>Priority: Major
>
> There is no way to apply spark-hadoop-cloud's commitProtocolClass, which exists 
> to avoid object storage commit problems, when using saveAsNewAPIHadoopDataset.
> [https://spark.apache.org/docs/latest/cloud-integration.html]
>  
> A custom commitProtocolClass should be supported for saveAsNewAPIHadoopDataset 
> via an option. For example,
> {code:java}
> spark.hadoop.mapreduce.sources.commitProtocolClass 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset

2022-02-04 Thread Keunhyun Oh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keunhyun Oh updated SPARK-38102:

Description: 
There is no way to apply spark-hadoop-cloud's commitProtocolClass when using 
saveAsNewAPIHadoopDataset, even though it is needed to avoid object storage 
commit problems.

[https://spark.apache.org/docs/latest/cloud-integration.html]

 

A custom commitProtocolClass should be supported for saveAsNewAPIHadoopDataset 
via an option, for example:
{code:java}
spark.hadoop.mapreduce.sources.commitProtocolClass 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code}
 

 

  was:
A custom commitProtocolClass should be supported when using 
saveAsNewAPIHadoopDataset.

Currently there is no way to apply spark-hadoop-cloud's commitProtocolClass when 
using saveAsNewAPIHadoopDataset, even though it is needed to avoid object 
storage commit problems.


> Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
> --
>
> Key: SPARK-38102
> URL: https://issues.apache.org/jira/browse/SPARK-38102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Keunhyun Oh
>Priority: Major
>
> There is no way to apply spark-hadoop-cloud's commitProtocolClass when using 
> saveAsNewAPIHadoopDataset, even though it is needed to avoid object storage 
> commit problems.
> [https://spark.apache.org/docs/latest/cloud-integration.html]
>  
> A custom commitProtocolClass should be supported for saveAsNewAPIHadoopDataset 
> via an option, for example:
> {code:java}
> spark.hadoop.mapreduce.sources.commitProtocolClass 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38106) Use error classes in the parsing errors of functions

2022-02-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-38106:


 Summary: Use error classes in the parsing errors of functions
 Key: SPARK-38106
 URL: https://issues.apache.org/jira/browse/SPARK-38106
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Migrate the following errors in QueryParsingErrors:
* functionNameUnsupportedError
* showFunctionsUnsupportedError
* showFunctionsInvalidPatternError
* createFuncWithBothIfNotExistsAndReplaceError
* defineTempFuncWithIfNotExistsError
* unsupportedFunctionNameError
* specifyingDBInCreateTempFuncError
* invalidNameForDropTempFunc

so that they use error classes. Throw an implementation of SparkThrowable, and 
write a test for every error in QueryParsingErrorsSuite.
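As background on what "use error classes" means here, below is a small illustrative sketch of the pattern in Python. The names (ERROR_CLASSES, SparkThrowableLike, unsupported_function_name_error) are hypothetical and do not mirror Spark's internal APIs; the point is that each error is identified by a stable error class with a message template, the thrown exception carries the class and message parameters, and tests assert on those fields rather than on the rendered text. The same pattern applies to the join, window, and transform parsing errors in the sibling tickets.

{code:python}
# Illustrative sketch of the error-class pattern; all names are hypothetical.
ERROR_CLASSES = {
    # error class -> message template, analogous to an error-classes registry
    "UNSUPPORTED_FUNCTION_NAME": "Unsupported function name '<name>'",
}


class SparkThrowableLike(Exception):
    """Carries an error class plus message parameters, like SparkThrowable."""

    def __init__(self, error_class: str, message_parameters: dict):
        self.error_class = error_class
        self.message_parameters = message_parameters
        message = ERROR_CLASSES[error_class]
        for key, value in message_parameters.items():
            message = message.replace(f"<{key}>", value)
        super().__init__(f"[{error_class}] {message}")


def unsupported_function_name_error(name: str) -> SparkThrowableLike:
    # A QueryParsingErrors-style helper that returns a structured error.
    return SparkThrowableLike("UNSUPPORTED_FUNCTION_NAME", {"name": name})


# A QueryParsingErrorsSuite-style check: assert on the error class and
# parameters instead of the message text.
try:
    raise unsupported_function_name_error("my.bad.func")
except SparkThrowableLike as e:
    assert e.error_class == "UNSUPPORTED_FUNCTION_NAME"
    assert e.message_parameters["name"] == "my.bad.func"
{code}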



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38105) Use error classes in the parsing errors of joins

2022-02-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-38105:


 Summary: Use error classes in the parsing errors of joins
 Key: SPARK-38105
 URL: https://issues.apache.org/jira/browse/SPARK-38105
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Migrate the following errors in QueryParsingErrors:
* joinCriteriaUnimplementedError
* naturalCrossJoinUnsupportedError

so that they use error classes. Throw an implementation of SparkThrowable, and 
write a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38104) Use error classes in the parsing errors of windows

2022-02-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-38104:


 Summary: Use error classes in the parsing errors of windows
 Key: SPARK-38104
 URL: https://issues.apache.org/jira/browse/SPARK-38104
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Migrate the following errors in QueryParsingErrors:
* repetitiveWindowDefinitionError
* invalidWindowReferenceError
* cannotResolveWindowReferenceError

so that they use error classes. Throw an implementation of SparkThrowable, and 
write a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38103) Use error classes in the parsing errors of transform

2022-02-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-38103:


 Summary: Use error classes in the parsing errors of transform
 Key: SPARK-38103
 URL: https://issues.apache.org/jira/browse/SPARK-38103
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Migrate the following errors in QueryParsingErrors:
* transformNotSupportQuantifierError
* transformWithSerdeUnsupportedError
* tooManyArgumentsForTransformError
* notEnoughArgumentsForTransformError
* invalidTransformArgumentError

so that they use error classes. Throw an implementation of SparkThrowable, and 
write a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2022-02-04 Thread Rob D (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487134#comment-17487134
 ] 

Rob D commented on SPARK-6305:
--

Thank you for this update. Has this change been included in an official release 
yet?

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Tal Sliwowicz
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0
>
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset

2022-02-04 Thread Keunhyun Oh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keunhyun Oh updated SPARK-38102:

Summary: Supporting custom commitProtocolClass when using 
saveAsNewAPIHadoopDataset  (was: Supporting custom commitProtocolClass and 
committer class when using saveAsNewAPIHadoopDataset)

> Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
> --
>
> Key: SPARK-38102
> URL: https://issues.apache.org/jira/browse/SPARK-38102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Keunhyun Oh
>Priority: Major
>
> Custom commitProtocolClass and committer classes should be supported when 
> using saveAsNewAPIHadoopDataset.
> Currently there is no way to apply spark-hadoop-cloud's commitProtocolClass 
> and committer classes when using saveAsNewAPIHadoopDataset, even though they 
> are needed to avoid object storage commit problems.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset

2022-02-04 Thread Keunhyun Oh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keunhyun Oh updated SPARK-38102:

Description: 
A custom commitProtocolClass should be supported when using 
saveAsNewAPIHadoopDataset.

Currently there is no way to apply spark-hadoop-cloud's commitProtocolClass when 
using saveAsNewAPIHadoopDataset, even though it is needed to avoid object 
storage commit problems.

  was:
Custom commitProtocolClass and committer classes should be supported when 
using saveAsNewAPIHadoopDataset.

Currently there is no way to apply spark-hadoop-cloud's commitProtocolClass and 
committer classes when using saveAsNewAPIHadoopDataset, even though they are 
needed to avoid object storage commit problems.


> Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
> --
>
> Key: SPARK-38102
> URL: https://issues.apache.org/jira/browse/SPARK-38102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Keunhyun Oh
>Priority: Major
>
> A custom commitProtocolClass should be supported when using 
> saveAsNewAPIHadoopDataset.
> Currently there is no way to apply spark-hadoop-cloud's commitProtocolClass 
> when using saveAsNewAPIHadoopDataset, even though it is needed to avoid object 
> storage commit problems.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38102) Supporting custom commitProtocolClass and committer class when using saveAsNewAPIHadoopDataset

2022-02-04 Thread Keunhyun Oh (Jira)
Keunhyun Oh created SPARK-38102:
---

 Summary: Supporting custom commitProtocolClass and committer class 
when using saveAsNewAPIHadoopDataset
 Key: SPARK-38102
 URL: https://issues.apache.org/jira/browse/SPARK-38102
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Keunhyun Oh


Custom commitProtocolClass and committer classes should be supported when 
using saveAsNewAPIHadoopDataset.

Currently there is no way to apply spark-hadoop-cloud's commitProtocolClass and 
committer classes when using saveAsNewAPIHadoopDataset, even though they are 
needed to avoid object storage commit problems.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7

2022-02-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487055#comment-17487055
 ] 

Apache Spark commented on SPARK-38073:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35396

> NameError: name 'sc' is not defined when running driver with IPython and Python 
> > 3.7
> ---
>
> Key: SPARK-38073
> URL: https://issues.apache.org/jira/browse/SPARK-38073
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Shell
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with 
> Python >= 3.8, a function registered with atexit seems to be executed in a 
> different scope than in Python 3.7.
> It results in {{NameError: name 'sc' is not defined}} on exit:
> {code:python}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>   /_/
> Using Python version 3.8.12 (default, Oct 12 2021 21:57:06)
> Spark context Web UI available at http://192.168.0.198:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1643555855409).
> SparkSession available as 'spark'.
> In [1]:   
>   
> 
> Do you really want to exit ([y]/n)? y
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/path/to/spark/python/pyspark/shell.py", line 49, in 
> atexit.register(lambda: sc.stop())
> NameError: name 'sc' is not defined
> {code}
> This can easily be fixed by capturing the `sc` instance:
> {code:none}
> diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
> index f0c487877a..4164e3ab0c 100644
> --- a/python/pyspark/shell.py
> +++ b/python/pyspark/shell.py
> @@ -46,7 +46,7 @@ except Exception:
>  
>  sc = spark.sparkContext
>  sql = spark.sql
> -atexit.register(lambda: sc.stop())
> +atexit.register((lambda sc: lambda: sc.stop())(sc))
>  
>  # for compatibility
>  sqlContext = spark._wrapped
> {code}
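To make the failure mode and the fix concrete, here is a small self-contained sketch, independent of pyspark/shell.py. FakeContext stands in for SparkContext, and `del sc` simulates the global name being cleared before the exit hooks run, which is effectively what the report above describes under IPython on Python 3.8+.

{code:python}
import atexit


class FakeContext:
    """Stand-in for SparkContext with just a stop() method."""

    def stop(self):
        print("stopped")


sc = FakeContext()

# Late binding: `sc` is looked up only when the exit hook runs, so this hook
# raises NameError if the global name is gone by then.
atexit.register(lambda: sc.stop())

# Early binding: the instance is captured now, so this hook no longer depends
# on the global name still existing at interpreter shutdown.
atexit.register((lambda captured: lambda: captured.stop())(sc))

# Simulate the name being cleared before shutdown: on exit, the capturing hook
# prints "stopped", while the late-binding one fails with NameError.
del sc
{code}

A default-argument lambda ({{lambda sc=sc: sc.stop()}}) or functools.partial would capture the instance in the same way.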



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7

2022-02-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38073:


Assignee: (was: Apache Spark)

> NameError: name 'sc' is not defined when running driver with IPython and Python 
> > 3.7
> ---
>
> Key: SPARK-38073
> URL: https://issues.apache.org/jira/browse/SPARK-38073
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Shell
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with 
> Python >= 3.8, a function registered with atexit seems to be executed in a 
> different scope than in Python 3.7.
> It results in {{NameError: name 'sc' is not defined}} on exit:
> {code:python}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>   /_/
> Using Python version 3.8.12 (default, Oct 12 2021 21:57:06)
> Spark context Web UI available at http://192.168.0.198:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1643555855409).
> SparkSession available as 'spark'.
> In [1]:   
>   
> 
> Do you really want to exit ([y]/n)? y
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/path/to/spark/python/pyspark/shell.py", line 49, in 
> atexit.register(lambda: sc.stop())
> NameError: name 'sc' is not defined
> {code}
> This can easily be fixed by capturing the `sc` instance:
> {code:none}
> diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
> index f0c487877a..4164e3ab0c 100644
> --- a/python/pyspark/shell.py
> +++ b/python/pyspark/shell.py
> @@ -46,7 +46,7 @@ except Exception:
>  
>  sc = spark.sparkContext
>  sql = spark.sql
> -atexit.register(lambda: sc.stop())
> +atexit.register((lambda sc: lambda: sc.stop())(sc))
>  
>  # for compatibility
>  sqlContext = spark._wrapped
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7

2022-02-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38073:


Assignee: Apache Spark

> NameError: name 'sc' is not defined when running driver with IPython and Python 
> > 3.7
> ---
>
> Key: SPARK-38073
> URL: https://issues.apache.org/jira/browse/SPARK-38073
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Shell
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with 
> Python >= 3.8, a function registered with atexit seems to be executed in a 
> different scope than in Python 3.7.
> It results in {{NameError: name 'sc' is not defined}} on exit:
> {code:python}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>   /_/
> Using Python version 3.8.12 (default, Oct 12 2021 21:57:06)
> Spark context Web UI available at http://192.168.0.198:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1643555855409).
> SparkSession available as 'spark'.
> In [1]:   
>   
> 
> Do you really want to exit ([y]/n)? y
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/path/to/spark/python/pyspark/shell.py", line 49, in 
> atexit.register(lambda: sc.stop())
> NameError: name 'sc' is not defined
> {code}
> This can easily be fixed by capturing the `sc` instance:
> {code:none}
> diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
> index f0c487877a..4164e3ab0c 100644
> --- a/python/pyspark/shell.py
> +++ b/python/pyspark/shell.py
> @@ -46,7 +46,7 @@ except Exception:
>  
>  sc = spark.sparkContext
>  sql = spark.sql
> -atexit.register(lambda: sc.stop())
> +atexit.register((lambda sc: lambda: sc.stop())(sc))
>  
>  # for compatibility
>  sqlContext = spark._wrapped
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org