[jira] [Assigned] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40133:


Assignee: (was: Apache Spark)

> Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is 
> true
> ---
>
> Key: SPARK-40133
> URL: https://issues.apache.org/jira/browse/SPARK-40133
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Commented] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581160#comment-17581160
 ] 

Apache Spark commented on SPARK-40133:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37562

> Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is 
> true
> ---
>
> Key: SPARK-40133
> URL: https://issues.apache.org/jira/browse/SPARK-40133
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40133:


Assignee: Apache Spark

> Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is 
> true
> ---
>
> Key: SPARK-40133
> URL: https://issues.apache.org/jira/browse/SPARK-40133
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581159#comment-17581159
 ] 

Apache Spark commented on SPARK-40133:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37562

> Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is 
> true
> ---
>
> Key: SPARK-40133
> URL: https://issues.apache.org/jira/browse/SPARK-40133
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true

2022-08-17 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-40133:
---

 Summary: Regenerate excludedTpcdsQueries's golden files if 
regenerateGoldenFiles is true
 Key: SPARK-40133
 URL: https://issues.apache.org/jira/browse/SPARK-40133
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.4.0
Reporter: Yuming Wang









[jira] [Resolved] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams

2022-08-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40132.
--
Fix Version/s: 3.4.0
   3.3.1
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37561

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> 
>
> Key: SPARK-40132
> URL: https://issues.apache.org/jira/browse/SPARK-40132
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in 
> Pyspark ML's classification.py but inadvertently removed the parameter 
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes 
> its constructor to fail when this param is set in the constructor, as it 
> isn't recognized by setParams, called by the constructor.






[jira] [Updated] (SPARK-40123) Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar

2022-08-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40123:
-
Fix Version/s: (was: 3.3.1)

> Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar
> 
>
> Key: SPARK-40123
> URL: https://issues.apache.org/jira/browse/SPARK-40123
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.3.0
>Reporter: manohar
>Priority: Major
>  Labels: security-issue
>
> Hello Team,
> We are facing this vulnerability on Spark installation 3.3.3. Can we please 
> upgrade the version of Mesos in our installation to address this 
> vulnerability?
> ||Package||cve||cvss||severity||pkg_version||fixed_in_pkg||pkg_path||
> |1|org.apache.mesos_mesos|CVE-2018-11793|7|high|1.4.0|1.7.1, 1.6.2, 1.5.2, 
> 1.4.3|/opt/domino/spark/python/build/lib/pyspark/jars/mesos-1.4.0-shaded-protobuf.jar|
> In our source code I found that the dependent version of the Mesos jar is 1.4.3:
> user@ThinkPad-E14-02:~/Downloads/spark-master$ grep -ir mesos- * 
> core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala: * 
> TaskSchedulerImpl. We assume a Mesos-like model where the application gets 
> resource offers as
> *dev/deps/spark-deps-hadoop-2-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
> dev/deps/spark-deps-hadoop-3-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
> *






[jira] [Resolved] (SPARK-40115) Pin Arrow version to 8.0.0 in AppVeyor

2022-08-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40115.
--
Resolution: Invalid

> Pin Arrow version to 8.0.0 in AppVeyor
> --
>
> Key: SPARK-40115
> URL: https://issues.apache.org/jira/browse/SPARK-40115
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently SparkR tests fail 
> (https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387) because 
> SparkR does not support Arrow 9.0.0+; see also SPARK-40114.
> We should pin the version to 8.0.0 for now.






[jira] [Assigned] (SPARK-38946) Generates a new dataframe instead of operating inplace in setitem

2022-08-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38946:


Assignee: Yikun Jiang

> Generates a new dataframe instead of operating inplace in setitem
> -
>
> Key: SPARK-38946
> URL: https://issues.apache.org/jira/browse/SPARK-38946
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> {code:java}
> DataFrameTest.test_eval
> DataFrameTest.test_update
> DataFrameTest.test_inplace
> DataFrameTest.test_fillna{code}






[jira] [Resolved] (SPARK-38946) Generates a new dataframe instead of operating inplace in setitem

2022-08-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38946.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36353
[https://github.com/apache/spark/pull/36353]

> Generates a new dataframe instead of operating inplace in setitem
> -
>
> Key: SPARK-38946
> URL: https://issues.apache.org/jira/browse/SPARK-38946
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> DataFrameTest.test_eval
> DataFrameTest.test_update
> DataFrameTest.test_inplace
> DataFrameTest.test_fillna{code}
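For context, a minimal pandas-on-Spark sketch of the setitem path these tests cover (illustrative only; assuming an active Spark session, and the column names are made up):

{code}
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})
# With this change, __setitem__ builds a new internal frame rather than
# mutating the existing one in place.
psdf["b"] = psdf["a"] + 1
print(psdf)
{code}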






[jira] [Resolved] (SPARK-40121) Initialize projection used for Python UDF

2022-08-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40121.
--
Fix Version/s: 3.3.1
   3.1.4
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37552
[https://github.com/apache/spark/pull/37552]

> Initialize projection used for Python UDF
> -
>
> Key: SPARK-40121
> URL: https://issues.apache.org/jira/browse/SPARK-40121
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0
>
>
> {code}
> >>> from pyspark.sql.functions import udf, rand
> >>> spark.range(10).select(udf(lambda x: x)(rand())).show()
> {code}
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at 
> scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161)
>   at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
>   at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213)
> {code}






[jira] [Assigned] (SPARK-40121) Initialize projection used for Python UDF

2022-08-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40121:


Assignee: Hyukjin Kwon

> Initialize projection used for Python UDF
> -
>
> Key: SPARK-40121
> URL: https://issues.apache.org/jira/browse/SPARK-40121
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {code}
> >>> from pyspark.sql.functions import udf, rand
> >>> spark.range(10).select(udf(lambda x: x)(rand())).show()
> {code}
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at 
> scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161)
>   at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
>   at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213)
> {code}






[jira] [Commented] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1758#comment-1758
 ] 

Apache Spark commented on SPARK-40132:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/37561

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> 
>
> Key: SPARK-40132
> URL: https://issues.apache.org/jira/browse/SPARK-40132
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in 
> Pyspark ML's classification.py but inadvertently removed the parameter 
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes 
> its constructor to fail when this param is set in the constructor, as it 
> isn't recognized by setParams, called by the constructor.






[jira] [Assigned] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40132:


Assignee: Apache Spark  (was: Sean R. Owen)

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> 
>
> Key: SPARK-40132
> URL: https://issues.apache.org/jira/browse/SPARK-40132
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Sean R. Owen
>Assignee: Apache Spark
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in 
> Pyspark ML's classification.py but inadvertently removed the parameter 
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes 
> its constructor to fail when this param is set in the constructor, as it 
> isn't recognized by setParams, called by the constructor.






[jira] [Assigned] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40132:


Assignee: Sean R. Owen  (was: Apache Spark)

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> 
>
> Key: SPARK-40132
> URL: https://issues.apache.org/jira/browse/SPARK-40132
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in 
> Pyspark ML's classification.py but inadvertently removed the parameter 
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes 
> its constructor to fail when this param is set in the constructor, as it 
> isn't recognized by setParams, called by the constructor.






[jira] [Commented] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581110#comment-17581110
 ] 

Apache Spark commented on SPARK-40132:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/37561

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> 
>
> Key: SPARK-40132
> URL: https://issues.apache.org/jira/browse/SPARK-40132
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in 
> Pyspark ML's classification.py but inadvertently removed the parameter 
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes 
> its constructor to fail when this param is set in the constructor, as it 
> isn't recognized by setParams, called by the constructor.






[jira] [Created] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams

2022-08-17 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-40132:


 Summary: MultilayerPerceptronClassifier rawPredictionCol param 
missing from setParams
 Key: SPARK-40132
 URL: https://issues.apache.org/jira/browse/SPARK-40132
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.3.0
Reporter: Sean R. Owen
Assignee: Sean R. Owen


https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in Pyspark 
ML's classification.py but inadvertently removed the parameter rawPredictionCol 
from MultilayerPerceptronClassifier's setParams. This causes its constructor to 
fail when this param is set in the constructor, as it isn't recognized by 
setParams, called by the constructor.
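A minimal sketch of the failing call described above (assuming an active Spark session and PySpark 3.3.0; the layer sizes are arbitrary):

{code}
from pyspark.ml.classification import MultilayerPerceptronClassifier

# On affected versions this raises a TypeError, because the constructor forwards
# its keyword arguments to setParams, which no longer accepts rawPredictionCol.
mlp = MultilayerPerceptronClassifier(
    layers=[4, 5, 3],
    rawPredictionCol="rawPrediction",
)
{code}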






[jira] [Created] (SPARK-40131) Support NumPy ndarray in built-in functions

2022-08-17 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-40131:


 Summary: Support NumPy ndarray in built-in functions
 Key: SPARK-40131
 URL: https://issues.apache.org/jira/browse/SPARK-40131
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Per [https://github.com/apache/spark/pull/37560#discussion_r948572473]
we want to support NumPy ndarray in built-in functions
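Illustrative of the intended behaviour (a sketch assuming an active {{spark}} session; on releases without this support the call below raises a type error):

{code}
import numpy as np
from pyspark.sql import functions as F

# Goal: an ndarray literal should become an array column with matching element type.
spark.range(1).select(F.lit(np.array([1.0, 2.0, 3.0]))).show()
{code}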






[jira] [Commented] (SPARK-40130) Support NumPy scalars in built-in functions

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581090#comment-17581090
 ] 

Apache Spark commented on SPARK-40130:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37560

> Support NumPy scalars in built-in functions
> ---
>
> Key: SPARK-40130
> URL: https://issues.apache.org/jira/browse/SPARK-40130
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support NumPy scalars in built-in functions by introducing Py4J input 
> converter `NumpyScalarConverter`.






[jira] [Assigned] (SPARK-40130) Support NumPy scalars in built-in functions

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40130:


Assignee: (was: Apache Spark)

> Support NumPy scalars in built-in functions
> ---
>
> Key: SPARK-40130
> URL: https://issues.apache.org/jira/browse/SPARK-40130
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support NumPy scalars in built-in functions by introducing Py4J input 
> converter `NumpyScalarConverter`.






[jira] [Commented] (SPARK-40130) Support NumPy scalars in built-in functions

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581091#comment-17581091
 ] 

Apache Spark commented on SPARK-40130:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37560

> Support NumPy scalars in built-in functions
> ---
>
> Key: SPARK-40130
> URL: https://issues.apache.org/jira/browse/SPARK-40130
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support NumPy scalars in built-in functions by introducing Py4J input 
> converter `NumpyScalarConverter`.






[jira] [Assigned] (SPARK-40130) Support NumPy scalars in built-in functions

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40130:


Assignee: Apache Spark

> Support NumPy scalars in built-in functions
> ---
>
> Key: SPARK-40130
> URL: https://issues.apache.org/jira/browse/SPARK-40130
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Support NumPy scalars in built-in functions by introducing Py4J input 
> converter `NumpyScalarConverter`.






[jira] [Created] (SPARK-40130) Support NumPy scalars in built-in functions

2022-08-17 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-40130:


 Summary: Support NumPy scalars in built-in functions
 Key: SPARK-40130
 URL: https://issues.apache.org/jira/browse/SPARK-40130
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Support NumPy scalars in built-in functions by introducing Py4J input converter 
`NumpyScalarConverter`.
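Illustrative of the intended behaviour (a sketch assuming an active {{spark}} session; on releases without this support the calls below raise a type error):

{code}
import numpy as np
from pyspark.sql import functions as F

# Goal: NumPy scalar inputs should be accepted wherever plain Python scalars are.
spark.range(3).select(F.lit(np.float64(1.5)), F.lit(np.int64(7))).show()
{code}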






[jira] [Commented] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581077#comment-17581077
 ] 

L. C. Hsieh commented on SPARK-40128:
-

Added [~dennishuo] as Spark contributor and assigned this JIRA to him.

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Assignee: Dennis Huo
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).






[jira] [Assigned] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-40128:
---

Assignee: Dennis Huo

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Assignee: Dennis Huo
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).






[jira] [Commented] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581073#comment-17581073
 ] 

Chao Sun commented on SPARK-40128:
--

Seems we need to add [~dennishuo] as Spark contributor in order to assign him 
the JIRA. [~dongjoon] [~viirya] could you help? 

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).






[jira] [Resolved] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-40128.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37557
[https://github.com/apache/spark/pull/37557]

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).
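A sketch of the reproduction described above (assuming an active {{spark}} session and a local copy of delta_length_byte_array.parquet from the apache/parquet-testing repository; the path below is a placeholder):

{code}
# With the vectorized Parquet reader enabled (the default), affected versions
# rejected the standalone DELTA_LENGTH_BYTE_ARRAY encoding as unsupported.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.read.parquet("/tmp/delta_length_byte_array.parquet").show()
{code}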






[jira] [Commented] (SPARK-37544) sequence over dates with month interval is producing incorrect results

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581068#comment-17581068
 ] 

Apache Spark commented on SPARK-37544:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/37559

> sequence over dates with month interval is producing incorrect results
> --
>
> Key: SPARK-37544
> URL: https://issues.apache.org/jira/browse/SPARK-37544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
> Environment: Ubuntu 20, OSX 11.6
> OpenJDK 11, Spark 3.2
>Reporter: Vsevolod Ostapenko
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.3.0, 3.2.2
>
>
> Sequence function with dates and step interval in months producing unexpected 
> results.
> Here is a sample using Spark 3.2 (though the behavior is the same in 3.1.1 
> and presumably earlier):
> {{scala> spark.sql("select sequence(date '2021-01-01', date '2022-01-01', 
> interval '3' month) x, date '2021-01-01' + interval '3' month y").collect()}}
> {{res1: Array[org.apache.spark.sql.Row] = Array([WrappedArray(2021-01-01, 
> *2021-03-31, 2021-06-30, 2021-09-30,* 2022-01-01),2021-04-01])}}
> Expected result of adding 3 months to the 2021-01-01 is 2021-04-01, while 
> sequence returns 2021-03-31.
> At the same time sequence over timestamps works as expected:
> {{scala> spark.sql("select sequence(timestamp '2021-01-01 00:00', timestamp 
> '2022-01-01 00:00', interval '3' month) x").collect()}}
> {{res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(2021-01-01 
> 00:00:00.0, *2021-04-01* 00:00:00.0, *2021-07-01* 00:00:00.0, *2021-10-01* 
> 00:00:00.0, 2022-01-01 00:00:00.0)])}}
>  
> A similar issue was reported in the past: SPARK-31654 (sequence producing 
> inconsistent intervals for month step).
> It's marked as resolved, but the problem has either resurfaced or was never 
> actually fixed.






[jira] [Assigned] (SPARK-40110) Add JDBCWithAQESuite

2022-08-17 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-40110:


Assignee: Kazuyuki Tanimura

> Add JDBCWithAQESuite
> 
>
> Key: SPARK-40110
> URL: https://issues.apache.org/jira/browse/SPARK-40110
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Minor
>
> Currently `JDBCSuite` assumes that AQE is always turned off. We should also 
> test with AQE turned on






[jira] [Resolved] (SPARK-40110) Add JDBCWithAQESuite

2022-08-17 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-40110.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37544
[https://github.com/apache/spark/pull/37544]

> Add JDBCWithAQESuite
> 
>
> Key: SPARK-40110
> URL: https://issues.apache.org/jira/browse/SPARK-40110
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Minor
> Fix For: 3.4.0
>
>
> Currently `JDBCSuite` assumes that AQE is always turned off. We should also 
> test with AQE turned on






[jira] [Resolved] (SPARK-40109) New SQL function: get()

2022-08-17 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-40109.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37541
[https://github.com/apache/spark/pull/37541]

> New SQL function: get()
> ---
>
> Key: SPARK-40109
> URL: https://issues.apache.org/jira/browse/SPARK-40109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, when accessing an array element with an invalid index under ANSI 
> SQL mode, the error is like:
> {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 
> elements. Use `try_element_at` and increase the array index by 1(the starting 
> array index is 1 for `try_element_at`) to tolerate accessing element at 
> invalid index and return NULL instead. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error.
> {quote}
> The provided solution is complicated. I suggest introducing a new method 
> get() which always returns null on an invalid array index. This is from 
> [https://docs.snowflake.com/en/sql-reference/functions/get.html.]
> Since Spark's map access always returns null, let's not support map type in 
> the get method for now.
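A usage sketch of the proposed function (assuming an active {{spark}} session and 0-based indexing, following the Snowflake GET function referenced above):

{code}
# Under ANSI mode, element_at with an out-of-range index raises INVALID_ARRAY_INDEX;
# get() is expected to return NULL instead of failing.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT get(array(1, 2, 3), 5)").show()   # expected: NULL
spark.sql("SELECT get(array(1, 2, 3), 1)").show()   # expected: 2 (0-based index)
{code}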






[jira] [Created] (SPARK-40129) Decimal multiply can produce the wrong answer because it rounds twice

2022-08-17 Thread Robert Joseph Evans (Jira)
Robert Joseph Evans created SPARK-40129:
---

 Summary: Decimal multiply can produce the wrong answer because it 
rounds twice
 Key: SPARK-40129
 URL: https://issues.apache.org/jira/browse/SPARK-40129
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0, 3.2.0, 3.4.0
Reporter: Robert Joseph Evans


This looks like it has been around for a long time, but I have reproduced it in 
3.2.0+

The example here is multiplying Decimal(38, 10) by another Decimal(38, 10), but 
I think it can be reproduced with other number combinations, and possibly with 
divide too.
{code:java}
Seq("9173594185998001607642838421.5479932913").toDF
  .selectExpr("CAST(value as DECIMAL(38,10)) as a")
  .selectExpr("a * CAST(-12 as DECIMAL(38,10))")
  .show(truncate=false)
{code}
This produces an answer in Spark of {{-110083130231976019291714061058.575920}}. 
But if I do the calculation with regular Java BigDecimal I get 
{{-110083130231976019291714061058.575919}}:
{code:java}
import java.math.BigDecimal;
import java.math.RoundingMode;

BigDecimal l = new BigDecimal("9173594185998001607642838421.5479932913");
BigDecimal r = new BigDecimal("-12.00");
BigDecimal prod = l.multiply(r);                                    // exact product, no rounding
BigDecimal rounded_prod = prod.setScale(6, RoundingMode.HALF_UP);   // single rounding step
{code}
Spark does essentially all of the same operations, but it uses Decimal to do it 
instead of Java's BigDecimal directly. Spark, by way of Decimal, sets a 
MathContext for the multiply operation that has a max precision of 38 and does 
half-up rounding. That means the result of the multiply operation in 
Spark is {{{}-110083130231976019291714061058.57591950{}}}, but for the Java 
BigDecimal code the result is 
{{{}-110083130231976019291714061058.575919495600{}}}. Then, in 
CheckOverflow for 3.2.0 and 3.3.0, or in the regular Multiply expression in 
3.4.0, setScale is called (as part of Decimal.setPrecision). At that point 
the already-rounded number is rounded yet again, resulting in what is arguably a 
wrong answer from Spark.

I have not fully tested this, but it looks like we could just remove the 
MathContext entirely in Decimal, or set it to UNLIMITED. All of the decimal 
operations appear to have their own overflow and rounding anyway. If we want 
to potentially reduce the total memory usage, we could also set the max 
precision to 39 and truncate (round down) the result in the math context 
instead. That would then let us round the result correctly in setPrecision 
afterwards.
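The double rounding can also be illustrated outside Spark with Python's decimal module (a sketch; the 38-digit context mirrors the MathContext described above):

{code}
from decimal import Decimal, Context, ROUND_HALF_UP

wide = Context(prec=60, rounding=ROUND_HALF_UP)         # wide enough to hold the exact product
spark_like = Context(prec=38, rounding=ROUND_HALF_UP)   # mirrors Decimal's MathContext(38, HALF_UP)

l = Decimal("9173594185998001607642838421.5479932913")
r = Decimal("-12.00")

exact = wide.multiply(l, r)                             # ...058.575919495600 (no rounding yet)
once = wide.quantize(exact, Decimal("1.000000"))        # ...058.575919 (rounded once)

first = spark_like.multiply(l, r)                       # ...058.57591950 (already rounded to 38 digits)
twice = wide.quantize(first, Decimal("1.000000"))       # ...058.575920 (rounded a second time)
print(once, twice, sep="\n")
{code}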






[jira] [Commented] (SPARK-38954) Implement sharing of cloud credentials among driver and executors

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581011#comment-17581011
 ] 

Apache Spark commented on SPARK-38954:
--

User 'parthchandra' has created a pull request for this issue:
https://github.com/apache/spark/pull/37558

> Implement sharing of cloud credentials among driver and executors
> -
>
> Key: SPARK-38954
> URL: https://issues.apache.org/jira/browse/SPARK-38954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Parth Chandra
>Priority: Major
>
> Currently Spark uses external implementations (e.g. hadoop-aws) to access 
> cloud services like S3. In order to access the actual service, these 
> implementations use credentials provider implementations that obtain 
> credentials to allow access to the cloud service.
> These credentials are typically session credentials, which means that they 
> expire after a fixed time. Sometimes, this expiry can be only an hour and for 
> a spark job that runs for many hours (or spark streaming job that runs 
> continuously), the credentials have to be renewed periodically.
> In many organizations, the process of getting credentials may be multi-step. The 
> organization has an identity provider service that provides authentication 
> for the user, while the cloud service provider provides authorization for the 
> roles the user has access to. Once the user is authenticated and her role 
> verified, the credentials are generated for a new session.
> In a large setup with hundreds of Spark jobs and thousands of executors, each 
> executor is then spending a lot of time getting credentials and this may put 
> unnecessary load on the backend authentication services.
> To alleviate this, we can use Spark's architecture to obtain the credentials 
> once in the driver and push the credentials to the executors. In addition, 
> the driver can check the expiry of the credentials and push updated 
> credentials to the executors. This is relatively easy to do since the rpc 
> mechanism to implement this is already in place and is used similarly for 
> Kerberos delegation tokens.
>   






[jira] [Assigned] (SPARK-38954) Implement sharing of cloud credentials among driver and executors

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38954:


Assignee: Apache Spark

> Implement sharing of cloud credentials among driver and executors
> -
>
> Key: SPARK-38954
> URL: https://issues.apache.org/jira/browse/SPARK-38954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Parth Chandra
>Assignee: Apache Spark
>Priority: Major
>
> Currently Spark uses external implementations (e.g. hadoop-aws) to access 
> cloud services like S3. In order to access the actual service, these 
> implementations use credentials provider implementations that obtain 
> credentials to allow access to the cloud service.
> These credentials are typically session credentials, which means that they 
> expire after a fixed time. Sometimes, this expiry can be only an hour and for 
> a spark job that runs for many hours (or spark streaming job that runs 
> continuously), the credentials have to be renewed periodically.
> In many organizations, the process of getting credentials may be multi-step. The 
> organization has an identity provider service that provides authentication 
> for the user, while the cloud service provider provides authorization for the 
> roles the user has access to. Once the user is authenticated and her role 
> verified, the credentials are generated for a new session.
> In a large setup with hundreds of Spark jobs and thousands of executors, each 
> executor is then spending a lot of time getting credentials and this may put 
> unnecessary load on the backend authentication services.
> To alleviate this, we can use Spark's architecture to obtain the credentials 
> once in the driver and push the credentials to the executors. In addition, 
> the driver can check the expiry of the credentials and push updated 
> credentials to the executors. This is relatively easy to do since the rpc 
> mechanism to implement this is already in place and is used similarly for 
> Kerberos delegation tokens.
>   






[jira] [Commented] (SPARK-38954) Implement sharing of cloud credentials among driver and executors

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581009#comment-17581009
 ] 

Apache Spark commented on SPARK-38954:
--

User 'parthchandra' has created a pull request for this issue:
https://github.com/apache/spark/pull/37558

> Implement sharing of cloud credentials among driver and executors
> -
>
> Key: SPARK-38954
> URL: https://issues.apache.org/jira/browse/SPARK-38954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Parth Chandra
>Priority: Major
>
> Currently Spark uses external implementations (e.g. hadoop-aws) to access 
> cloud services like S3. In order to access the actual service, these 
> implementations use credentials provider implementations that obtain 
> credentials to allow access to the cloud service.
> These credentials are typically session credentials, which means that they 
> expire after a fixed time. Sometimes, this expiry can be only an hour and for 
> a spark job that runs for many hours (or spark streaming job that runs 
> continuously), the credentials have to be renewed periodically.
> In many organizations, the process of getting credentials may be multi-step. The 
> organization has an identity provider service that provides authentication 
> for the user, while the cloud service provider provides authorization for the 
> roles the user has access to. Once the user is authenticated and her role 
> verified, the credentials are generated for a new session.
> In a large setup with hundreds of Spark jobs and thousands of executors, each 
> executor is then spending a lot of time getting credentials and this may put 
> unnecessary load on the backend authentication services.
> To alleviate this, we can use Spark's architecture to obtain the credentials 
> once in the driver and push the credentials to the executors. In addition, 
> the driver can check the expiry of the credentials and push updated 
> credentials to the executors. This is relatively easy to do since the rpc 
> mechanism to implement this is already in place and is used similarly for 
> Kerberos delegation tokens.
>   






[jira] [Assigned] (SPARK-38954) Implement sharing of cloud credentials among driver and executors

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38954:


Assignee: (was: Apache Spark)

> Implement sharing of cloud credentials among driver and executors
> -
>
> Key: SPARK-38954
> URL: https://issues.apache.org/jira/browse/SPARK-38954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Parth Chandra
>Priority: Major
>
> Currently Spark uses external implementations (e.g. hadoop-aws) to access 
> cloud services like S3. In order to access the actual service, these 
> implementations use credentials provider implementations that obtain 
> credentials to allow access to the cloud service.
> These credentials are typically session credentials, which means that they 
> expire after a fixed time. Sometimes, this expiry can be only an hour and for 
> a spark job that runs for many hours (or spark streaming job that runs 
> continuously), the credentials have to be renewed periodically.
> In many organizations, the process of getting credentials may be multi-step. The 
> organization has an identity provider service that provides authentication 
> for the user, while the cloud service provider provides authorization for the 
> roles the user has access to. Once the user is authenticated and her role 
> verified, the credentials are generated for a new session.
> In a large setup with hundreds of Spark jobs and thousands of executors, each 
> executor is then spending a lot of time getting credentials and this may put 
> unnecessary load on the backend authentication services.
> To alleviate this, we can use Spark's architecture to obtain the credentials 
> once in the driver and push the credentials to the executors. In addition, 
> the driver can check the expiry of the credentials and push updated 
> credentials to the executors. This is relatively easy to do since the rpc 
> mechanism to implement this is already in place and is used similarly for 
> Kerberos delegation tokens.
>   






[jira] [Updated] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Dennis Huo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated SPARK-40128:
---
Docs Text: Added support for keeping vectorized reads enabled for Parquet 
files using the DELTA_LENGTH_BYTE_ARRAY encoding as a standalone column 
encoding. Previously, the related DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY 
encodings were accepted as column encodings, but DELTA_LENGTH_BYTE_ARRAY would 
still be rejected as "unsupported".

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).






[jira] [Comment Edited] (SPARK-40063) pyspark.pandas .apply() changing rows ordering

2022-08-17 Thread Marcelo Rossini Castro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580971#comment-17580971
 ] 

Marcelo Rossini Castro edited comment on SPARK-40063 at 8/17/22 7:36 PM:
-

Normally I use the default, {{{}distributed-sequence{}}}, but I already tried 
{{sequence}} too and I get the same error.
So, I tried it again, combining with {{compute.ordered_head}} enabled.

This operation requires {{compute.ops_on_diff_frames}} to be enabled, which I 
think is worth mentioning.


was (Author: JIRAUSER294354):
Normally I use the default, {{{}distributed-sequence{}}}, but I already tried 
{{sequence}} too and I get the same error.
So, I tried it again, combining with {{compute.ordered_head}} enabled.

I'm having to use {{compute.ops_on_diff_frames}} enabled, I think it's worth 
mentioning.

> pyspark.pandas .apply() changing rows ordering
> --
>
> Key: SPARK-40063
> URL: https://issues.apache.org/jira/browse/SPARK-40063
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.3.0
> Environment: Databricks Runtime 11.1
>Reporter: Marcelo Rossini Castro
>Priority: Minor
>  Labels: Pandas, PySpark
>
> When using the apply function to apply a function to a DataFrame column, it 
> ends up scrambling the ordering of the column's rows.
> A command like this:
> {code:java}
> def example_func(df_col):
>   return df_col ** 2 
> df['col_to_apply_function'] = df.apply(lambda row: 
> example_func(row['col_to_apply_function']), axis=1) {code}
> A workaround is to assign the results to a new column instead of the same 
> one, but if the old column is dropped, the same error is produced.
> Setting one column as index also didn't work.






[jira] [Commented] (SPARK-40063) pyspark.pandas .apply() changing rows ordering

2022-08-17 Thread Marcelo Rossini Castro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580971#comment-17580971
 ] 

Marcelo Rossini Castro commented on SPARK-40063:


Normally I use the default, {{{}distributed-sequence{}}}, but I already tried 
{{sequence}} too and I get the same error.
So, I tried it again, combining with {{compute.ordered_head}} enabled.

I'm having to use {{compute.ops_on_diff_frames}} enabled, I think it's worth 
mentioning.
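For reference, the options mentioned above can be set like this (a sketch; these are the standard pandas-on-Spark option names):

{code}
import pyspark.pandas as ps

ps.set_option("compute.ops_on_diff_frames", True)
ps.set_option("compute.default_index_type", "distributed-sequence")  # or "sequence"
ps.set_option("compute.ordered_head", True)
{code}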

> pyspark.pandas .apply() changing rows ordering
> --
>
> Key: SPARK-40063
> URL: https://issues.apache.org/jira/browse/SPARK-40063
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.3.0
> Environment: Databricks Runtime 11.1
>Reporter: Marcelo Rossini Castro
>Priority: Minor
>  Labels: Pandas, PySpark
>
> When using the apply function to apply a function to a DataFrame column, it 
> ends up scrambling the ordering of the column's rows.
> A command like this:
> {code:java}
> def example_func(df_col):
>   return df_col ** 2 
> df['col_to_apply_function'] = df.apply(lambda row: 
> example_func(row['col_to_apply_function']), axis=1) {code}
> A workaround is to assign the results to a new column instead of the same 
> one, but if the old column is dropped, the same error is produced.
> Setting one column as index also didn't work.
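
As a hedged illustration of the workaround described above (assigning the result 
to a new column instead of overwriting the original), assuming a small 
pandas-on-Spark DataFrame; {{example_func}} and the column names are placeholders:

{code:python}
import pyspark.pandas as ps

ps.set_option("compute.ops_on_diff_frames", True)

df = ps.DataFrame({"col_to_apply_function": [3, 1, 2]})

def example_func(value):
    return value ** 2

# Workaround: write the result to a new column rather than replacing the old one.
df["squared"] = df.apply(lambda row: example_func(row["col_to_apply_function"]), axis=1)
print(df.sort_index().to_pandas())
{code}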



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38954) Implement sharing of cloud credentials among driver and executors

2022-08-17 Thread Parth Chandra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580929#comment-17580929
 ] 

Parth Chandra commented on SPARK-38954:
---

Sorry about the delay; I should have updated the JIRA. I ran into some testing 
issues, but those are now resolved. Getting it ready now.


> Implement sharing of cloud credentials among driver and executors
> -
>
> Key: SPARK-38954
> URL: https://issues.apache.org/jira/browse/SPARK-38954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Parth Chandra
>Priority: Major
>
> Currently Spark uses external implementations (e.g. hadoop-aws) to access 
> cloud services like S3. In order to access the actual service, these 
> implementations use credentials provider implementations that obtain 
> credentials to allow access to the cloud service.
> These credentials are typically session credentials, which means that they 
> expire after a fixed time. Sometimes, this expiry can be only an hour and for 
> a spark job that runs for many hours (or spark streaming job that runs 
> continuously), the credentials have to be renewed periodically.
> In many organizations, the process of getting credentials may be multi-step. The 
> organization has an identity provider service that provides authentication 
> for the user, while the cloud service provider provides authorization for the 
> roles the user has access to. Once the user is authenticated and her role 
> verified, the credentials are generated for a new session.
> In a large setup with hundreds of Spark jobs and thousands of executors, each 
> executor is then spending a lot of time getting credentials and this may put 
> unnecessary load on the backend authentication services.
> To alleviate this, we can use Spark's architecture to obtain the credentials 
> once in the driver and push the credentials to the executors. In addition, 
> the driver can check the expiry of the credentials and push updated 
> credentials to the executors. This is relatively easy to do since the rpc 
> mechanism to implement this is already in place and is used similarly for 
> Kerberos delegation tokens.
>   
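
Purely as a conceptual illustration of the refresh-and-push idea described above 
(this is not Spark's actual RPC or plugin API; every name below is hypothetical):

{code:python}
import threading
import time

class CredentialBroker:
    """Driver-side sketch: fetch session credentials once, refresh them before
    they expire, and hand each refreshed set to a push callback (in Spark the
    push would be the driver-to-executor RPC mentioned above)."""

    def __init__(self, fetch_credentials, push_to_executors, refresh_margin_s=300):
        self._fetch = fetch_credentials    # hypothetical call to the identity/cloud provider
        self._push = push_to_executors     # hypothetical distribution to executors
        self._margin = refresh_margin_s
        self._stop = threading.Event()

    def _loop(self):
        while not self._stop.is_set():
            creds = self._fetch()          # assumed to expose .expires_at (epoch seconds)
            self._push(creds)
            sleep_s = max(creds.expires_at - time.time() - self._margin, 1)
            self._stop.wait(sleep_s)

    def start(self):
        threading.Thread(target=self._loop, daemon=True).start()

    def stop(self):
        self._stop.set()
{code}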



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Dennis Huo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated SPARK-40128:
---
Description: 
Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).

Even though there apparently aren't many writers of the standalone 
DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
[https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6
 and could be more efficient for types of binary/string data that don't take 
good advantage of sharing common prefixes for incremental encoding.

The problem can be reproduced by trying to load one of the 
[https://github.com/apache/parquet-testing] files 
(delta_length_byte_array.parquet).

  was:
Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).

Even though there apparently aren't many writers of the standalone 
DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6
 and could be more efficient for types of binary/string data that don't take 
good advantage of sharing common prefixes for incremental encoding.

The problem and be reproduced by trying to load one of the 
[https://github.com/apache/parquet-testing] files 
(delta_length_byte_array.parquet).


> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580899#comment-17580899
 ] 

Apache Spark commented on SPARK-40128:
--

User 'sfc-gh-dhuo' has created a pull request for this issue:
https://github.com/apache/spark/pull/37557

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40128:


Assignee: (was: Apache Spark)

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580898#comment-17580898
 ] 

Apache Spark commented on SPARK-40128:
--

User 'sfc-gh-dhuo' has created a pull request for this issue:
https://github.com/apache/spark/pull/37557

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40128:


Assignee: Apache Spark

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Assignee: Apache Spark
>Priority: Major
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Dennis Huo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated SPARK-40128:
---
Attachment: delta_length_byte_array.parquet

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Dennis Huo (Jira)
Dennis Huo created SPARK-40128:
--

 Summary: Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone 
encoding in VectorizedColumnReader
 Key: SPARK-40128
 URL: https://issues.apache.org/jira/browse/SPARK-40128
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Dennis Huo


Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).

Even though there apparently aren't many writers of the standalone 
DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6
 and could be more efficient for types of binary/string data that don't take 
good advantage of sharing common prefixes for incremental encoding.

The problem can be reproduced by trying to load one of the 
[https://github.com/apache/parquet-testing] files 
(delta_length_byte_array.parquet).
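
A minimal reproduction sketch along those lines (assuming the 
delta_length_byte_array.parquet file from the parquet-testing repository has been 
downloaded locally; the path below is a placeholder):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-length-byte-array-repro").getOrCreate()

# Reading this file exercises the standalone DELTA_LENGTH_BYTE_ARRAY encoding
# in the vectorized Parquet reader.
df = spark.read.parquet("/tmp/delta_length_byte_array.parquet")
df.show(5, truncate=False)
{code}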



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40127) FaultToleranceTest should in test dir

2022-08-17 Thread Yang Jie (Jira)
Yang Jie created SPARK-40127:


 Summary: FaultToleranceTest should in test dir
 Key: SPARK-40127
 URL: https://issues.apache.org/jira/browse/SPARK-40127
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Yang Jie


FaultToleranceTest is in the core module's src dir and it is not tested using GA 
(GitHub Actions).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39799) DataSourceV2: View catalog interface

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580873#comment-17580873
 ] 

Apache Spark commented on SPARK-39799:
--

User 'jzhuge' has created a pull request for this issue:
https://github.com/apache/spark/pull/37556

> DataSourceV2: View catalog interface
> 
>
> Key: SPARK-39799
> URL: https://issues.apache.org/jira/browse/SPARK-39799
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> The view catalog interfaces.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39799) DataSourceV2: View catalog interface

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39799:


Assignee: Apache Spark

> DataSourceV2: View catalog interface
> 
>
> Key: SPARK-39799
> URL: https://issues.apache.org/jira/browse/SPARK-39799
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Assignee: Apache Spark
>Priority: Major
>
> The view catalog interfaces.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39799) DataSourceV2: View catalog interface

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39799:


Assignee: (was: Apache Spark)

> DataSourceV2: View catalog interface
> 
>
> Key: SPARK-39799
> URL: https://issues.apache.org/jira/browse/SPARK-39799
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> The view catalog interfaces.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39799) DataSourceV2: View catalog interface

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580872#comment-17580872
 ] 

Apache Spark commented on SPARK-39799:
--

User 'jzhuge' has created a pull request for this issue:
https://github.com/apache/spark/pull/37556

> DataSourceV2: View catalog interface
> 
>
> Key: SPARK-39799
> URL: https://issues.apache.org/jira/browse/SPARK-39799
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> The view catalog interfaces.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40126) Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical vulnerability

2022-08-17 Thread Jason Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Tan updated SPARK-40126:
--
Description: 
Dear Spark Team,

Whilst security scanning the docker image docker.io/apache/spark:v3.3.0, I 
discovered the following vulnerability/scan results within the image:

      Type:            VULNERABILITY

      Name:            DSA-5169-1

      CVSS Score v3:   9.8

      Severity:        critical

The advice from [https://www.debian.org/security/2022/dsa-5169] suggests 
upgrading openssl to 1.1.1n-0+deb11u3.

Steps to reproduce:
Install trivy [https://aquasecurity.github.io/trivy/v0.18.3/installation/]
trivy image docker.io/apache/spark:v3.3.0

  was:
Dear Spark Team,

Whilst security scanning the docker image: docker.io/apache/spark:v3.3.0, I 
discovered the following vulnerability/scan results within the image :



      Type:            VULNERABILITY

      Name:            DSA-5169-1

      CVSS Score v3:   9.8

      Severity:        critical

The advice from [https://www.debian.org/security/2022/dsa-5169] suggests to 
upgrade the version of openssl to 1.1.1n-0+deb11u3


> Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical 
> vulnerability
> 
>
> Key: SPARK-40126
> URL: https://issues.apache.org/jira/browse/SPARK-40126
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Jason Tan
>Priority: Major
>
> Dear Spark Team,
> Whilst security scanning the docker image: docker.io/apache/spark:v3.3.0, I 
> discovered the following vulnerability/scan results within the image :
>       Type:            VULNERABILITY
>       Name:            DSA-5169-1
>       CVSS Score v3:   9.8
>       Severity:        critical
> The advice from [https://www.debian.org/security/2022/dsa-5169] suggests to 
> upgrade the version of openssl to 1.1.1n-0+deb11u3
> Steps to reproduce:
> Install trivy [https://aquasecurity.github.io/trivy/v0.18.3/installation/]
> trivy image docker.io/apache/spark:v3.3.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40126) Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical vulnerability

2022-08-17 Thread Jason Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Tan updated SPARK-40126:
--
Summary: Security scanning spark v3.3.0 docker image results in DSA-5169-1 
critical vulnerability  (was: Security scanning spark v3.3.0 results in 
DSA-5169-1 critical vulnerability)

> Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical 
> vulnerability
> 
>
> Key: SPARK-40126
> URL: https://issues.apache.org/jira/browse/SPARK-40126
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Jason Tan
>Priority: Major
>
> Dear Spark Team,
> Whilst security scanning the docker image: docker.io/apache/spark:v3.3.0, I 
> discovered the following vulnerability/scan results within the image :
>       Type:            VULNERABILITY
>       Name:            DSA-5169-1
>       CVSS Score v3:   9.8
>       Severity:        critical
> The advice from [https://www.debian.org/security/2022/dsa-5169] suggests to 
> upgrade the version of openssl to 1.1.1n-0+deb11u3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40126) Security scanning spark v3.3.0 results in DSA-5169-1 critical vulnerability

2022-08-17 Thread Jason Tan (Jira)
Jason Tan created SPARK-40126:
-

 Summary: Security scanning spark v3.3.0 results in DSA-5169-1 
critical vulnerability
 Key: SPARK-40126
 URL: https://issues.apache.org/jira/browse/SPARK-40126
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 3.3.0
Reporter: Jason Tan


Dear Spark Team,

Whilst security scanning the docker image docker.io/apache/spark:v3.3.0, I 
discovered the following vulnerability/scan results within the image:



      Type:            VULNERABILITY

      Name:            DSA-5169-1

      CVSS Score v3:   9.8

      Severity:        critical

The advice from [https://www.debian.org/security/2022/dsa-5169] suggests 
upgrading openssl to 1.1.1n-0+deb11u3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40119) Add reason for cancelJobGroup

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40119:


Assignee: (was: Apache Spark)

> Add reason for cancelJobGroup 
> --
>
> Key: SPARK-40119
> URL: https://issues.apache.org/jira/browse/SPARK-40119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Santosh Pingale
>Priority: Minor
>
> Currently, `cancelJob` supports passing the reason for failure. We use 
> `cancelJobGroup` in a few cases of async actions. It would be great to pass 
> the reason for cancellation to the job group as well.
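
For context, a hedged sketch of the existing PySpark calls and of what this issue 
proposes (the {{reason}} argument on {{cancelJobGroup}} is the proposal, not an 
existing parameter; the group name is a placeholder):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Existing API: tag jobs triggered from this thread with a group id.
sc.setJobGroup("nightly-etl", "nightly ETL jobs", interruptOnCancel=True)

# Existing API: cancel the whole group -- today there is no way to attach a reason.
sc.cancelJobGroup("nightly-etl")

# Proposed (hypothetical) form, analogous to cancelJob(jobId, reason):
# sc.cancelJobGroup("nightly-etl", reason="upstream data was late")
{code}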



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40119) Add reason for cancelJobGroup

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40119:


Assignee: Apache Spark

> Add reason for cancelJobGroup 
> --
>
> Key: SPARK-40119
> URL: https://issues.apache.org/jira/browse/SPARK-40119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Santosh Pingale
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, `cancelJob` supports passing the reason for failure. We use 
> `cancelJobGroup` in a few cases of async actions. It would be great to pass 
> the reason for cancellation to the job group as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40119) Add reason for cancelJobGroup

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580835#comment-17580835
 ] 

Apache Spark commented on SPARK-40119:
--

User 'santosh-d3vpl3x' has created a pull request for this issue:
https://github.com/apache/spark/pull/37555

> Add reason for cancelJobGroup 
> --
>
> Key: SPARK-40119
> URL: https://issues.apache.org/jira/browse/SPARK-40119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Santosh Pingale
>Priority: Minor
>
> Currently, `cancelJob` supports passing the reason for failure. We use 
> `cancelJobGroup` in a few cases of async actions. It would be great to pass 
> the reason for cancellation to the job group as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39623) partitionng by datestamp leads to wrong query on backend?

2022-08-17 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco resolved SPARK-39623.

Resolution: Not A Problem

> partitionng by datestamp leads to wrong query on backend?
> -
>
> Key: SPARK-39623
> URL: https://issues.apache.org/jira/browse/SPARK-39623
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dmitry
>Priority: Major
>
> Hello,
> I am new to Apache Spark, so please bear with me. I would like to report what 
> seems to me to be a bug, but maybe I am just not understanding something.
> My goal is to run data analysis on a Spark cluster. Data is stored in a 
> PostgreSQL DB. Tables contain timestamped entries (timestamp with time 
> zone).
> The code look like:
>  {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession \
> .builder \
> .appName("foo") \
> .config("spark.jars", "/opt/postgresql-42.4.0.jar") \
> .getOrCreate()
> df = spark.read \
>  .format("jdbc") \
>  .option("url", "jdbc:postgresql://example.org:5432/postgres") \
>  .option("dbtable", "billing") \
>  .option("user", "user") \
>  .option("driver", "org.postgresql.Driver") \
>  .option("numPartitions", "4") \
>  .option("partitionColumn", "datestamp") \
>  .option("lowerBound", "2022-01-01 00:00:00") \
>  .option("upperBound", "2022-06-26 23:59:59") \
>  .option("fetchsize", 100) \
>  .load()
> t0 = time.time()
> print("Number of entries is => ", df.count(), " Time to execute ", 
> time.time()-t0)
> ...
> {code}
> datestamp is timestamp with time zone. 
> I see this query on DB backend:
> {code:java}
> SELECT 1 FROM billinginfo  WHERE "datestamp" < '2022-01-02 11:59:59.9375' or 
> "datestamp" is null
> {code}
> The table is huge and entries go way back before 
> 2022-01-02 11:59:59. So what ends up happening is that all workers but one 
> complete, and the one remaining continues to process that query, which, to me, 
> looks like it wants to get all the data before 2022-01-02 11:59:59. Which is 
> not what I intended. 
> I remedied this by changing to:
> {code:python}
>  .option("dbtable", "(select * from billinginfo where datestamp > 
> '2022-01-01 00:00:00') as foo") \
> {code}
> And that seems to have solved the issue. But this seems kludgy. Am I doing 
> something wrong, or is there a bug in the way partitioning queries are 
> generated?
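
A hedged sketch of that workaround, pushing the lower bound into the {{dbtable}} 
subquery so the first partition cannot scan rows older than the intended window 
(connection details, table, and column names are taken from the report above):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("foo")
         .config("spark.jars", "/opt/postgresql-42.4.0.jar")
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://example.org:5432/postgres")
      .option("dbtable",
              "(select * from billing where datestamp >= '2022-01-01 00:00:00') as t")
      .option("user", "user")
      .option("driver", "org.postgresql.Driver")
      .option("numPartitions", "4")
      .option("partitionColumn", "datestamp")
      .option("lowerBound", "2022-01-01 00:00:00")
      .option("upperBound", "2022-06-26 23:59:59")
      .option("fetchsize", 100)
      .load())
{code}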



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40114) Arrow 9.0.0 support with SparkR

2022-08-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40114.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37553
[https://github.com/apache/spark/pull/37553]

> Arrow 9.0.0 support with SparkR
> ---
>
> Key: SPARK-40114
> URL: https://issues.apache.org/jira/browse/SPARK-40114
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> {code}
> == Failed 
> ==
> -- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization 
> -
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:103:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:133:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>   1. testthat::expect_true(all(collect(ret) == rdf))
>at test_sparkSQL_arrow.R:143:2
>   5. SparkR::collect(ret)
>   6. SparkR (local) .local(x, ...)
>  10. SparkR:::readRaw(conn)
>  11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization 
> -
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:184:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:217:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>   1. testthat::expect_true(all(collect(ret) == rdf))
>at test_sparkSQL_arrow.R:229:2
>   5. SparkR::collect(ret)
>   6. SparkR (local) .local(x, ...)
>  10. SparkR:::readRaw(conn)
>  11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow 
> optimiz
> `count(...)` threw an error with unexpected message.
> Expected match: "expected IntegerType, IntegerType, got IntegerType, 
> StringType"
> Actual message: "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): 
> org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced 
> errors: The tzdb package is not installed. Timezones will not be available to 
> Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : 
> write_arrow has been removed\nCalls:  -> writeRaw -> writeInt -> 
> writeBin -> \nExecution halted\n\r\n\tat 
> org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat
>  
> org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat
>  
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat
>  
> org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat
>  
> org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat
>  
> org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat
>  scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat 
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat 
> 

[jira] [Assigned] (SPARK-40114) Arrow 9.0.0 support with SparkR

2022-08-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40114:
-

Assignee: Hyukjin Kwon

> Arrow 9.0.0 support with SparkR
> ---
>
> Key: SPARK-40114
> URL: https://issues.apache.org/jira/browse/SPARK-40114
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {code}
> == Failed 
> ==
> -- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization 
> -
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:103:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:133:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>   1. testthat::expect_true(all(collect(ret) == rdf))
>at test_sparkSQL_arrow.R:143:2
>   5. SparkR::collect(ret)
>   6. SparkR (local) .local(x, ...)
>  10. SparkR:::readRaw(conn)
>  11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization 
> -
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:184:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:217:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>   1. testthat::expect_true(all(collect(ret) == rdf))
>at test_sparkSQL_arrow.R:229:2
>   5. SparkR::collect(ret)
>   6. SparkR (local) .local(x, ...)
>  10. SparkR:::readRaw(conn)
>  11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow 
> optimiz
> `count(...)` threw an error with unexpected message.
> Expected match: "expected IntegerType, IntegerType, got IntegerType, 
> StringType"
> Actual message: "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): 
> org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced 
> errors: The tzdb package is not installed. Timezones will not be available to 
> Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : 
> write_arrow has been removed\nCalls:  -> writeRaw -> writeInt -> 
> writeBin -> \nExecution halted\n\r\n\tat 
> org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat
>  
> org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat
>  
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat
>  
> org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat
>  
> org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat
>  
> org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat
>  scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat 
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.hashAgg_doAggregateWithoutKey_0$(Unknown
>  

[jira] [Commented] (SPARK-40125) Add separate infra image for lint job

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580806#comment-17580806
 ] 

Apache Spark commented on SPARK-40125:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37550

> Add separate infra image for lint job
> -
>
> Key: SPARK-40125
> URL: https://issues.apache.org/jira/browse/SPARK-40125
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> To avoid issues like [#37243 
> (comment)|https://github.com/apache/spark/pull/37243#issuecomment-1191422150], 
> we had some initial discussion; we'd better move to a separate infra image for 
> the lint job, to keep the lint deps static and upgrade them manually.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40125) Add separate infra image for lint job

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40125:


Assignee: (was: Apache Spark)

> Add separate infra image for lint job
> -
>
> Key: SPARK-40125
> URL: https://issues.apache.org/jira/browse/SPARK-40125
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> To avoid issues like [#37243 
> (comment)|https://github.com/apache/spark/pull/37243#issuecomment-1191422150], 
> we had some initial discussion; we'd better move to a separate infra image for 
> the lint job, to keep the lint deps static and upgrade them manually.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40125) Add separate infra image for lint job

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580805#comment-17580805
 ] 

Apache Spark commented on SPARK-40125:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37550

> Add separate infra image for lint job
> -
>
> Key: SPARK-40125
> URL: https://issues.apache.org/jira/browse/SPARK-40125
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> To avoid issues like [#37243 
> (comment)|https://github.com/apache/spark/pull/37243#issuecomment-1191422150], 
> we had some initial discussion; we'd better move to a separate infra image for 
> the lint job, to keep the lint deps static and upgrade them manually.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40125) Add separate infra image for lint job

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40125:


Assignee: Apache Spark

> Add separate infra image for lint job
> -
>
> Key: SPARK-40125
> URL: https://issues.apache.org/jira/browse/SPARK-40125
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> To avoid issues like [#37243 
> (comment)|https://github.com/apache/spark/pull/37243#issuecomment-1191422150], 
> we had some initial discussion; we'd better move to a separate infra image for 
> the lint job, to keep the lint deps static and upgrade them manually.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40125) Add separate infra image for lint job

2022-08-17 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-40125:
---

 Summary: Add separate infra image for lint job
 Key: SPARK-40125
 URL: https://issues.apache.org/jira/browse/SPARK-40125
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: Yikun Jiang


To avoid issues like [#37243 
(comment)|https://github.com/apache/spark/pull/37243#issuecomment-1191422150], 
we had some initial discussion; we'd better move to a separate infra image for the 
lint job, to keep the lint deps static and upgrade them manually.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580791#comment-17580791
 ] 

Apache Spark commented on SPARK-40124:
--

User 'mskapilks' has created a pull request for this issue:
https://github.com/apache/spark/pull/37554

> Update TPCDS v1.4 q32 for Plan Stability tests
> --
>
> Key: SPARK-40124
> URL: https://issues.apache.org/jira/browse/SPARK-40124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kapil Singh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40124:


Assignee: Apache Spark

> Update TPCDS v1.4 q32 for Plan Stability tests
> --
>
> Key: SPARK-40124
> URL: https://issues.apache.org/jira/browse/SPARK-40124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kapil Singh
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40124:


Assignee: (was: Apache Spark)

> Update TPCDS v1.4 q32 for Plan Stability tests
> --
>
> Key: SPARK-40124
> URL: https://issues.apache.org/jira/browse/SPARK-40124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kapil Singh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-17 Thread Kapil Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Singh updated SPARK-40124:

Summary: Update TPCDS v1.4 q32 for Plan Stability tests  (was: Update TPCDS 
v1.4 query32)

> Update TPCDS v1.4 q32 for Plan Stability tests
> --
>
> Key: SPARK-40124
> URL: https://issues.apache.org/jira/browse/SPARK-40124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kapil Singh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40124) Update TPCDS v1.4 query32

2022-08-17 Thread Kapil Singh (Jira)
Kapil Singh created SPARK-40124:
---

 Summary: Update TPCDS v1.4 query32
 Key: SPARK-40124
 URL: https://issues.apache.org/jira/browse/SPARK-40124
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kapil Singh






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40123) Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar

2022-08-17 Thread manohar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

manohar updated SPARK-40123:

 Flags: Patch
Labels: security-issue  (was: )

> Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar
> 
>
> Key: SPARK-40123
> URL: https://issues.apache.org/jira/browse/SPARK-40123
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.3.0
>Reporter: manohar
>Priority: Major
>  Labels: security-issue
> Fix For: 3.3.1
>
>
> Hello Team,
> We are facing this vulnerability on our Spark 3.3.3 installation. Can we please 
> upgrade the version of mesos in the installation to address this 
> vulnerability?
> ||Package||cve||cvss||severity||pkg_version||fixed_in_pkg||pkg_path||
> |1|org.apache.mesos_mesos|CVE-2018-11793|7|high|1.4.0|1.7.1, 1.6.2, 1.5.2, 
> 1.4.3|/opt/domino/spark/python/build/lib/pyspark/jars/mesos-1.4.0-shaded-protobuf.jar|
> In our source code I found that the dependent version of the mesos jar is 1.4.3: 
> user@ThinkPad-E14-02:~/Downloads/spark-master$ grep -ir mesos- * 
> core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala: * 
> TaskSchedulerImpl. We assume a Mesos-like model where the application gets 
> resource offers as
> *dev/deps/spark-deps-hadoop-2-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
> dev/deps/spark-deps-hadoop-3-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
> *



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40123) Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar

2022-08-17 Thread manohar (Jira)
manohar created SPARK-40123:
---

 Summary: Security Vulnerability CVE-2018-11793 due to 
mesos-1.4.3-shaded-protobuf.jar
 Key: SPARK-40123
 URL: https://issues.apache.org/jira/browse/SPARK-40123
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 3.3.0
Reporter: manohar
 Fix For: 3.3.1


Hello Team,
We are facing this vulnerability on our Spark 3.3.3 installation. Can we please 
upgrade the version of mesos in the installation to address this vulnerability?


||Package||cve||cvss||severity||pkg_version||fixed_in_pkg||pkg_path||
|1|org.apache.mesos_mesos|CVE-2018-11793|7|high|1.4.0|1.7.1, 1.6.2, 1.5.2, 
1.4.3|/opt/domino/spark/python/build/lib/pyspark/jars/mesos-1.4.0-shaded-protobuf.jar|


In our source code I found that the dependent version of the mesos jar is 1.4.3: 

user@ThinkPad-E14-02:~/Downloads/spark-master$ grep -ir mesos- * 
core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala: * 
TaskSchedulerImpl. We assume a Mesos-like model where the application gets 
resource offers as
*dev/deps/spark-deps-hadoop-2-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
dev/deps/spark-deps-hadoop-3-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40087) Support multiple Column drop in R

2022-08-17 Thread Santosh Pingale (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santosh Pingale updated SPARK-40087:

Description: 
This is a follow-up on SPARK-39895. That PR originally attempted to adjust the R 
implementation as well to match signatures, but that part was removed and we 
focused only on getting the Python implementation to behave correctly.

*{{Change supports following operations:}}*

{{df <- select(read.json(jsonPath), "name", "age")}}

{{df$age2 <- df$age}}

{{df1 <- drop(df, df$age, df$name)}}
{{expect_equal(columns(df1), c("age2"))}}

{{df1 <- drop(df, df$age, column("random"))}}
{{expect_equal(columns(df1), c("name", "age2"))}}

{{df1 <- drop(df, df$age, df$name)}}
{{expect_equal(columns(df1), c("age2"))}}

 

 

 

  was:
This is a followup on SPARK-39895. The PR previously attempted to adjust 
implementation for R as well to match signatures but that part was removed and 
we only focused on getting python implementation to behave correctly.

*{{Change supports following operations:}}*

{{df <- select(read.json(jsonPath), "name", "age")}}

{{df$age2 <- df$age}}

{{df1 <- drop(df, df$age, df$name)}}
{{expect_equal(columns(df1), c("age2"))}}

{{df1 <- drop(df, list(df$age, column("random")))}}
{{expect_equal(columns(df1), c("name", "age2"))}}

{{df1 <- drop(df, list(df$age, df$name))}}
{{expect_equal(columns(df1), c("age2"))}}

 

 

 


> Support multiple Column drop in R
> -
>
> Key: SPARK-40087
> URL: https://issues.apache.org/jira/browse/SPARK-40087
> Project: Spark
>  Issue Type: New Feature
>  Components: R
>Affects Versions: 3.3.0
>Reporter: Santosh Pingale
>Priority: Minor
>
> This is a follow-up on SPARK-39895. That PR originally attempted to adjust the 
> R implementation as well to match signatures, but that part was removed and we 
> focused only on getting the Python implementation to behave correctly.
> *{{Change supports following operations:}}*
> {{df <- select(read.json(jsonPath), "name", "age")}}
> {{df$age2 <- df$age}}
> {{df1 <- drop(df, df$age, df$name)}}
> {{expect_equal(columns(df1), c("age2"))}}
> {{df1 <- drop(df, df$age, column("random"))}}
> {{expect_equal(columns(df1), c("name", "age2"))}}
> {{df1 <- drop(df, df$age, df$name)}}
> {{expect_equal(columns(df1), c("age2"))}}
>  
>  
>  
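
For comparison, a hedged sketch of the equivalent on the Python side (which 
SPARK-39895 addressed, per the description above); the data and column names are 
placeholders:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2)], ["name", "age"])
df = df.withColumn("age2", df["age"])

# Drop several columns passed as Column objects in a single call.
df1 = df.drop(df["age"], df["name"])
print(df1.columns)  # ['age2']
{code}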



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-17 Thread Maksim Grinman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580762#comment-17580762
 ] 

Maksim Grinman edited comment on SPARK-38330 at 8/17/22 12:28 PM:
--

Thanks for the response. I did try compiling Spark myself from the GitHub repo 
at the v3.3.0 tagged commit (with the hadoop-aws jar added in the pom) and 
generating the Python wheel to see what's in it, and none of the jars have cos 
in the name: 
{code:java}
[110] → ls -l python/build/bdist.linux-x86_64/wheel/pyspark/jars
total 921296
-rw-r--r--  1 maks  staff   227K Aug 12 13:21 JLargeArrays-1.5.jar
-rw-r--r--  1 maks  staff   1.1M Aug 12 13:21 JTransforms-3.1.jar
-rw-r--r--  1 maks  staff   418K Aug 12 13:18 RoaringBitmap-0.9.25.jar
-rw-r--r--  1 maks  staff    68K Oct  4  2021 activation-1.1.1.jar
-rw-r--r--  1 maks  staff   179K Aug 12 13:25 aircompressor-0.21.jar
-rw-r--r--  1 maks  staff   1.1M Aug 12 13:21 algebra_2.12-2.0.1.jar
-rw-r--r--  1 maks  staff    19K Aug 12 13:25 annotations-17.0.0.jar
-rw-r--r--  1 maks  staff   330K Aug 12 13:23 antlr4-runtime-4.8.jar
-rw-r--r--  1 maks  staff    26K Aug 12 13:19 aopalliance-repackaged-2.6.1.jar
-rw-r--r--  1 maks  staff    76K Aug 12 13:28 arpack-2.2.1.jar
-rw-r--r--  1 maks  staff   1.1M Oct  4  2021 arpack_combined_all-0.1.jar
-rw-r--r--  1 maks  staff   107K Aug 12 13:23 arrow-format-7.0.0.jar
-rw-r--r--  1 maks  staff   106K Aug 12 13:23 arrow-memory-core-7.0.0.jar
-rw-r--r--  1 maks  staff    38K Aug 12 13:23 arrow-memory-netty-7.0.0.jar
-rw-r--r--  1 maks  staff   1.8M Aug 12 13:23 arrow-vector-7.0.0.jar
-rw-r--r--  1 maks  staff    20K Aug 12 13:19 audience-annotations-0.5.0.jar
-rw-r--r--  1 maks  staff   580K Aug 12 13:19 avro-1.11.0.jar
-rw-r--r--  1 maks  staff   181K Aug 12 13:19 avro-ipc-1.11.0.jar
-rw-r--r--  1 maks  staff   184K Aug 12 13:19 avro-mapred-1.11.0.jar
-rw-r--r--  1 maks  staff   216M Aug 12 13:19 aws-java-sdk-bundle-1.11.1026.jar
-rw-r--r--  1 maks  staff   194K Aug 12 13:21 blas-2.2.1.jar
-rw-r--r--  1 maks  staff    73K Aug 12 13:21 breeze-macros_2.12-1.2.jar
-rw-r--r--  1 maks  staff    13M Aug 12 13:21 breeze_2.12-1.2.jar
-rw-r--r--  1 maks  staff   3.2M Aug 12 13:21 cats-kernel_2.12-2.1.1.jar
-rw-r--r--  1 maks  staff    57K Aug 12 13:18 chill-java-0.10.0.jar
-rw-r--r--  1 maks  staff   207K Aug 12 13:18 chill_2.12-0.10.0.jar
-rw-r--r--  1 maks  staff   346K Aug 12 13:18 commons-codec-1.15.jar
-rw-r--r--  1 maks  staff   575K Oct  4  2021 commons-collections-3.2.2.jar
-rw-r--r--  1 maks  staff   734K Aug 12 13:19 commons-collections4-4.4.jar
-rw-r--r--  1 maks  staff    70K Aug 12 13:23 commons-compiler-3.0.16.jar
-rw-r--r--  1 maks  staff   994K Aug 12 13:19 commons-compress-1.21.jar
-rw-r--r--  1 maks  staff   162K Aug 12 13:18 commons-crypto-1.1.0.jar
-rw-r--r--  1 maks  staff   319K Aug 12 13:18 commons-io-2.11.0.jar
-rw-r--r--  1 maks  staff   278K Jan 15  2021 commons-lang-2.6.jar
-rw-r--r--  1 maks  staff   574K Aug 12 13:18 commons-lang3-3.12.0.jar
-rw-r--r--  1 maks  staff    61K Oct  4  2021 commons-logging-1.1.3.jar
-rw-r--r--  1 maks  staff   2.1M Aug 12 13:19 commons-math3-3.6.1.jar
-rw-r--r--  1 maks  staff   211K Aug 12 13:18 commons-text-1.9.jar
-rw-r--r--  1 maks  staff    80K Aug 12 13:19 compress-lzf-1.1.jar
-rw-r--r--  1 maks  staff   161K Oct  4  2021 core-1.1.2.jar
-rw-r--r--  1 maks  staff   2.3M Aug 12 13:19 curator-client-2.13.0.jar
-rw-r--r--  1 maks  staff   197K Aug 12 13:19 curator-framework-2.13.0.jar
-rw-r--r--  1 maks  staff   277K Aug 12 13:19 curator-recipes-2.13.0.jar
-rw-r--r--  1 maks  staff    63K Aug 12 13:23 flatbuffers-java-1.12.0.jar
-rw-r--r--  1 maks  staff   235K Aug 12 13:18 gson-2.8.6.jar
-rw-r--r--  1 maks  staff   2.1M Oct  4  2021 guava-14.0.1.jar
-rw-r--r--  1 maks  staff   940K Aug 12 13:19 hadoop-aws-3.3.2.jar
-rw-r--r--  1 maks  staff    19M Aug 12 13:19 hadoop-client-api-3.3.2.jar
-rw-r--r--  1 maks  staff    29M Aug 12 13:19 hadoop-client-runtime-3.3.2.jar
-rw-r--r--  1 maks  staff   3.2M Aug 12 14:02 hadoop-shaded-guava-1.1.1.jar
-rw-r--r--  1 maks  staff    55K Aug 12 14:02 hadoop-yarn-server-web-proxy-3.3.2.jar
-rw-r--r--  1 maks  staff   231K Aug 12 13:25 hive-storage-api-2.7.2.jar
-rw-r--r--  1 maks  staff   196K Aug 12 13:19 hk2-api-2.6.1.jar
-rw-r--r--  1 maks  staff   199K Aug 12 13:19 hk2-locator-2.6.1.jar
-rw-r--r--  1 maks  staff   129K Aug 12 13:19 hk2-utils-2.6.1.jar
-rw-r--r--  1 maks  staff    27K Aug 12 13:28 istack-commons-runtime-3.0.8.jar
-rw-r--r--  1 maks  staff   1.3M Aug 12 13:19 ivy-2.5.0.jar
-rw-r--r--  1 maks  staff    74K Aug 12 13:18 jackson-annotations-2.13.3.jar
-rw-r--r--  1 maks  staff   366K Aug 12 13:18 jackson-core-2.13.3.jar
-rw-r--r--  1 maks  staff   1.5M Aug 12 13:18 jackson-databind-2.13.3.jar
-rw-r--r--  1 maks  staff   448K Aug 12 13:19 jackson-module-scala_2.12-2.13.3.jar
-rw-r--r--  1 maks  staff    24K Aug 12 13:19 jakarta.annotation-api-1.3.5.jar
-rw-r--r--  1 maks  staff    18K Aug 12 13:19 jakarta.inject-2.6.1.jar
-rw-r--r--  1 maks  staff    81K Aug 12 13:19 jakarta.servlet-api-4.0.3.jar
-rw-r--r--  1 maks  staff    90K Aug 12 13:19 jakarta.validation-api-2.0.2.jar
-rw-r--r--  1 

[jira] [Comment Edited] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-17 Thread Maksim Grinman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580762#comment-17580762
 ] 

Maksim Grinman edited comment on SPARK-38330 at 8/17/22 12:23 PM:
--

Thanks for the response. I did try compiling myself (with hadoop-aws jar 
included) and generating the python wheel in Python 3.3 to see what's in the 
python wheel and none of them have cos in the name: 
{code:java}
[110] → ls -l python/build/bdist.linux-x86_64/wheel/pyspark/jars
total 921296
-rw-r--r--  1 maks  staff   227K Aug 12 13:21 JLargeArrays-1.5.jar
-rw-r--r--  1 maks  staff   1.1M Aug 12 13:21 JTransforms-3.1.jar
-rw-r--r--  1 maks  staff   418K Aug 12 13:18 RoaringBitmap-0.9.25.jar
-rw-r--r--  1 maks  staff    68K Oct  4  2021 activation-1.1.1.jar
-rw-r--r--  1 maks  staff   179K Aug 12 13:25 aircompressor-0.21.jar
-rw-r--r--  1 maks  staff   1.1M Aug 12 13:21 algebra_2.12-2.0.1.jar
-rw-r--r--  1 maks  staff    19K Aug 12 13:25 annotations-17.0.0.jar
-rw-r--r--  1 maks  staff   330K Aug 12 13:23 antlr4-runtime-4.8.jar
-rw-r--r--  1 maks  staff    26K Aug 12 13:19 aopalliance-repackaged-2.6.1.jar
-rw-r--r--  1 maks  staff    76K Aug 12 13:28 arpack-2.2.1.jar
-rw-r--r--  1 maks  staff   1.1M Oct  4  2021 arpack_combined_all-0.1.jar
-rw-r--r--  1 maks  staff   107K Aug 12 13:23 arrow-format-7.0.0.jar
-rw-r--r--  1 maks  staff   106K Aug 12 13:23 arrow-memory-core-7.0.0.jar
-rw-r--r--  1 maks  staff    38K Aug 12 13:23 arrow-memory-netty-7.0.0.jar
-rw-r--r--  1 maks  staff   1.8M Aug 12 13:23 arrow-vector-7.0.0.jar
-rw-r--r--  1 maks  staff    20K Aug 12 13:19 audience-annotations-0.5.0.jar
-rw-r--r--  1 maks  staff   580K Aug 12 13:19 avro-1.11.0.jar
-rw-r--r--  1 maks  staff   181K Aug 12 13:19 avro-ipc-1.11.0.jar
-rw-r--r--  1 maks  staff   184K Aug 12 13:19 avro-mapred-1.11.0.jar
-rw-r--r--  1 maks  staff   216M Aug 12 13:19 aws-java-sdk-bundle-1.11.1026.jar
-rw-r--r--  1 maks  staff   194K Aug 12 13:21 blas-2.2.1.jar
-rw-r--r--  1 maks  staff    73K Aug 12 13:21 breeze-macros_2.12-1.2.jar
-rw-r--r--  1 maks  staff    13M Aug 12 13:21 breeze_2.12-1.2.jar
-rw-r--r--  1 maks  staff   3.2M Aug 12 13:21 cats-kernel_2.12-2.1.1.jar
-rw-r--r--  1 maks  staff    57K Aug 12 13:18 chill-java-0.10.0.jar
-rw-r--r--  1 maks  staff   207K Aug 12 13:18 chill_2.12-0.10.0.jar
-rw-r--r--  1 maks  staff   346K Aug 12 13:18 commons-codec-1.15.jar
-rw-r--r--  1 maks  staff   575K Oct  4  2021 commons-collections-3.2.2.jar
-rw-r--r--  1 maks  staff   734K Aug 12 13:19 commons-collections4-4.4.jar
-rw-r--r--  1 maks  staff    70K Aug 12 13:23 commons-compiler-3.0.16.jar
-rw-r--r--  1 maks  staff   994K Aug 12 13:19 commons-compress-1.21.jar
-rw-r--r--  1 maks  staff   162K Aug 12 13:18 commons-crypto-1.1.0.jar
-rw-r--r--  1 maks  staff   319K Aug 12 13:18 commons-io-2.11.0.jar
-rw-r--r--  1 maks  staff   278K Jan 15  2021 commons-lang-2.6.jar
-rw-r--r--  1 maks  staff   574K Aug 12 13:18 commons-lang3-3.12.0.jar
-rw-r--r--  1 maks  staff    61K Oct  4  2021 commons-logging-1.1.3.jar
-rw-r--r--  1 maks  staff   2.1M Aug 12 13:19 commons-math3-3.6.1.jar
-rw-r--r--  1 maks  staff   211K Aug 12 13:18 commons-text-1.9.jar
-rw-r--r--  1 maks  staff    80K Aug 12 13:19 compress-lzf-1.1.jar
-rw-r--r--  1 maks  staff   161K Oct  4  2021 core-1.1.2.jar
-rw-r--r--  1 maks  staff   2.3M Aug 12 13:19 curator-client-2.13.0.jar
-rw-r--r--  1 maks  staff   197K Aug 12 13:19 curator-framework-2.13.0.jar
-rw-r--r--  1 maks  staff   277K Aug 12 13:19 curator-recipes-2.13.0.jar
-rw-r--r--  1 maks  staff    63K Aug 12 13:23 flatbuffers-java-1.12.0.jar
-rw-r--r--  1 maks  staff   235K Aug 12 13:18 gson-2.8.6.jar
-rw-r--r--  1 maks  staff   2.1M Oct  4  2021 guava-14.0.1.jar
-rw-r--r--  1 maks  staff   940K Aug 12 13:19 hadoop-aws-3.3.2.jar
-rw-r--r--  1 maks  staff    19M Aug 12 13:19 hadoop-client-api-3.3.2.jar
-rw-r--r--  1 maks  staff    29M Aug 12 13:19 hadoop-client-runtime-3.3.2.jar
-rw-r--r--  1 maks  staff   3.2M Aug 12 14:02 hadoop-shaded-guava-1.1.1.jar
-rw-r--r--  1 maks  staff    55K Aug 12 14:02 hadoop-yarn-server-web-proxy-3.3.2.jar
-rw-r--r--  1 maks  staff   231K Aug 12 13:25 hive-storage-api-2.7.2.jar
-rw-r--r--  1 maks  staff   196K Aug 12 13:19 hk2-api-2.6.1.jar
-rw-r--r--  1 maks  staff   199K Aug 12 13:19 hk2-locator-2.6.1.jar
-rw-r--r--  1 maks  staff   129K Aug 12 13:19 hk2-utils-2.6.1.jar
-rw-r--r--  1 maks  staff    27K Aug 12 13:28 istack-commons-runtime-3.0.8.jar
-rw-r--r--  1 maks  staff   1.3M Aug 12 13:19 ivy-2.5.0.jar
-rw-r--r--  1 maks  staff    74K Aug 12 13:18 jackson-annotations-2.13.3.jar
-rw-r--r--  1 maks  staff   366K Aug 12 13:18 jackson-core-2.13.3.jar
-rw-r--r--  1 maks  staff   1.5M Aug 12 13:18 jackson-databind-2.13.3.jar
-rw-r--r--  1 maks  staff   448K Aug 12 13:19 jackson-module-scala_2.12-2.13.3.jar
-rw-r--r--  1 maks  staff    24K Aug 12 13:19 jakarta.annotation-api-1.3.5.jar
-rw-r--r--  1 maks  staff    18K Aug 12 13:19 jakarta.inject-2.6.1.jar
-rw-r--r--  1 maks  staff    81K Aug 12 13:19 jakarta.servlet-api-4.0.3.jar
-rw-r--r--  1 maks  staff    90K Aug 12 13:19 jakarta.validation-api-2.0.2.jar
-rw-r--r--  1 maks  staff   137K Aug 12 13:19 

[jira] [Comment Edited] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-17 Thread Maksim Grinman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580762#comment-17580762
 ] 

Maksim Grinman edited comment on SPARK-38330 at 8/17/22 12:22 PM:
--

Thanks for the response. I did try compiling myself and generating the python 
wheel in Python 3.3 to see what's in the python wheel and none of them have cos 
in the name: 
{code:java}
[110] → ls -l python/build/bdist.linux-x86_64/wheel/pyspark/jars
total 921296
-rw-r--r--  1 maks  staff   227K Aug 12 13:21 JLargeArrays-1.5.jar
-rw-r--r--  1 maks  staff   1.1M Aug 12 13:21 JTransforms-3.1.jar
-rw-r--r--  1 maks  staff   418K Aug 12 13:18 RoaringBitmap-0.9.25.jar
-rw-r--r--  1 maks  staff    68K Oct  4  2021 activation-1.1.1.jar
-rw-r--r--  1 maks  staff   179K Aug 12 13:25 aircompressor-0.21.jar
-rw-r--r--  1 maks  staff   1.1M Aug 12 13:21 algebra_2.12-2.0.1.jar
-rw-r--r--  1 maks  staff    19K Aug 12 13:25 annotations-17.0.0.jar
-rw-r--r--  1 maks  staff   330K Aug 12 13:23 antlr4-runtime-4.8.jar
-rw-r--r--  1 maks  staff    26K Aug 12 13:19 aopalliance-repackaged-2.6.1.jar
-rw-r--r--  1 maks  staff    76K Aug 12 13:28 arpack-2.2.1.jar
-rw-r--r--  1 maks  staff   1.1M Oct  4  2021 arpack_combined_all-0.1.jar
-rw-r--r--  1 maks  staff   107K Aug 12 13:23 arrow-format-7.0.0.jar
-rw-r--r--  1 maks  staff   106K Aug 12 13:23 arrow-memory-core-7.0.0.jar
-rw-r--r--  1 maks  staff    38K Aug 12 13:23 arrow-memory-netty-7.0.0.jar
-rw-r--r--  1 maks  staff   1.8M Aug 12 13:23 arrow-vector-7.0.0.jar
-rw-r--r--  1 maks  staff    20K Aug 12 13:19 audience-annotations-0.5.0.jar
-rw-r--r--  1 maks  staff   580K Aug 12 13:19 avro-1.11.0.jar
-rw-r--r--  1 maks  staff   181K Aug 12 13:19 avro-ipc-1.11.0.jar
-rw-r--r--  1 maks  staff   184K Aug 12 13:19 avro-mapred-1.11.0.jar
-rw-r--r--  1 maks  staff   216M Aug 12 13:19 aws-java-sdk-bundle-1.11.1026.jar
-rw-r--r--  1 maks  staff   194K Aug 12 13:21 blas-2.2.1.jar
-rw-r--r--  1 maks  staff    73K Aug 12 13:21 breeze-macros_2.12-1.2.jar
-rw-r--r--  1 maks  staff    13M Aug 12 13:21 breeze_2.12-1.2.jar
-rw-r--r--  1 maks  staff   3.2M Aug 12 13:21 cats-kernel_2.12-2.1.1.jar
-rw-r--r--  1 maks  staff    57K Aug 12 13:18 chill-java-0.10.0.jar
-rw-r--r--  1 maks  staff   207K Aug 12 13:18 chill_2.12-0.10.0.jar
-rw-r--r--  1 maks  staff   346K Aug 12 13:18 commons-codec-1.15.jar
-rw-r--r--  1 maks  staff   575K Oct  4  2021 commons-collections-3.2.2.jar
-rw-r--r--  1 maks  staff   734K Aug 12 13:19 commons-collections4-4.4.jar
-rw-r--r--  1 maks  staff    70K Aug 12 13:23 commons-compiler-3.0.16.jar
-rw-r--r--  1 maks  staff   994K Aug 12 13:19 commons-compress-1.21.jar
-rw-r--r--  1 maks  staff   162K Aug 12 13:18 commons-crypto-1.1.0.jar
-rw-r--r--  1 maks  staff   319K Aug 12 13:18 commons-io-2.11.0.jar
-rw-r--r--  1 maks  staff   278K Jan 15  2021 commons-lang-2.6.jar
-rw-r--r--  1 maks  staff   574K Aug 12 13:18 commons-lang3-3.12.0.jar
-rw-r--r--  1 maks  staff    61K Oct  4  2021 commons-logging-1.1.3.jar
-rw-r--r--  1 maks  staff   2.1M Aug 12 13:19 commons-math3-3.6.1.jar
-rw-r--r--  1 maks  staff   211K Aug 12 13:18 commons-text-1.9.jar
-rw-r--r--  1 maks  staff    80K Aug 12 13:19 compress-lzf-1.1.jar
-rw-r--r--  1 maks  staff   161K Oct  4  2021 core-1.1.2.jar
-rw-r--r--  1 maks  staff   2.3M Aug 12 13:19 curator-client-2.13.0.jar
-rw-r--r--  1 maks  staff   197K Aug 12 13:19 curator-framework-2.13.0.jar
-rw-r--r--  1 maks  staff   277K Aug 12 13:19 curator-recipes-2.13.0.jar
-rw-r--r--  1 maks  staff    63K Aug 12 13:23 flatbuffers-java-1.12.0.jar
-rw-r--r--  1 maks  staff   235K Aug 12 13:18 gson-2.8.6.jar
-rw-r--r--  1 maks  staff   2.1M Oct  4  2021 guava-14.0.1.jar
-rw-r--r--  1 maks  staff   940K Aug 12 13:19 hadoop-aws-3.3.2.jar
-rw-r--r--  1 maks  staff    19M Aug 12 13:19 hadoop-client-api-3.3.2.jar
-rw-r--r--  1 maks  staff    29M Aug 12 13:19 hadoop-client-runtime-3.3.2.jar
-rw-r--r--  1 maks  staff   3.2M Aug 12 14:02 hadoop-shaded-guava-1.1.1.jar
-rw-r--r--  1 maks  staff    55K Aug 12 14:02 hadoop-yarn-server-web-proxy-3.3.2.jar
-rw-r--r--  1 maks  staff   231K Aug 12 13:25 hive-storage-api-2.7.2.jar
-rw-r--r--  1 maks  staff   196K Aug 12 13:19 hk2-api-2.6.1.jar
-rw-r--r--  1 maks  staff   199K Aug 12 13:19 hk2-locator-2.6.1.jar
-rw-r--r--  1 maks  staff   129K Aug 12 13:19 hk2-utils-2.6.1.jar
-rw-r--r--  1 maks  staff    27K Aug 12 13:28 istack-commons-runtime-3.0.8.jar
-rw-r--r--  1 maks  staff   1.3M Aug 12 13:19 ivy-2.5.0.jar
-rw-r--r--  1 maks  staff    74K Aug 12 13:18 jackson-annotations-2.13.3.jar
-rw-r--r--  1 maks  staff   366K Aug 12 13:18 jackson-core-2.13.3.jar
-rw-r--r--  1 maks  staff   1.5M Aug 12 13:18 jackson-databind-2.13.3.jar
-rw-r--r--  1 maks  staff   448K Aug 12 13:19 jackson-module-scala_2.12-2.13.3.jar
-rw-r--r--  1 maks  staff    24K Aug 12 13:19 jakarta.annotation-api-1.3.5.jar
-rw-r--r--  1 maks  staff    18K Aug 12 13:19 jakarta.inject-2.6.1.jar
-rw-r--r--  1 maks  staff    81K Aug 12 13:19 jakarta.servlet-api-4.0.3.jar
-rw-r--r--  1 maks  staff    90K Aug 12 13:19 jakarta.validation-api-2.0.2.jar
-rw-r--r--  1 maks  staff   137K Aug 12 13:19 jakarta.ws.rs-api-2.1.6.jar
-rw-r--r--  1 maks  

[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-17 Thread Maksim Grinman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580762#comment-17580762
 ] 

Maksim Grinman commented on SPARK-38330:


Thanks for the response. I did try compiling myself and generating the python 
wheel in Python 3.3 to see what's in the python wheel and none of them have cos 
in the name:

```
 [110] → ls -l python/build/bdist.linux-x86_64/wheel/pyspark/jars
total 921296
-rw-r--r--  1 maks  staff   227K Aug 12 13:21 JLargeArrays-1.5.jar
-rw-r--r--  1 maks  staff   1.1M Aug 12 13:21 JTransforms-3.1.jar
-rw-r--r--  1 maks  staff   418K Aug 12 13:18 RoaringBitmap-0.9.25.jar
-rw-r--r--  1 maks  staff    68K Oct  4  2021 activation-1.1.1.jar
-rw-r--r--  1 maks  staff   179K Aug 12 13:25 aircompressor-0.21.jar
-rw-r--r--  1 maks  staff   1.1M Aug 12 13:21 algebra_2.12-2.0.1.jar
-rw-r--r--  1 maks  staff    19K Aug 12 13:25 annotations-17.0.0.jar
-rw-r--r--  1 maks  staff   330K Aug 12 13:23 antlr4-runtime-4.8.jar
-rw-r--r--  1 maks  staff    26K Aug 12 13:19 aopalliance-repackaged-2.6.1.jar
-rw-r--r--  1 maks  staff    76K Aug 12 13:28 arpack-2.2.1.jar
-rw-r--r--  1 maks  staff   1.1M Oct  4  2021 arpack_combined_all-0.1.jar
-rw-r--r--  1 maks  staff   107K Aug 12 13:23 arrow-format-7.0.0.jar
-rw-r--r--  1 maks  staff   106K Aug 12 13:23 arrow-memory-core-7.0.0.jar
-rw-r--r--  1 maks  staff    38K Aug 12 13:23 arrow-memory-netty-7.0.0.jar
-rw-r--r--  1 maks  staff   1.8M Aug 12 13:23 arrow-vector-7.0.0.jar
-rw-r--r--  1 maks  staff    20K Aug 12 13:19 audience-annotations-0.5.0.jar
-rw-r--r--  1 maks  staff   580K Aug 12 13:19 avro-1.11.0.jar
-rw-r--r--  1 maks  staff   181K Aug 12 13:19 avro-ipc-1.11.0.jar
-rw-r--r--  1 maks  staff   184K Aug 12 13:19 avro-mapred-1.11.0.jar
-rw-r--r--  1 maks  staff   216M Aug 12 13:19 aws-java-sdk-bundle-1.11.1026.jar
-rw-r--r--  1 maks  staff   194K Aug 12 13:21 blas-2.2.1.jar
-rw-r--r--  1 maks  staff    73K Aug 12 13:21 breeze-macros_2.12-1.2.jar
-rw-r--r--  1 maks  staff    13M Aug 12 13:21 breeze_2.12-1.2.jar
-rw-r--r--  1 maks  staff   3.2M Aug 12 13:21 cats-kernel_2.12-2.1.1.jar
-rw-r--r--  1 maks  staff    57K Aug 12 13:18 chill-java-0.10.0.jar
-rw-r--r--  1 maks  staff   207K Aug 12 13:18 chill_2.12-0.10.0.jar
-rw-r--r--  1 maks  staff   346K Aug 12 13:18 commons-codec-1.15.jar
-rw-r--r--  1 maks  staff   575K Oct  4  2021 commons-collections-3.2.2.jar
-rw-r--r--  1 maks  staff   734K Aug 12 13:19 commons-collections4-4.4.jar
-rw-r--r--  1 maks  staff    70K Aug 12 13:23 commons-compiler-3.0.16.jar
-rw-r--r--  1 maks  staff   994K Aug 12 13:19 commons-compress-1.21.jar
-rw-r--r--  1 maks  staff   162K Aug 12 13:18 commons-crypto-1.1.0.jar
-rw-r--r--  1 maks  staff   319K Aug 12 13:18 commons-io-2.11.0.jar
-rw-r--r--  1 maks  staff   278K Jan 15  2021 commons-lang-2.6.jar
-rw-r--r--  1 maks  staff   574K Aug 12 13:18 commons-lang3-3.12.0.jar
-rw-r--r--  1 maks  staff    61K Oct  4  2021 commons-logging-1.1.3.jar
-rw-r--r--  1 maks  staff   2.1M Aug 12 13:19 commons-math3-3.6.1.jar
-rw-r--r--  1 maks  staff   211K Aug 12 13:18 commons-text-1.9.jar
-rw-r--r--  1 maks  staff    80K Aug 12 13:19 compress-lzf-1.1.jar
-rw-r--r--  1 maks  staff   161K Oct  4  2021 core-1.1.2.jar
-rw-r--r--  1 maks  staff   2.3M Aug 12 13:19 curator-client-2.13.0.jar
-rw-r--r--  1 maks  staff   197K Aug 12 13:19 curator-framework-2.13.0.jar
-rw-r--r--  1 maks  staff   277K Aug 12 13:19 curator-recipes-2.13.0.jar
-rw-r--r--  1 maks  staff    63K Aug 12 13:23 flatbuffers-java-1.12.0.jar
-rw-r--r--  1 maks  staff   235K Aug 12 13:18 gson-2.8.6.jar
-rw-r--r--  1 maks  staff   2.1M Oct  4  2021 guava-14.0.1.jar
-rw-r--r--  1 maks  staff   940K Aug 12 13:19 hadoop-aws-3.3.2.jar
-rw-r--r--  1 maks  staff    19M Aug 12 13:19 hadoop-client-api-3.3.2.jar
-rw-r--r--  1 maks  staff    29M Aug 12 13:19 hadoop-client-runtime-3.3.2.jar
-rw-r--r--  1 maks  staff   3.2M Aug 12 14:02 hadoop-shaded-guava-1.1.1.jar
-rw-r--r--  1 maks  staff    55K Aug 12 14:02 
hadoop-yarn-server-web-proxy-3.3.2.jar
-rw-r--r--  1 maks  staff   231K Aug 12 13:25 hive-storage-api-2.7.2.jar
-rw-r--r--  1 maks  staff   196K Aug 12 13:19 hk2-api-2.6.1.jar
-rw-r--r--  1 maks  staff   199K Aug 12 13:19 hk2-locator-2.6.1.jar
-rw-r--r--  1 maks  staff   129K Aug 12 13:19 hk2-utils-2.6.1.jar
-rw-r--r--  1 maks  staff    27K Aug 12 13:28 istack-commons-runtime-3.0.8.jar
-rw-r--r--  1 maks  staff   1.3M Aug 12 13:19 ivy-2.5.0.jar
-rw-r--r--  1 maks  staff    74K Aug 12 13:18 jackson-annotations-2.13.3.jar
-rw-r--r--  1 maks  staff   366K Aug 12 13:18 jackson-core-2.13.3.jar
-rw-r--r--  1 maks  staff   1.5M Aug 12 13:18 jackson-databind-2.13.3.jar
-rw-r--r--  1 maks  staff   448K Aug 12 13:19 
jackson-module-scala_2.12-2.13.3.jar
-rw-r--r--  1 maks  staff    24K Aug 12 13:19 jakarta.annotation-api-1.3.5.jar
-rw-r--r--  1 maks  staff    18K Aug 12 13:19 jakarta.inject-2.6.1.jar
-rw-r--r--  1 maks  staff    81K Aug 12 

[jira] [Updated] (SPARK-40122) py4j-0.10.9.5 often produces "Connection reset by peer" in Spark 3.3.0

2022-08-17 Thread Ihor Bobak (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ihor Bobak updated SPARK-40122:
---
Description: 
Without any visible reason I am getting this error in my Jupyter notebook (see 
stacktrace below) with pyspark kernel. Often it occurs even if no Spark 
operations are made, e.g. when I am working with multiprocessing Pool for a 
local piece of code that should parallelize on the cores of the driver, with no 
spark transformations/actions done in that jupyter cell.


INFO:py4j.clientserver:Error while sending or receiving.
Traceback (most recent call last):
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
 line 503, in send_command
self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

INFO:py4j.clientserver:Closing down clientserver connection
INFO:root:Exception while sending command.
Traceback (most recent call last):
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
 line 503, in send_command
self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1038, in send_command
response = connection.send_command(command)
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
 line 506, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending
INFO:py4j.clientserver:Closing down clientserver connection



  was:
Without any visible reason I am getting this error in my Jupyter notebook (see 
stacktrace below) with pyspark kernel. Often it occurs even if no Spark 
operations are made, e.g. when I am working with multiprocessing Pool for a 
local piece of code that should parallelize on the cores of the driver, with no 
spark transformations/actions done in that jupyter cell.

{{INFO:py4j.clientserver:Error while sending or receiving.
Traceback (most recent call last):
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
 line 503, in send_command
self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

INFO:py4j.clientserver:Closing down clientserver connection
INFO:root:Exception while sending command.
Traceback (most recent call last):
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
 line 503, in send_command
self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1038, in send_command
response = connection.send_command(command)
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
 line 506, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending
INFO:py4j.clientserver:Closing down clientserver connection

}}


> py4j-0.10.9.5 often produces "Connection reset by peer"  in Spark 3.3.0
> ---
>
> Key: SPARK-40122
> URL: https://issues.apache.org/jira/browse/SPARK-40122
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Ihor Bobak
>Priority: Major
>
> Without any visible reason I am getting this error in my Jupyter notebook 
> (see stacktrace below) with pyspark kernel. Often it occurs even if no Spark 
> operations are made, e.g. when I am working with multiprocessing Pool for a 
> local piece of code that should parallelize on the cores of the driver, with 
> no spark transformations/actions done in that jupyter cell.
> INFO:py4j.clientserver:Error while sending or receiving.
> Traceback (most recent call last):
>   File 
> "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
>  line 503, in send_command
> self.socket.sendall(command.encode("utf-8"))
> ConnectionResetError: [Errno 104] Connection reset by peer
> INFO:py4j.clientserver:Closing down clientserver connection
> INFO:root:Exception while sending command.
> Traceback (most recent call last):
>   File 
> "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
>  line 503, in send_command
> self.socket.sendall(command.encode("utf-8"))
> ConnectionResetError: [Errno 104] 

[jira] [Created] (SPARK-40122) py4j-0.10.9.5 often produces "Connection reset by peer" in Spark 3.3.0

2022-08-17 Thread Ihor Bobak (Jira)
Ihor Bobak created SPARK-40122:
--

 Summary: py4j-0.10.9.5 often produces "Connection reset by peer"  
in Spark 3.3.0
 Key: SPARK-40122
 URL: https://issues.apache.org/jira/browse/SPARK-40122
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Ihor Bobak


Without any visible reason I am getting this error in my Jupyter notebook (see 
stacktrace below) with pyspark kernel. Often it occurs even if no Spark 
operations are made, e.g. when I am working with multiprocessing Pool for a 
local piece of code that should parallelize on the cores of the driver, with no 
spark transformations/actions done in that jupyter cell.

{{INFO:py4j.clientserver:Error while sending or receiving.
Traceback (most recent call last):
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
 line 503, in send_command
self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

INFO:py4j.clientserver:Closing down clientserver connection
INFO:root:Exception while sending command.
Traceback (most recent call last):
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
 line 503, in send_command
self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1038, in send_command
response = connection.send_command(command)
  File 
"/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
 line 506, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending
INFO:py4j.clientserver:Closing down clientserver connection

}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40114) Arrow 9.0.0 support with SparkR

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580731#comment-17580731
 ] 

Apache Spark commented on SPARK-40114:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37553

> Arrow 9.0.0 support with SparkR
> ---
>
> Key: SPARK-40114
> URL: https://issues.apache.org/jira/browse/SPARK-40114
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> == Failed 
> ==
> -- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization 
> -
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:103:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:133:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>   1. testthat::expect_true(all(collect(ret) == rdf))
>at test_sparkSQL_arrow.R:143:2
>   5. SparkR::collect(ret)
>   6. SparkR (local) .local(x, ...)
>  10. SparkR:::readRaw(conn)
>  11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization 
> -
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:184:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:217:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>   1. testthat::expect_true(all(collect(ret) == rdf))
>at test_sparkSQL_arrow.R:229:2
>   5. SparkR::collect(ret)
>   6. SparkR (local) .local(x, ...)
>  10. SparkR:::readRaw(conn)
>  11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow 
> optimiz
> `count(...)` threw an error with unexpected message.
> Expected match: "expected IntegerType, IntegerType, got IntegerType, 
> StringType"
> Actual message: "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): 
> org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced 
> errors: The tzdb package is not installed. Timezones will not be available to 
> Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : 
> write_arrow has been removed\nCalls:  -> writeRaw -> writeInt -> 
> writeBin -> \nExecution halted\n\r\n\tat 
> org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat
>  
> org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat
>  
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat
>  
> org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat
>  
> org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat
>  
> org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat
>  scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat 
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat 
> 

[jira] [Assigned] (SPARK-40114) Arrow 9.0.0 support with SparkR

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40114:


Assignee: (was: Apache Spark)

> Arrow 9.0.0 support with SparkR
> ---
>
> Key: SPARK-40114
> URL: https://issues.apache.org/jira/browse/SPARK-40114
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> == Failed 
> ==
> -- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization 
> -
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:103:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:133:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>   1. testthat::expect_true(all(collect(ret) == rdf))
>at test_sparkSQL_arrow.R:143:2
>   5. SparkR::collect(ret)
>   6. SparkR (local) .local(x, ...)
>  10. SparkR:::readRaw(conn)
>  11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization 
> -
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:184:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:217:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>   1. testthat::expect_true(all(collect(ret) == rdf))
>at test_sparkSQL_arrow.R:229:2
>   5. SparkR::collect(ret)
>   6. SparkR (local) .local(x, ...)
>  10. SparkR:::readRaw(conn)
>  11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow 
> optimiz
> `count(...)` threw an error with unexpected message.
> Expected match: "expected IntegerType, IntegerType, got IntegerType, 
> StringType"
> Actual message: "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): 
> org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced 
> errors: The tzdb package is not installed. Timezones will not be available to 
> Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : 
> write_arrow has been removed\nCalls:  -> writeRaw -> writeInt -> 
> writeBin -> \nExecution halted\n\r\n\tat 
> org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat
>  
> org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat
>  
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat
>  
> org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat
>  
> org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat
>  
> org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat
>  scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat 
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)\r\n\tat 
> 

[jira] [Assigned] (SPARK-40114) Arrow 9.0.0 support with SparkR

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40114:


Assignee: Apache Spark

> Arrow 9.0.0 support with SparkR
> ---
>
> Key: SPARK-40114
> URL: https://issues.apache.org/jira/browse/SPARK-40114
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> == Failed 
> ==
> -- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization 
> -
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:103:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:133:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>   1. testthat::expect_true(all(collect(ret) == rdf))
>at test_sparkSQL_arrow.R:143:2
>   5. SparkR::collect(ret)
>   6. SparkR (local) .local(x, ...)
>  10. SparkR:::readRaw(conn)
>  11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization 
> -
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:184:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>  1. SparkR::collect(ret)
>   at test_sparkSQL_arrow.R:217:2
>  2. SparkR::collect(ret)
>  3. SparkR (local) .local(x, ...)
>  7. SparkR:::readRaw(conn)
>  8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type 
> sp
> Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
> 'n' argument
> Backtrace:
>   1. testthat::expect_true(all(collect(ret) == rdf))
>at test_sparkSQL_arrow.R:229:2
>   5. SparkR::collect(ret)
>   6. SparkR (local) .local(x, ...)
>  10. SparkR:::readRaw(conn)
>  11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
> -- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow 
> optimiz
> `count(...)` threw an error with unexpected message.
> Expected match: "expected IntegerType, IntegerType, got IntegerType, 
> StringType"
> Actual message: "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): 
> org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced 
> errors: The tzdb package is not installed. Timezones will not be available to 
> Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : 
> write_arrow has been removed\nCalls:  -> writeRaw -> writeInt -> 
> writeBin -> \nExecution halted\n\r\n\tat 
> org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat
>  
> org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat
>  
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat
>  
> org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat
>  
> org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat
>  
> org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat
>  scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat 
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.hashAgg_doAggregateWithoutKey_0$(Unknown
>  

[jira] [Assigned] (SPARK-40121) Initialize projection used for Python UDF

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40121:


Assignee: Apache Spark

> Initialize projection used for Python UDF
> -
>
> Key: SPARK-40121
> URL: https://issues.apache.org/jira/browse/SPARK-40121
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> >>> from pyspark.sql.functions import udf, rand
> >>> spark.range(10).select(udf(lambda x: x)(rand())).show()
> {code}
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at 
> scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161)
>   at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
>   at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213)
> {code}
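
(Illustrative sketch only, not the actual patch: the NPE above is the general failure mode where per-partition state is used before {{initialize()}} is called. A Spark-free Scala analogue of why a projection over a non-deterministic expression such as rand() needs that call:)

{code:scala}
// Hedged analogue of the bug class, not Spark's internal API: state that is
// supposed to be set up per partition is null until initialize() runs.
class SeededProjection {
  private var rng: scala.util.Random = _            // created in initialize()

  def initialize(partitionIndex: Int): Unit =
    rng = new scala.util.Random(partitionIndex)

  def apply(x: Long): Double = rng.nextDouble() + x  // NPE if initialize() was skipped
}

val p = new SeededProjection
p.initialize(0)   // analogous to the initialization step the ticket title refers to
val value = p(1L)
{code}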



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40121) Initialize projection used for Python UDF

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40121:


Assignee: (was: Apache Spark)

> Initialize projection used for Python UDF
> -
>
> Key: SPARK-40121
> URL: https://issues.apache.org/jira/browse/SPARK-40121
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> >>> from pyspark.sql.functions import udf, rand
> >>> spark.range(10).select(udf(lambda x: x)(rand())).show()
> {code}
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at 
> scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161)
>   at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
>   at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40121) Initialize projection used for Python UDF

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580714#comment-17580714
 ] 

Apache Spark commented on SPARK-40121:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37552

> Initialize projection used for Python UDF
> -
>
> Key: SPARK-40121
> URL: https://issues.apache.org/jira/browse/SPARK-40121
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> >>> from pyspark.sql.functions import udf, rand
> >>> spark.range(10).select(udf(lambda x: x)(rand())).show()
> {code}
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at 
> scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161)
>   at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
>   at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38954) Implement sharing of cloud credentials among driver and executors

2022-08-17 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580711#comment-17580711
 ] 

Steve Loughran commented on SPARK-38954:


Any plans to put the PR up? I'm curious about what you've done.

The Hadoop S3A delegation tokens can be used to collect credentials and 
encryption secrets at Spark launch and pass them to the workers, though there's 
no mechanism to update tokens during the life of a session.

You might want to look at this code and experiment with it.

If you are doing your own provider, do update credentials at least 30 seconds 
before they expire, and add some sync blocks so that 30 threads don't all try 
to do it independently. 
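
For illustration only (not from this ticket or any PR): a minimal Scala sketch of that kind of provider-side guard, where {{fetchFromIdp()}} is a hypothetical stand-in for whatever the organization's credential service actually exposes, and real code would sit behind the relevant credentials-provider interface.

{code:scala}
import java.time.Instant

// Hedged sketch: cache session credentials, refresh ~30s before expiry, and
// serialize the refresh so concurrent threads don't all hit the identity provider.
final case class SessionCredentials(
    accessKey: String, secretKey: String, sessionToken: String, expiresAt: Instant)

class RefreshingCredentialsCache(
    fetchFromIdp: () => SessionCredentials,   // hypothetical IdP/STS call
    refreshMarginSeconds: Long = 30L) {

  @volatile private var cached: Option[SessionCredentials] = None

  def credentials(): SessionCredentials = cached match {
    case Some(c) if !expiringSoon(c) => c     // fast path, no locking
    case _                           => refresh()
  }

  private def expiringSoon(c: SessionCredentials): Boolean =
    !Instant.now().plusSeconds(refreshMarginSeconds).isBefore(c.expiresAt)

  // synchronized: only one thread talks to the identity provider, the rest reuse its result
  private def refresh(): SessionCredentials = synchronized {
    cached match {
      case Some(c) if !expiringSoon(c) => c
      case _ =>
        val fresh = fetchFromIdp()
        cached = Some(fresh)
        fresh
    }
  }
}
{code}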

> Implement sharing of cloud credentials among driver and executors
> -
>
> Key: SPARK-38954
> URL: https://issues.apache.org/jira/browse/SPARK-38954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Parth Chandra
>Priority: Major
>
> Currently Spark uses external implementations (e.g. hadoop-aws) to access 
> cloud services like S3. In order to access the actual service, these 
> implementations use credentials provider implementations that obtain 
> credentials to allow access to the cloud service.
> These credentials are typically session credentials, which means that they 
> expire after a fixed time. Sometimes, this expiry can be only an hour and for 
> a spark job that runs for many hours (or spark streaming job that runs 
> continuously), the credentials have to be renewed periodically.
> In many organizations, the process of getting credentials may be multi-step. The 
> organization has an identity provider service that provides authentication 
> for the user, while the cloud service provider provides authorization for the 
> roles the user has access to. Once the user is authenticated and her role 
> verified, the credentials are generated for a new session.
> In a large setup with hundreds of Spark jobs and thousands of executors, each 
> executor is then spending a lot of time getting credentials and this may put 
> unnecessary load on the backend authentication services.
> To alleviate this, we can use Spark's architecture to obtain the credentials 
> once in the driver and push the credentials to the executors. In addition, 
> the driver can check the expiry of the credentials and push updated 
> credentials to the executors. This is relatively easy to do since the rpc 
> mechanism to implement this is already in place and is used similarly for 
> Kerberos delegation tokens.
>   



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40121) Initialize projection used for Python UDF

2022-08-17 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40121:


 Summary: Initialize projection used for Python UDF
 Key: SPARK-40121
 URL: https://issues.apache.org/jira/browse/SPARK-40121
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.2.2, 3.3.0, 3.1.3, 3.4.0
Reporter: Hyukjin Kwon


{code}
>>> from pyspark.sql.functions import udf, rand
>>> spark.range(10).select(udf(lambda x: x)(rand())).show()
{code}

{code}
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at 
scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38445) Are hadoop committers used in Structured Streaming?

2022-08-17 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580710#comment-17580710
 ] 

Steve Loughran commented on SPARK-38445:


SPARK-40039 might address this

> Are hadoop committers used in Structured Streaming?
> ---
>
> Key: SPARK-38445
> URL: https://issues.apache.org/jira/browse/SPARK-38445
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Martin Andersson
>Priority: Major
>  Labels: structured-streaming
>
> At the company I work at we're using Spark Structured Streaming to sink 
> messages on kafka to HDFS. We're in the late stages of migrating this 
> component to instead sink messages to AWS S3, and in connection with that we 
> hit upon a couple of issues regarding hadoop committers.
> I've come to understand that the default "file" committer (documented 
> [here|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#Switching_to_an_S3A_Committer])
>  is unsafe to use in S3, which is why [this page in the spark 
> documentation|https://spark.apache.org/docs/3.2.1/cloud-integration.html] 
> recommends using the "directory" (i.e. staging) committer, and in later 
> versions of hadoop they also recommend to use the "magic" committer.
> However, it's not clear whether Spark Structured Streaming even uses 
> committers. There's no "_SUCCESS" file in destination (as compared to normal 
> spark jobs), and the documentation regarding committers used in streaming is 
> non-existent.
> Can anyone please shed some light on this?
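
(Illustrative sketch, not an answer from the ticket: for a batch write, the cloud-integration docs' recommendation quoted above is applied roughly as below; whether the Structured Streaming file sink goes through the same committer path is exactly the open question here. It assumes the spark-hadoop-cloud module is on the classpath, and the property names should be verified against the docs for the Spark/Hadoop versions in use.)

{code:scala}
// Hedged sketch: route S3A output through the "directory" (staging) committer.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-directory-committer-example")
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
{code}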



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition

2022-08-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40105:
---

Assignee: XiDuo You

> Improve repartition in ReplaceCTERefWithRepartition
> ---
>
> Key: SPARK-40105
> URL: https://issues.apache.org/jira/browse/SPARK-40105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
> Fix For: 3.4.0
>
>
> If a CTE cannot be inlined, ReplaceCTERefWithRepartition will add a repartition 
> to force a shuffle so that the references can reuse the shuffle exchange.
> The added repartition should be optimized by AQE for better performance.
> If the user has already specified a rebalance, ReplaceCTERefWithRepartition 
> should skip adding the repartition.
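
(Illustrative sketch, not from the ticket: in a spark-shell session, the second case above, where the CTE definition already carries a user-specified rebalance, might look like the following; the table name is hypothetical.)

{code:scala}
// Hedged sketch: a CTE referenced twice (so it may not be inlined) whose definition
// already ends in a REBALANCE hint; per this change, no extra repartition is stacked on top.
spark.sql("""
  WITH t AS (
    SELECT /*+ REBALANCE */ * FROM events
  )
  SELECT a.user_id, b.user_id
  FROM t a JOIN t b ON a.user_id = b.user_id
""")
{code}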



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-17 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580707#comment-17580707
 ] 

Steve Loughran edited comment on SPARK-38330 at 8/17/22 9:46 AM:
-

bq. Is there a way to work-around this issue while waiting for a version of 
Spark which uses hadoop 3.3.4 (Spark 3.4?)

Remove all jars with cos in the title from your classpath.

Note: EMR is unaffected by this, and so are Cloudera products, primarily because 
they never backported the cos module. This is why it didn't show up in those 
tests.


was (Author: ste...@apache.org):
bq. Is there a way to work-around this issue while waiting for a version of 
Spark which uses hadoop 3.3.4 (Spark 3.4?)

remove all jars with cos in the title from your classpath
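
(Illustrative sketch, not part of the comment: one quick way to check, from a Scala REPL or the driver, which classpath entries have cos in their file name, i.e. the jars suggested for removal. It only inspects {{java.class.path}}, so jars added through other mechanisms would need a separate check.)

{code:scala}
// Hedged sketch: list classpath entries whose file name contains "cos".
import java.io.File

System.getProperty("java.class.path")
  .split(File.pathSeparator)
  .filter(entry => new File(entry).getName.contains("cos"))
  .foreach(println)
{code}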

> Certificate doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
> --
>
> Key: SPARK-38330
> URL: https://issues.apache.org/jira/browse/SPARK-38330
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 3.2.1
> Environment: Spark 3.2.1 built with `hadoop-cloud` flag.
> Direct access to s3 using default file committer.
> JDK8.
>  
>Reporter: André F.
>Priority: Major
>
> Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1, 
> lead us to the current exception while reading files on s3:
> {code:java}
> org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on 
> s3a:///.parquet: com.amazonaws.SdkClientException: Unable to 
> execute HTTP request: Certificate for  doesn't match 
> any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: 
> Unable to execute HTTP request: Certificate for  doesn't match any of 
> the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) 
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) 
> at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code}
>  
> {code:java}
> Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for 
>  doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
>   at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>   at 
> 

[jira] [Resolved] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition

2022-08-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40105.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37537
[https://github.com/apache/spark/pull/37537]

> Improve repartition in ReplaceCTERefWithRepartition
> ---
>
> Key: SPARK-40105
> URL: https://issues.apache.org/jira/browse/SPARK-40105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Minor
> Fix For: 3.4.0
>
>
> If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition to 
> force a shuffle so that the references can reuse the shuffle exchange.
> The added repartition should be optimized by AQE for better performance.
> If the user has already specified a rebalance, ReplaceCTERefWithRepartition 
> should skip adding the repartition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-17 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580707#comment-17580707
 ] 

Steve Loughran edited comment on SPARK-38330 at 8/17/22 9:45 AM:
-

bq. Is there a way to work-around this issue while waiting for a version of 
Spark which uses hadoop 3.3.4 (Spark 3.4?)

remove all jars with cos in the title from your classpath


was (Author: ste...@apache.org):
remove all jars with cos in the title from your classpath

> Certificate doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
> --
>
> Key: SPARK-38330
> URL: https://issues.apache.org/jira/browse/SPARK-38330
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 3.2.1
> Environment: Spark 3.2.1 built with `hadoop-cloud` flag.
> Direct access to s3 using default file committer.
> JDK8.
>  
>Reporter: André F.
>Priority: Major
>
> Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1 
> led us to the following exception while reading files on S3:
> {code:java}
> org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on 
> s3a:///.parquet: com.amazonaws.SdkClientException: Unable to 
> execute HTTP request: Certificate for  doesn't match 
> any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: 
> Unable to execute HTTP request: Certificate for  doesn't match any of 
> the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) 
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) 
> at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code}
>  
> {code:java}
> Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for 
>  doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
>   at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
>   at 
> 

[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-17 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580707#comment-17580707
 ] 

Steve Loughran commented on SPARK-38330:


remove all jars with cos in the title from your classpath

> Certificate doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
> --
>
> Key: SPARK-38330
> URL: https://issues.apache.org/jira/browse/SPARK-38330
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 3.2.1
> Environment: Spark 3.2.1 built with `hadoop-cloud` flag.
> Direct access to s3 using default file committer.
> JDK8.
>  
>Reporter: André F.
>Priority: Major
>
> Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1 
> led us to the following exception while reading files on S3:
> {code:java}
> org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on 
> s3a:///.parquet: com.amazonaws.SdkClientException: Unable to 
> execute HTTP request: Certificate for  doesn't match 
> any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: 
> Unable to execute HTTP request: Certificate for  doesn't match any of 
> the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) 
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) 
> at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code}
>  
> {code:java}
> Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for 
>  doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
>   at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
>   at 
> com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333)
>   at 
> 

[jira] [Created] (SPARK-40120) Make pyspark.sql.readwriter examples self-contained

2022-08-17 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40120:


 Summary: Make pyspark.sql.readwriter examples self-contained
 Key: SPARK-40120
 URL: https://issues.apache.org/jira/browse/SPARK-40120
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40050) Eliminate the Sort if there is a LocalLimit between Join and Sort

2022-08-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40050:

Summary: Eliminate the Sort if there is a LocalLimit between Join and Sort  
(was: Eliminate sort if parent is local limit)

> Eliminate the Sort if there is a LocalLimit between Join and Sort
> -
>
> Key: SPARK-40050
> URL: https://issues.apache.org/jira/browse/SPARK-40050
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> It seems we can remove the Sort operator:
> {code:scala}
> val projectPlan = testRelation.select($"a", $"b")
> val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc)
> val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan)
> val projectPlanB = testRelationB.select($"d")
> val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", 
> $"d")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40119) Add reason for cancelJobGroup

2022-08-17 Thread Santosh Pingale (Jira)
Santosh Pingale created SPARK-40119:
---

 Summary: Add reason for cancelJobGroup 
 Key: SPARK-40119
 URL: https://issues.apache.org/jira/browse/SPARK-40119
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Santosh Pingale


Currently, `cancelJob` supports passing a reason for the cancellation. We use 
`cancelJobGroup` in a few places for async actions. It would be great to be able 
to pass the reason for cancellation to the job group as well.
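
A minimal sketch of the proposed shape; the two-argument `cancelJobGroup` 
overload at the end is hypothetical (it does not exist yet), while `setJobGroup`, 
`cancelJob(jobId, reason)` and `cancelJobGroup(groupId)` are existing 
SparkContext methods.

{code:scala}
// Run in spark-shell, where `sc` is the active SparkContext.
// Tag subsequent jobs with a group id so they can be cancelled together.
sc.setJobGroup("nightly-etl", "nightly ETL jobs", interruptOnCancel = true)

// Today a single job can be cancelled with a reason...
// sc.cancelJob(jobId, "upstream data is incomplete")

// ...but the group-level cancel cannot carry one:
sc.cancelJobGroup("nightly-etl")

// Proposed (hypothetical signature, not yet in SparkContext):
// sc.cancelJobGroup("nightly-etl", "upstream data is incomplete")
{code}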



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40050) Eliminate sort if parent is local limit

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40050:


Assignee: Apache Spark

> Eliminate sort if parent is local limit
> ---
>
> Key: SPARK-40050
> URL: https://issues.apache.org/jira/browse/SPARK-40050
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> It seems we can remove the Sort operator:
> {code:scala}
> val projectPlan = testRelation.select($"a", $"b")
> val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc)
> val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan)
> val projectPlanB = testRelationB.select($"d")
> val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", 
> $"d")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40050) Eliminate sort if parent is local limit

2022-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580661#comment-17580661
 ] 

Apache Spark commented on SPARK-40050:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37519

> Eliminate sort if parent is local limit
> ---
>
> Key: SPARK-40050
> URL: https://issues.apache.org/jira/browse/SPARK-40050
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> It seems we can remove the Sort operator:
> {code:scala}
> val projectPlan = testRelation.select($"a", $"b")
> val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc)
> val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan)
> val projectPlanB = testRelationB.select($"d")
> val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", 
> $"d")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40050) Eliminate sort if parent is local limit

2022-08-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40050:


Assignee: (was: Apache Spark)

> Eliminate sort if parent is local limit
> ---
>
> Key: SPARK-40050
> URL: https://issues.apache.org/jira/browse/SPARK-40050
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> It seems we can remove the Sort operator:
> {code:scala}
> val projectPlan = testRelation.select($"a", $"b")
> val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc)
> val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan)
> val projectPlanB = testRelationB.select($"d")
> val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", 
> $"d")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


