[jira] [Assigned] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
[ https://issues.apache.org/jira/browse/SPARK-40133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40133:
------------------------------------

    Assignee: (was: Apache Spark)

> Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
> -------------------------------------------------------------------------------
>
>             Key: SPARK-40133
>             URL: https://issues.apache.org/jira/browse/SPARK-40133
>         Project: Spark
>      Issue Type: Test
>      Components: Tests
> Affects Versions: 3.4.0
>        Reporter: Yuming Wang
>        Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
[ https://issues.apache.org/jira/browse/SPARK-40133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581160#comment-17581160 ]

Apache Spark commented on SPARK-40133:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37562

> Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
> -------------------------------------------------------------------------------
>
>             Key: SPARK-40133
>             URL: https://issues.apache.org/jira/browse/SPARK-40133
>         Project: Spark
>      Issue Type: Test
>      Components: Tests
> Affects Versions: 3.4.0
>        Reporter: Yuming Wang
>        Priority: Major
>
[jira] [Assigned] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
[ https://issues.apache.org/jira/browse/SPARK-40133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40133:
------------------------------------

    Assignee: Apache Spark

> Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
> -------------------------------------------------------------------------------
>
>             Key: SPARK-40133
>             URL: https://issues.apache.org/jira/browse/SPARK-40133
>         Project: Spark
>      Issue Type: Test
>      Components: Tests
> Affects Versions: 3.4.0
>        Reporter: Yuming Wang
>        Assignee: Apache Spark
>        Priority: Major
>
[jira] [Commented] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
[ https://issues.apache.org/jira/browse/SPARK-40133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581159#comment-17581159 ]

Apache Spark commented on SPARK-40133:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37562

> Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
> -------------------------------------------------------------------------------
>
>             Key: SPARK-40133
>             URL: https://issues.apache.org/jira/browse/SPARK-40133
>         Project: Spark
>      Issue Type: Test
>      Components: Tests
> Affects Versions: 3.4.0
>        Reporter: Yuming Wang
>        Priority: Major
>
[jira] [Created] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
Yuming Wang created SPARK-40133:
-------------------------------

         Summary: Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
             Key: SPARK-40133
             URL: https://issues.apache.org/jira/browse/SPARK-40133
         Project: Spark
      Issue Type: Test
      Components: Tests
Affects Versions: 3.4.0
        Reporter: Yuming Wang
[jira] [Resolved] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
[ https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-40132.
----------------------------------

    Fix Version/s: 3.4.0
                   3.3.1
       Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37561

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> ----------------------------------------------------------------------------
>
>             Key: SPARK-40132
>             URL: https://issues.apache.org/jira/browse/SPARK-40132
>         Project: Spark
>      Issue Type: Bug
>      Components: ML
> Affects Versions: 3.3.0
>        Reporter: Sean R. Owen
>        Assignee: Sean R. Owen
>        Priority: Minor
>         Fix For: 3.4.0, 3.3.1
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in
> Pyspark ML's classification.py but inadvertently removed the parameter
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes
> its constructor to fail when this param is set in the constructor, as it
> isn't recognized by setParams, called by the constructor.
[jira] [Updated] (SPARK-40123) Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar
[ https://issues.apache.org/jira/browse/SPARK-40123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-40123:
---------------------------------

    Fix Version/s: (was: 3.3.1)

> Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar
> ----------------------------------------------------------------------------
>
>             Key: SPARK-40123
>             URL: https://issues.apache.org/jira/browse/SPARK-40123
>         Project: Spark
>      Issue Type: Bug
>      Components: Mesos
> Affects Versions: 3.3.0
>        Reporter: manohar
>        Priority: Major
>          Labels: security-issue
>
> Hello Team,
> We are facing this vulnerability on Spark Installation 3.3.3. Can we please
> upgrade the version of mesos in our installation to address this
> vulnerability.
>
> ||Package||cve||cvss||severity||pkg_version||fixed_in_pkg||pkg_path||
> |1|org.apache.mesos_mesos|CVE-2018-11793|7|high|1.4.0|1.7.1, 1.6.2, 1.5.2, 1.4.3|/opt/domino/spark/python/build/lib/pyspark/jars/mesos-1.4.0-shaded-protobuf.jar|
>
> In our source code I found that the dependent version of the mesos jar is 1.4.3:
>
> user@ThinkPad-E14-02:~/Downloads/spark-master$ grep -ir mesos- *
> core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala: * TaskSchedulerImpl. We assume a Mesos-like model where the application gets resource offers as
> dev/deps/spark-deps-hadoop-2-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
> dev/deps/spark-deps-hadoop-3-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
[jira] [Resolved] (SPARK-40115) Pin Arrow version to 8.0.0 in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-40115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-40115.
----------------------------------

    Resolution: Invalid

> Pin Arrow version to 8.0.0 in AppVeyor
> --------------------------------------
>
>             Key: SPARK-40115
>             URL: https://issues.apache.org/jira/browse/SPARK-40115
>         Project: Spark
>      Issue Type: Test
>      Components: Project Infra, SparkR
> Affects Versions: 3.4.0
>        Reporter: Hyukjin Kwon
>        Priority: Major
>
> Currently SparkR tests fail
> https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387 because
> SparkR does not support Arrow 9.0.0+, see also SPARK-40114.
> We should pin the version to 8.0.0 for now.
[jira] [Assigned] (SPARK-38946) Generates a new dataframe instead of operating inplace in setitem
[ https://issues.apache.org/jira/browse/SPARK-38946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-38946:
------------------------------------

    Assignee: Yikun Jiang

> Generates a new dataframe instead of operating inplace in setitem
> -----------------------------------------------------------------
>
>             Key: SPARK-38946
>             URL: https://issues.apache.org/jira/browse/SPARK-38946
>         Project: Spark
>      Issue Type: Sub-task
>      Components: PySpark
> Affects Versions: 3.4.0
>        Reporter: Yikun Jiang
>        Assignee: Yikun Jiang
>        Priority: Major
>
> {code:java}
> DataFrameTest.test_eval
> DataFrameTest.test_update
> DataFrameTest.test_inplace
> DataFrameTest.test_fillna{code}
[jira] [Resolved] (SPARK-38946) Generates a new dataframe instead of operating inplace in setitem
[ https://issues.apache.org/jira/browse/SPARK-38946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-38946.
----------------------------------

    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 36353
[https://github.com/apache/spark/pull/36353]

> Generates a new dataframe instead of operating inplace in setitem
> -----------------------------------------------------------------
>
>             Key: SPARK-38946
>             URL: https://issues.apache.org/jira/browse/SPARK-38946
>         Project: Spark
>      Issue Type: Sub-task
>      Components: PySpark
> Affects Versions: 3.4.0
>        Reporter: Yikun Jiang
>        Assignee: Yikun Jiang
>        Priority: Major
>         Fix For: 3.4.0
>
> {code:java}
> DataFrameTest.test_eval
> DataFrameTest.test_update
> DataFrameTest.test_inplace
> DataFrameTest.test_fillna{code}
[jira] [Resolved] (SPARK-40121) Initialize projection used for Python UDF
[ https://issues.apache.org/jira/browse/SPARK-40121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-40121.
----------------------------------

    Fix Version/s: 3.3.1
                   3.1.4
                   3.2.3
                   3.4.0
       Resolution: Fixed

Issue resolved by pull request 37552
[https://github.com/apache/spark/pull/37552]

> Initialize projection used for Python UDF
> -----------------------------------------
>
>             Key: SPARK-40121
>             URL: https://issues.apache.org/jira/browse/SPARK-40121
>         Project: Spark
>      Issue Type: Bug
>      Components: PySpark, SQL
> Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0
>        Reporter: Hyukjin Kwon
>        Assignee: Hyukjin Kwon
>        Priority: Major
>         Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0
>
> {code}
> >>> from pyspark.sql.functions import udf, rand
> >>> spark.range(10).select(udf(lambda x: x)(rand())).show()
> {code}
> {code}
> java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
>     at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>     at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161)
>     at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
>     at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213)
> {code}
[jira] [Assigned] (SPARK-40121) Initialize projection used for Python UDF
[ https://issues.apache.org/jira/browse/SPARK-40121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-40121:
------------------------------------

    Assignee: Hyukjin Kwon

> Initialize projection used for Python UDF
> -----------------------------------------
>
>             Key: SPARK-40121
>             URL: https://issues.apache.org/jira/browse/SPARK-40121
>         Project: Spark
>      Issue Type: Bug
>      Components: PySpark, SQL
> Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0
>        Reporter: Hyukjin Kwon
>        Assignee: Hyukjin Kwon
>        Priority: Major
>
> {code}
> >>> from pyspark.sql.functions import udf, rand
> >>> spark.range(10).select(udf(lambda x: x)(rand())).show()
> {code}
> {code}
> java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
>     at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>     at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161)
>     at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
>     at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213)
> {code}
[jira] [Commented] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
[ https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1758#comment-1758 ]

Apache Spark commented on SPARK-40132:
--------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/37561

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> ----------------------------------------------------------------------------
>
>             Key: SPARK-40132
>             URL: https://issues.apache.org/jira/browse/SPARK-40132
>         Project: Spark
>      Issue Type: Bug
>      Components: ML
> Affects Versions: 3.3.0
>        Reporter: Sean R. Owen
>        Assignee: Sean R. Owen
>        Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in
> Pyspark ML's classification.py but inadvertently removed the parameter
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes
> its constructor to fail when this param is set in the constructor, as it
> isn't recognized by setParams, called by the constructor.
[jira] [Assigned] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
[ https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40132:
------------------------------------

    Assignee: Apache Spark (was: Sean R. Owen)

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> ----------------------------------------------------------------------------
>
>             Key: SPARK-40132
>             URL: https://issues.apache.org/jira/browse/SPARK-40132
>         Project: Spark
>      Issue Type: Bug
>      Components: ML
> Affects Versions: 3.3.0
>        Reporter: Sean R. Owen
>        Assignee: Apache Spark
>        Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in
> Pyspark ML's classification.py but inadvertently removed the parameter
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes
> its constructor to fail when this param is set in the constructor, as it
> isn't recognized by setParams, called by the constructor.
[jira] [Assigned] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
[ https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40132:
------------------------------------

    Assignee: Sean R. Owen (was: Apache Spark)

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> ----------------------------------------------------------------------------
>
>             Key: SPARK-40132
>             URL: https://issues.apache.org/jira/browse/SPARK-40132
>         Project: Spark
>      Issue Type: Bug
>      Components: ML
> Affects Versions: 3.3.0
>        Reporter: Sean R. Owen
>        Assignee: Sean R. Owen
>        Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in
> Pyspark ML's classification.py but inadvertently removed the parameter
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes
> its constructor to fail when this param is set in the constructor, as it
> isn't recognized by setParams, called by the constructor.
[jira] [Commented] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
[ https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581110#comment-17581110 ]

Apache Spark commented on SPARK-40132:
--------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/37561

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> ----------------------------------------------------------------------------
>
>             Key: SPARK-40132
>             URL: https://issues.apache.org/jira/browse/SPARK-40132
>         Project: Spark
>      Issue Type: Bug
>      Components: ML
> Affects Versions: 3.3.0
>        Reporter: Sean R. Owen
>        Assignee: Sean R. Owen
>        Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in
> Pyspark ML's classification.py but inadvertently removed the parameter
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes
> its constructor to fail when this param is set in the constructor, as it
> isn't recognized by setParams, called by the constructor.
[jira] [Created] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
Sean R. Owen created SPARK-40132:
--------------------------------

         Summary: MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
             Key: SPARK-40132
             URL: https://issues.apache.org/jira/browse/SPARK-40132
         Project: Spark
      Issue Type: Bug
      Components: ML
Affects Versions: 3.3.0
        Reporter: Sean R. Owen
        Assignee: Sean R. Owen

https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in
Pyspark ML's classification.py but inadvertently removed the parameter
rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes
its constructor to fail when this param is set in the constructor, as it
isn't recognized by setParams, called by the constructor.
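The failure mode described above can be sketched in plain Python (the class names below are hypothetical stand-ins, not the real PySpark classes): the constructor forwards its keyword arguments to setParams, so a parameter dropped from setParams' signature makes the constructor itself raise.

```python
# Hypothetical minimal model of the PySpark pattern: __init__ forwards
# keyword arguments to setParams, so setParams' signature is the gatekeeper.

class BrokenClassifier:
    def __init__(self, **kwargs):
        # Constructor delegates all keyword arguments to setParams.
        self.setParams(**kwargs)

    # Bug: rawPredictionCol was accidentally dropped from this signature.
    def setParams(self, featuresCol="features", labelCol="label"):
        self.featuresCol = featuresCol
        self.labelCol = labelCol
        return self


class FixedClassifier(BrokenClassifier):
    # Fix: restore the missing parameter to setParams.
    def setParams(self, featuresCol="features", labelCol="label",
                  rawPredictionCol="rawPrediction"):
        super().setParams(featuresCol=featuresCol, labelCol=labelCol)
        self.rawPredictionCol = rawPredictionCol
        return self


# Passing the dropped param to the broken class raises TypeError,
# because setParams does not recognize the keyword.
try:
    BrokenClassifier(rawPredictionCol="raw")
    constructor_failed = False
except TypeError:
    constructor_failed = True

fixed = FixedClassifier(rawPredictionCol="raw")
```

This mirrors why the bug only surfaces when the parameter is actually passed to the constructor: with no `rawPredictionCol` argument, `setParams` is called with recognized keywords only and nothing raises.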
[jira] [Created] (SPARK-40131) Support NumPy ndarray in built-in functions
Xinrong Meng created SPARK-40131:
--------------------------------

         Summary: Support NumPy ndarray in built-in functions
             Key: SPARK-40131
             URL: https://issues.apache.org/jira/browse/SPARK-40131
         Project: Spark
      Issue Type: Sub-task
      Components: PySpark
Affects Versions: 3.4.0
        Reporter: Xinrong Meng

Per [https://github.com/apache/spark/pull/37560#discussion_r948572473] we want
to support NumPy ndarray in built-in functions.
[jira] [Commented] (SPARK-40130) Support NumPy scalars in built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581090#comment-17581090 ]

Apache Spark commented on SPARK-40130:
--------------------------------------

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37560

> Support NumPy scalars in built-in functions
> -------------------------------------------
>
>             Key: SPARK-40130
>             URL: https://issues.apache.org/jira/browse/SPARK-40130
>         Project: Spark
>      Issue Type: Sub-task
>      Components: PySpark
> Affects Versions: 3.4.0
>        Reporter: Xinrong Meng
>        Priority: Major
>
> Support NumPy scalars in built-in functions by introducing Py4J input
> converter `NumpyScalarConverter`.
[jira] [Assigned] (SPARK-40130) Support NumPy scalars in built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40130:
------------------------------------

    Assignee: (was: Apache Spark)

> Support NumPy scalars in built-in functions
> -------------------------------------------
>
>             Key: SPARK-40130
>             URL: https://issues.apache.org/jira/browse/SPARK-40130
>         Project: Spark
>      Issue Type: Sub-task
>      Components: PySpark
> Affects Versions: 3.4.0
>        Reporter: Xinrong Meng
>        Priority: Major
>
> Support NumPy scalars in built-in functions by introducing Py4J input
> converter `NumpyScalarConverter`.
[jira] [Commented] (SPARK-40130) Support NumPy scalars in built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581091#comment-17581091 ]

Apache Spark commented on SPARK-40130:
--------------------------------------

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37560

> Support NumPy scalars in built-in functions
> -------------------------------------------
>
>             Key: SPARK-40130
>             URL: https://issues.apache.org/jira/browse/SPARK-40130
>         Project: Spark
>      Issue Type: Sub-task
>      Components: PySpark
> Affects Versions: 3.4.0
>        Reporter: Xinrong Meng
>        Priority: Major
>
> Support NumPy scalars in built-in functions by introducing Py4J input
> converter `NumpyScalarConverter`.
[jira] [Assigned] (SPARK-40130) Support NumPy scalars in built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40130:
------------------------------------

    Assignee: Apache Spark

> Support NumPy scalars in built-in functions
> -------------------------------------------
>
>             Key: SPARK-40130
>             URL: https://issues.apache.org/jira/browse/SPARK-40130
>         Project: Spark
>      Issue Type: Sub-task
>      Components: PySpark
> Affects Versions: 3.4.0
>        Reporter: Xinrong Meng
>        Assignee: Apache Spark
>        Priority: Major
>
> Support NumPy scalars in built-in functions by introducing Py4J input
> converter `NumpyScalarConverter`.
[jira] [Created] (SPARK-40130) Support NumPy scalars in built-in functions
Xinrong Meng created SPARK-40130:
--------------------------------

         Summary: Support NumPy scalars in built-in functions
             Key: SPARK-40130
             URL: https://issues.apache.org/jira/browse/SPARK-40130
         Project: Spark
      Issue Type: Sub-task
      Components: PySpark
Affects Versions: 3.4.0
        Reporter: Xinrong Meng

Support NumPy scalars in built-in functions by introducing Py4J input
converter `NumpyScalarConverter`.
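The essence of such a converter can be sketched without NumPy or Py4J at all: a NumPy scalar is a zero-dimensional value whose `.item()` method returns the equivalent Python builtin (int, float, bool, ...), which Py4J already knows how to ship to the JVM. The function below is a hedged, duck-typed illustration of that idea, not the actual `NumpyScalarConverter` from the PR:

```python
# Hypothetical sketch: unwrap a NumPy scalar into the equivalent Python
# builtin before it is handed to Py4J. Duck-typed so it needs no NumPy
# import: NumPy scalars report ndim == 0 and expose .item().

def convert_numpy_scalar(obj):
    if getattr(obj, "ndim", None) == 0 and hasattr(obj, "item"):
        return obj.item()  # e.g. np.int64(3) -> 3, np.float32(1.5) -> 1.5
    return obj  # non-NumPy values pass through unchanged
```

In the real implementation the converter would be registered with the Py4J gateway so the unwrapping happens automatically for every call argument; the details of that registration are Py4J-specific and omitted here.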
[jira] [Commented] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581077#comment-17581077 ]

L. C. Hsieh commented on SPARK-40128:
-------------------------------------

Added [~dennishuo] as Spark contributor and assigned this JIRA to him.

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in
> VectorizedColumnReader
> -------------------------------------------------------------------
>
>             Key: SPARK-40128
>             URL: https://issues.apache.org/jira/browse/SPARK-40128
>         Project: Spark
>      Issue Type: Improvement
>      Components: SQL
> Affects Versions: 3.3.0
>        Reporter: Dennis Huo
>        Assignee: Dennis Huo
>        Priority: Major
>         Fix For: 3.4.0
>
>     Attachments: delta_length_byte_array.parquet
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
>
> Even though there apparently aren't many writers of the standalone
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec:
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]
> and could be more efficient for types of binary/string data that don't take
> good advantage of sharing common prefixes for incremental encoding.
>
> The problem can be reproduced by trying to load one of the
> [https://github.com/apache/parquet-testing] files
> (delta_length_byte_array.parquet).
[jira] [Assigned] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

L. C. Hsieh reassigned SPARK-40128:
-----------------------------------

    Assignee: Dennis Huo

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in
> VectorizedColumnReader
> -------------------------------------------------------------------
>
>             Key: SPARK-40128
>             URL: https://issues.apache.org/jira/browse/SPARK-40128
>         Project: Spark
>      Issue Type: Improvement
>      Components: SQL
> Affects Versions: 3.3.0
>        Reporter: Dennis Huo
>        Assignee: Dennis Huo
>        Priority: Major
>         Fix For: 3.4.0
>
>     Attachments: delta_length_byte_array.parquet
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
>
> Even though there apparently aren't many writers of the standalone
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec:
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]
> and could be more efficient for types of binary/string data that don't take
> good advantage of sharing common prefixes for incremental encoding.
>
> The problem can be reproduced by trying to load one of the
> [https://github.com/apache/parquet-testing] files
> (delta_length_byte_array.parquet).
[jira] [Commented] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581073#comment-17581073 ]

Chao Sun commented on SPARK-40128:
----------------------------------

Seems we need to add [~dennishuo] as Spark contributor in order to assign him
the JIRA. [~dongjoon] [~viirya] could you help?

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in
> VectorizedColumnReader
> -------------------------------------------------------------------
>
>             Key: SPARK-40128
>             URL: https://issues.apache.org/jira/browse/SPARK-40128
>         Project: Spark
>      Issue Type: Improvement
>      Components: SQL
> Affects Versions: 3.3.0
>        Reporter: Dennis Huo
>        Priority: Major
>         Fix For: 3.4.0
>
>     Attachments: delta_length_byte_array.parquet
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
>
> Even though there apparently aren't many writers of the standalone
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec:
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]
> and could be more efficient for types of binary/string data that don't take
> good advantage of sharing common prefixes for incremental encoding.
>
> The problem can be reproduced by trying to load one of the
> [https://github.com/apache/parquet-testing] files
> (delta_length_byte_array.parquet).
[jira] [Resolved] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-40128.
------------------------------

    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37557
[https://github.com/apache/spark/pull/37557]

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in
> VectorizedColumnReader
> -------------------------------------------------------------------
>
>             Key: SPARK-40128
>             URL: https://issues.apache.org/jira/browse/SPARK-40128
>         Project: Spark
>      Issue Type: Improvement
>      Components: SQL
> Affects Versions: 3.3.0
>        Reporter: Dennis Huo
>        Priority: Major
>         Fix For: 3.4.0
>
>     Attachments: delta_length_byte_array.parquet
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
>
> Even though there apparently aren't many writers of the standalone
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec:
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]
> and could be more efficient for types of binary/string data that don't take
> good advantage of sharing common prefixes for incremental encoding.
>
> The problem can be reproduced by trying to load one of the
> [https://github.com/apache/parquet-testing] files
> (delta_length_byte_array.parquet).
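For context on the encoding discussed above: per the Parquet spec, a DELTA_LENGTH_BYTE_ARRAY page stores all value lengths first (themselves delta-binary-packed), followed by the concatenated value bytes. Assuming the lengths have already been decoded, slicing the data buffer back into values reduces to a running offset, as this illustrative sketch shows:

```python
# Illustrative sketch of the DELTA_LENGTH_BYTE_ARRAY value-splitting step.
# Assumes `lengths` was already decoded from the delta-binary-packed prefix
# of the page; the real reader does this with Parquet's delta decoder.

def split_delta_length_values(lengths, data):
    values, offset = [], 0
    for n in lengths:
        values.append(data[offset:offset + n])  # each value is a byte slice
        offset += n
    return values
```

Because each value's length is known up front, a vectorized reader can compute all slice offsets in one pass, which is the efficiency argument the issue makes for data that does not share common prefixes.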
[jira] [Commented] (SPARK-37544) sequence over dates with month interval is producing incorrect results
[ https://issues.apache.org/jira/browse/SPARK-37544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581068#comment-17581068 ] Apache Spark commented on SPARK-37544: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/37559 > sequence over dates with month interval is producing incorrect results > -- > > Key: SPARK-37544 > URL: https://issues.apache.org/jira/browse/SPARK-37544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 > Environment: Ubuntu 20, OSX 11.6 > OpenJDK 11, Spark 3.2 >Reporter: Vsevolod Ostapenko >Assignee: Bruce Robbins >Priority: Major > Labels: correctness > Fix For: 3.3.0, 3.2.2 > > > Sequence function with dates and step interval in months producing unexpected > results. > Here is a sample using Spark 3.2 (though the behavior is the same in 3.1.1 > and presumably earlier): > {{scala> spark.sql("select sequence(date '2021-01-01', date '2022-01-01', > interval '3' month) x, date '2021-01-01' + interval '3' month y").collect()}} > {{res1: Array[org.apache.spark.sql.Row] = Array([WrappedArray(2021-01-01, > {color:#FF}*2021-03-31, 2021-06-30, 2021-09-30,* > {color}{color:#172b4d}2022-01-01{color}),2021-04-01])}} > Expected result of adding 3 months to the 2021-01-01 is 2021-04-01, while > sequence returns 2021-03-31. > At the same time sequence over timestamps works as expected: > {{scala> spark.sql("select sequence(timestamp '2021-01-01 00:00', timestamp > '2022-01-01 00:00', interval '3' month) x").collect()}} > {{res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(2021-01-01 > 00:00:00.0, *2021-04-01* 00:00:00.0, *2021-07-01* 00:00:00.0, *2021-10-01* > 00:00:00.0, 2022-01-01 00:00:00.0)])}} > > A similar issue was reported in the past - [SPARK-31654] sequence producing > inconsistent intervals for month step - ASF JIRA (apache.org) > It's marked resolved, but the problem is either resurfaced or was never > actually fixed. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
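The stepping the reporter expects matches plain java.time month arithmetic. This standalone sketch (not Spark code) shows what `sequence` over dates should produce for the query above, i.e. start + k * 3 months:

```java
import java.time.LocalDate;

public class MonthStepDemo {
    public static void main(String[] args) {
        // Expected elements of sequence(date '2021-01-01', date '2022-01-01', interval '3' month):
        // each one should be the start date plus k * 3 months, as java.time computes it.
        LocalDate start = LocalDate.parse("2021-01-01");
        for (int k = 0; k <= 4; k++) {
            System.out.println(start.plusMonths(3L * k));
        }
        // prints 2021-01-01, 2021-04-01, 2021-07-01, 2021-10-01, 2022-01-01 --
        // matching the timestamp behaviour shown above, not the 2021-03-31 /
        // 2021-06-30 / 2021-09-30 values the date version returns.
    }
}
```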
[jira] [Assigned] (SPARK-40110) Add JDBCWithAQESuite
[ https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-40110: Assignee: Kazuyuki Tanimura > Add JDBCWithAQESuite > > > Key: SPARK-40110 > URL: https://issues.apache.org/jira/browse/SPARK-40110 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Minor > > Currently `JDBCSuite` assumes that AQE is always turned off. We should also > test with AQE turned on -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40110) Add JDBCWithAQESuite
[ https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-40110. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37544 [https://github.com/apache/spark/pull/37544] > Add JDBCWithAQESuite > > > Key: SPARK-40110 > URL: https://issues.apache.org/jira/browse/SPARK-40110 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Minor > Fix For: 3.4.0 > > > Currently `JDBCSuite` assumes that AQE is always turned off. We should also > test with AQE turned on -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40109) New SQL function: get()
[ https://issues.apache.org/jira/browse/SPARK-40109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-40109. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37541 [https://github.com/apache/spark/pull/37541] > New SQL function: get() > --- > > Key: SPARK-40109 > URL: https://issues.apache.org/jira/browse/SPARK-40109 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > > > Currently, when accessing an array element with an invalid index under ANSI SQL > mode, the error is like: > {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 > elements. Use `try_element_at` and increase the array index by 1(the starting > array index is 1 for `try_element_at`) to tolerate accessing element at > invalid index and return NULL instead. If necessary set > "spark.sql.ansi.enabled" to "false" to bypass this error. > {quote} > The provided solution is complicated. I suggest introducing a new method > get() which always returns null on an invalid array index. This is from > [https://docs.snowflake.com/en/sql-reference/functions/get.html.] > Since Spark's map access always returns null, let's not support map type in > the get method for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
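As a rough illustration of the proposed semantics, here is a standalone Java sketch. The helper name and the 0-based indexing are assumptions drawn from the Snowflake function the issue links to, not from Spark's actual implementation:

```java
public class SafeGet {
    // Hypothetical helper mirroring the proposed get() behaviour: return the
    // element at a 0-based index, or null when the index is invalid, instead of
    // raising INVALID_ARRAY_INDEX under ANSI mode.
    static Integer get(int[] array, int index) {
        if (array == null || index < 0 || index >= array.length) {
            return null;
        }
        return array[index];
    }

    public static void main(String[] args) {
        int[] xs = {10, 20, 30};
        System.out.println(get(xs, 1));   // 20
        System.out.println(get(xs, -1));  // null, no error
        System.out.println(get(xs, 3));   // null, no error
    }
}
```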
[jira] [Created] (SPARK-40129) Decimal multiply can produce the wrong answer because it rounds twice
Robert Joseph Evans created SPARK-40129: --- Summary: Decimal multiply can produce the wrong answer because it rounds twice Key: SPARK-40129 URL: https://issues.apache.org/jira/browse/SPARK-40129 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0, 3.2.0, 3.4.0 Reporter: Robert Joseph Evans This looks like it has been around for a long time, but I have reproduced it in 3.2.0+. The example here is multiplying Decimal(38, 10) by another Decimal(38, 10), but I think it can be reproduced with other number combinations, and possibly with divide too. {code:java} Seq("9173594185998001607642838421.5479932913").toDF.selectExpr("CAST(value as DECIMAL(38,10)) as a").selectExpr("a * CAST(-12 as DECIMAL(38,10))").show(truncate=false) {code} This produces an answer in Spark of {{-110083130231976019291714061058.575920}}. But if I do the calculation in regular java BigDecimal I get {{-110083130231976019291714061058.575919}} {code:java} BigDecimal l = new BigDecimal("9173594185998001607642838421.5479932913"); BigDecimal r = new BigDecimal("-12.00"); BigDecimal prod = l.multiply(r); BigDecimal rounded_prod = prod.setScale(6, RoundingMode.HALF_UP); {code} Spark does essentially all of the same operations, but it uses Decimal to do it instead of java's BigDecimal directly. Spark, by way of Decimal, will set a MathContext for the multiply operation that has a max precision of 38 and will do half up rounding. That means that the result of the multiply operation in Spark is {{{}-110083130231976019291714061058.57591950{}}}, but for the java BigDecimal code the result is {{{}-110083130231976019291714061058.575919495600{}}}. Then in CheckOverflow for 3.2.0 and 3.3.0 or in just the regular Multiply expression in 3.4.0 the setScale is called (as a part of Decimal.setPrecision). At that point the already rounded number is rounded yet again, resulting in what is arguably a wrong answer by Spark. 
I have not fully tested this, but it looks like we could just remove the MathContext entirely in Decimal, or set it to UNLIMITED. All of the decimal operations appear to have their own overflow and rounding anyway. If we want to potentially reduce the total memory usage, we could also set the max precision to 39 and truncate (round down) the result in the math context instead. That would then let us round the result correctly in setPrecision afterwards. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
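The double rounding described in the report can be reproduced with `java.math.BigDecimal` alone. This sketch contrasts a single final rounding with the MathContext-then-setScale sequence the report attributes to Spark's Decimal; the MathContext path first rounds the product to 38 significant digits, then rounds again when setting the scale:

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class DoubleRoundingDemo {
    public static void main(String[] args) {
        BigDecimal l = new BigDecimal("9173594185998001607642838421.5479932913");
        BigDecimal r = new BigDecimal("-12.00");

        // Single rounding: exact product, then one setScale at the end.
        BigDecimal exact = l.multiply(r); // ...058.575919495600 (42 digits)
        BigDecimal once = exact.setScale(6, RoundingMode.HALF_UP);

        // Double rounding, mimicking Decimal's MathContext(38, HALF_UP):
        // the multiply itself already rounds to 38 significant digits...
        BigDecimal mc = l.multiply(r, new MathContext(38, RoundingMode.HALF_UP)); // ...058.57591950
        // ...and setPrecision/setScale then rounds that rounded value again.
        BigDecimal twice = mc.setScale(6, RoundingMode.HALF_UP);

        System.out.println(once);  // -110083130231976019291714061058.575919
        System.out.println(twice); // -110083130231976019291714061058.575920
    }
}
```

The `.4956` tail rounds down in one step, but once it has been rounded up to `.50` at 38 digits, the second rounding pushes the last kept digit up as well.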
[jira] [Commented] (SPARK-38954) Implement sharing of cloud credentials among driver and executors
[ https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581011#comment-17581011 ] Apache Spark commented on SPARK-38954: -- User 'parthchandra' has created a pull request for this issue: https://github.com/apache/spark/pull/37558 > Implement sharing of cloud credentials among driver and executors > - > > Key: SPARK-38954 > URL: https://issues.apache.org/jira/browse/SPARK-38954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Parth Chandra >Priority: Major > > Currently Spark uses external implementations (e.g. hadoop-aws) to access > cloud services like S3. In order to access the actual service, these > implementations use credentials provider implementations that obtain > credentials to allow access to the cloud service. > These credentials are typically session credentials, which means that they > expire after a fixed time. Sometimes, this expiry can be only an hour and for > a spark job that runs for many hours (or spark streaming job that runs > continuously), the credentials have to be renewed periodically. > In many organizations, the process of getting credentials may be multi-step. The > organization has an identity provider service that provides authentication > for the user, while the cloud service provider provides authorization for the > roles the user has access to. Once the user is authenticated and her role > verified, the credentials are generated for a new session. > In a large setup with hundreds of Spark jobs and thousands of executors, each > executor is then spending a lot of time getting credentials and this may put > unnecessary load on the backend authentication services. > To alleviate this, we can use Spark's architecture to obtain the credentials > once in the driver and push the credentials to the executors. 
In addition, > the driver can check the expiry of the credentials and push updated > credentials to the executors. This is relatively easy to do since the rpc > mechanism to implement this is already in place and is used similarly for > Kerberos delegation tokens. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
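A minimal sketch of the driver-side expiry check the description proposes. Every name here is hypothetical (Spark has no such class); it only illustrates the policy of renewing session credentials some margin before they expire, before pushing the refreshed credentials to executors:

```java
import java.time.Duration;
import java.time.Instant;

public class CredentialRenewalDemo {
    // Hypothetical policy: renew when we are within `margin` of the expiry time,
    // so executors never hold credentials that are about to lapse.
    static boolean shouldRenew(Instant now, Instant expiry, Duration margin) {
        return !now.isBefore(expiry.minus(margin));
    }

    public static void main(String[] args) {
        Instant expiry = Instant.parse("2022-08-17T12:00:00Z");
        Duration margin = Duration.ofMinutes(10);
        // Well before the renewal window: keep the current credentials.
        System.out.println(shouldRenew(Instant.parse("2022-08-17T11:40:00Z"), expiry, margin)); // false
        // Inside the window: fetch new credentials and push them to executors.
        System.out.println(shouldRenew(Instant.parse("2022-08-17T11:55:00Z"), expiry, margin)); // true
    }
}
```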
[jira] [Assigned] (SPARK-38954) Implement sharing of cloud credentials among driver and executors
[ https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38954: Assignee: Apache Spark > Implement sharing of cloud credentials among driver and executors > - > > Key: SPARK-38954 > URL: https://issues.apache.org/jira/browse/SPARK-38954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Parth Chandra >Assignee: Apache Spark >Priority: Major > > Currently Spark uses external implementations (e.g. hadoop-aws) to access > cloud services like S3. In order to access the actual service, these > implementations use credentials provider implementations that obtain > credentials to allow access to the cloud service. > These credentials are typically session credentials, which means that they > expire after a fixed time. Sometimes, this expiry can be only an hour and for > a spark job that runs for many hours (or spark streaming job that runs > continuously), the credentials have to be renewed periodically. > In many organizations, the process of getting credentials may multi-step. The > organization has an identity provider service that provides authentication > for the user, while the cloud service provider provides authorization for the > roles the user has access to. Once the user is authenticated and her role > verified, the credentials are generated for a new session. > In a large setup with hundreds of Spark jobs and thousands of executors, each > executor is then spending a lot of time getting credentials and this may put > unnecessary load on the backend authentication services. > The alleviate this, we can use Spark's architecture to obtain the credentials > once in the driver and push the credentials to the executors. In addition, > the driver can check the expiry of the credentials and push updated > credentials to the executors. 
This is relatively easy to do since the rpc > mechanism to implement this is already in place and is used similarly for > Kerberos delegation tokens. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38954) Implement sharing of cloud credentials among driver and executors
[ https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581009#comment-17581009 ] Apache Spark commented on SPARK-38954: -- User 'parthchandra' has created a pull request for this issue: https://github.com/apache/spark/pull/37558 > Implement sharing of cloud credentials among driver and executors > - > > Key: SPARK-38954 > URL: https://issues.apache.org/jira/browse/SPARK-38954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Parth Chandra >Priority: Major > > Currently Spark uses external implementations (e.g. hadoop-aws) to access > cloud services like S3. In order to access the actual service, these > implementations use credentials provider implementations that obtain > credentials to allow access to the cloud service. > These credentials are typically session credentials, which means that they > expire after a fixed time. Sometimes, this expiry can be only an hour and for > a spark job that runs for many hours (or spark streaming job that runs > continuously), the credentials have to be renewed periodically. > In many organizations, the process of getting credentials may multi-step. The > organization has an identity provider service that provides authentication > for the user, while the cloud service provider provides authorization for the > roles the user has access to. Once the user is authenticated and her role > verified, the credentials are generated for a new session. > In a large setup with hundreds of Spark jobs and thousands of executors, each > executor is then spending a lot of time getting credentials and this may put > unnecessary load on the backend authentication services. > The alleviate this, we can use Spark's architecture to obtain the credentials > once in the driver and push the credentials to the executors. 
In addition, > the driver can check the expiry of the credentials and push updated > credentials to the executors. This is relatively easy to do since the rpc > mechanism to implement this is already in place and is used similarly for > Kerberos delegation tokens. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38954) Implement sharing of cloud credentials among driver and executors
[ https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38954: Assignee: (was: Apache Spark) > Implement sharing of cloud credentials among driver and executors > - > > Key: SPARK-38954 > URL: https://issues.apache.org/jira/browse/SPARK-38954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Parth Chandra >Priority: Major > > Currently Spark uses external implementations (e.g. hadoop-aws) to access > cloud services like S3. In order to access the actual service, these > implementations use credentials provider implementations that obtain > credentials to allow access to the cloud service. > These credentials are typically session credentials, which means that they > expire after a fixed time. Sometimes, this expiry can be only an hour and for > a spark job that runs for many hours (or spark streaming job that runs > continuously), the credentials have to be renewed periodically. > In many organizations, the process of getting credentials may multi-step. The > organization has an identity provider service that provides authentication > for the user, while the cloud service provider provides authorization for the > roles the user has access to. Once the user is authenticated and her role > verified, the credentials are generated for a new session. > In a large setup with hundreds of Spark jobs and thousands of executors, each > executor is then spending a lot of time getting credentials and this may put > unnecessary load on the backend authentication services. > The alleviate this, we can use Spark's architecture to obtain the credentials > once in the driver and push the credentials to the executors. In addition, > the driver can check the expiry of the credentials and push updated > credentials to the executors. 
This is relatively easy to do since the rpc > mechanism to implement this is already in place and is used similarly for > Kerberos delegation tokens. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Huo updated SPARK-40128: --- Docs Text: Added support for keeping vectorized reads enabled for Parquet files using the DELTA_LENGTH_BYTE_ARRAY encoding as a standalone column encoding. Previously, the related DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY encodings were accepted as column encodings, but DELTA_LENGTH_BYTE_ARRAY would still be rejected as "unsupported". > Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in > VectorizedColumnReader > - > > Key: SPARK-40128 > URL: https://issues.apache.org/jira/browse/SPARK-40128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dennis Huo >Priority: Major > Attachments: delta_length_byte_array.parquet > > > Even though https://issues.apache.org/jira/browse/SPARK-36879 added > implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and > DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and > DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with > DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of > DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes). > Even though there apparently aren't many writers of the standalone > DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: > [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6 > and could be more efficient for types of binary/string data that don't take > good advantage of sharing common prefixes for incremental encoding. > The problem can be reproduced by trying to load one of the > [https://github.com/apache/parquet-testing] files > (delta_length_byte_array.parquet). 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
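For readers unfamiliar with the encoding, here is a simplified sketch of the DELTA_LENGTH_BYTE_ARRAY layout: a block of value lengths, followed by all value bytes concatenated back to back. This only illustrates the idea; in the real Parquet format the lengths are themselves DELTA_BINARY_PACKED (delta-encoded and bit-packed), per the spec linked above:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DeltaLengthByteArraySketch {
    // Simplified page model: lengths first, then concatenated value bytes.
    // Plain ints stand in for the delta/bit-packed length encoding.
    static int[] lengths;
    static byte[] data;

    static void encode(List<String> values) {
        lengths = new int[values.size()];
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < values.size(); i++) {
            byte[] b = values.get(i).getBytes(StandardCharsets.UTF_8);
            lengths[i] = b.length;   // record each value's byte length
            buf.writeBytes(b);       // append the raw bytes, no per-value framing
        }
        data = buf.toByteArray();
    }

    static List<String> decode() {
        List<String> out = new ArrayList<>();
        int offset = 0;
        for (int len : lengths) {    // lengths alone are enough to re-split the data
            out.add(new String(data, offset, len, StandardCharsets.UTF_8));
            offset += len;
        }
        return out;
    }

    public static void main(String[] args) {
        encode(List.of("Hello", "World!", "Foobar"));
        System.out.println(decode()); // [Hello, World!, Foobar]
    }
}
```

Unlike DELTA_BYTE_ARRAY, no prefix sharing is attempted, which is why this layout can win for binary/string data whose values share little common prefix.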
[jira] [Comment Edited] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580971#comment-17580971 ] Marcelo Rossini Castro edited comment on SPARK-40063 at 8/17/22 7:36 PM: - Normally I use the default, {{{}distributed-sequence{}}}, but I already tried {{sequence}} too and I get the same error. So, I tried it again, combining with {{compute.ordered_head}} enabled. This operation requires me to use {{compute.ops_on_diff_frames}} enabled, I think it's worth mentioning. was (Author: JIRAUSER294354): Normally I use the default, {{{}distributed-sequence{}}}, but I already tried {{sequence}} too and I get the same error. So, I tried it again, combining with {{compute.ordered_head}} enabled. I'm having to use {{compute.ops_on_diff_frames}} enabled, I think it's worth mentioning. > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 > Environment: Databricks Runtime 11.1 >Reporter: Marcelo Rossini Castro >Priority: Minor > Labels: Pandas, PySpark > > When using the apply function to apply a function to a DataFrame column, it > ends up mixing the column's rows ordering. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['col_to_apply_function'] = df.apply(lambda row: > example_func(row['col_to_apply_function']), axis=1) {code} > A workaround is to assign the results to a new column instead of the same > one, but if the old column is dropped, the same error is produced. > Setting one column as index also didn't work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580971#comment-17580971 ] Marcelo Rossini Castro commented on SPARK-40063: Normally I use the default, {{{}distributed-sequence{}}}, but I already tried {{sequence}} too and I get the same error. So, I tried it again, combining with {{compute.ordered_head}} enabled. I'm having to use {{compute.ops_on_diff_frames}} enabled, I think it's worth mentioning. > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 > Environment: Databricks Runtime 11.1 >Reporter: Marcelo Rossini Castro >Priority: Minor > Labels: Pandas, PySpark > > When using the apply function to apply a function to a DataFrame column, it > ends up mixing the column's rows ordering. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['col_to_apply_function'] = df.apply(lambda row: > example_func(row['col_to_apply_function']), axis=1) {code} > A workaround is to assign the results to a new column instead of the same > one, but if the old column is dropped, the same error is produced. > Setting one column as index also didn't work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38954) Implement sharing of cloud credentials among driver and executors
[ https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580929#comment-17580929 ] Parth Chandra commented on SPARK-38954: --- Sorry about the delay, should have updated the JIRA. I ran into some testing issues but those are now resolved. Getting it ready now. > Implement sharing of cloud credentials among driver and executors > - > > Key: SPARK-38954 > URL: https://issues.apache.org/jira/browse/SPARK-38954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Parth Chandra >Priority: Major > > Currently Spark uses external implementations (e.g. hadoop-aws) to access > cloud services like S3. In order to access the actual service, these > implementations use credentials provider implementations that obtain > credentials to allow access to the cloud service. > These credentials are typically session credentials, which means that they > expire after a fixed time. Sometimes, this expiry can be only an hour and for > a spark job that runs for many hours (or spark streaming job that runs > continuously), the credentials have to be renewed periodically. > In many organizations, the process of getting credentials may multi-step. The > organization has an identity provider service that provides authentication > for the user, while the cloud service provider provides authorization for the > roles the user has access to. Once the user is authenticated and her role > verified, the credentials are generated for a new session. > In a large setup with hundreds of Spark jobs and thousands of executors, each > executor is then spending a lot of time getting credentials and this may put > unnecessary load on the backend authentication services. > The alleviate this, we can use Spark's architecture to obtain the credentials > once in the driver and push the credentials to the executors. 
In addition, > the driver can check the expiry of the credentials and push updated > credentials to the executors. This is relatively easy to do since the rpc > mechanism to implement this is already in place and is used similarly for > Kerberos delegation tokens. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Huo updated SPARK-40128: --- Description: Even though https://issues.apache.org/jira/browse/SPARK-36879 added implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes). Even though there apparently aren't many writers of the standalone DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6 and could be more efficient for types of binary/string data that don't take good advantage of sharing common prefixes for incremental encoding. The problem can be reproduced by trying to load one of the [https://github.com/apache/parquet-testing] files (delta_length_byte_array.parquet). was: Even though https://issues.apache.org/jira/browse/SPARK-36879 added implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes). Even though there apparently aren't many writers of the standalone DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6 and could be more efficient for types of binary/string data that don't take good advantage of sharing common prefixes for incremental encoding. 
The problem and be reproduced by trying to load one of the [https://github.com/apache/parquet-testing] files (delta_length_byte_array.parquet). > Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in > VectorizedColumnReader > - > > Key: SPARK-40128 > URL: https://issues.apache.org/jira/browse/SPARK-40128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dennis Huo >Priority: Major > Attachments: delta_length_byte_array.parquet > > > Even though https://issues.apache.org/jira/browse/SPARK-36879 added > implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and > DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and > DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with > DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of > DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes). > Even though there apparently aren't many writers of the standalone > DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: > [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6 > and could be more efficient for types of binary/string data that don't take > good advantage of sharing common prefixes for incremental encoding. > The problem can be reproduced by trying to load one of the > [https://github.com/apache/parquet-testing] files > (delta_length_byte_array.parquet). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580899#comment-17580899 ] Apache Spark commented on SPARK-40128: -- User 'sfc-gh-dhuo' has created a pull request for this issue: https://github.com/apache/spark/pull/37557 > Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in > VectorizedColumnReader > - > > Key: SPARK-40128 > URL: https://issues.apache.org/jira/browse/SPARK-40128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dennis Huo >Priority: Major > Attachments: delta_length_byte_array.parquet > > > Even though https://issues.apache.org/jira/browse/SPARK-36879 added > implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and > DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and > DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with > DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of > DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes). > Even though there apparently aren't many writers of the standalone > DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: > https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6 > and could be more efficient for types of binary/string data that don't take > good advantage of sharing common prefixes for incremental encoding. > The problem and be reproduced by trying to load one of the > [https://github.com/apache/parquet-testing] files > (delta_length_byte_array.parquet). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40128: Assignee: (was: Apache Spark) > Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in > VectorizedColumnReader > - > > Key: SPARK-40128 > URL: https://issues.apache.org/jira/browse/SPARK-40128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dennis Huo >Priority: Major > Attachments: delta_length_byte_array.parquet > > > Even though https://issues.apache.org/jira/browse/SPARK-36879 added > implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and > DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and > DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with > DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of > DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes). > Even though there apparently aren't many writers of the standalone > DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: > https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6 > and could be more efficient for types of binary/string data that don't take > good advantage of sharing common prefixes for incremental encoding. > The problem and be reproduced by trying to load one of the > [https://github.com/apache/parquet-testing] files > (delta_length_byte_array.parquet). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580898#comment-17580898 ] Apache Spark commented on SPARK-40128: -- User 'sfc-gh-dhuo' has created a pull request for this issue: https://github.com/apache/spark/pull/37557 > Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in > VectorizedColumnReader > - > > Key: SPARK-40128 > URL: https://issues.apache.org/jira/browse/SPARK-40128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dennis Huo >Priority: Major > Attachments: delta_length_byte_array.parquet > > > Even though https://issues.apache.org/jira/browse/SPARK-36879 added > implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and > DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and > DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with > DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of > DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes). > Even though there apparently aren't many writers of the standalone > DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: > https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6 > and could be more efficient for types of binary/string data that don't take > good advantage of sharing common prefixes for incremental encoding. > The problem and be reproduced by trying to load one of the > [https://github.com/apache/parquet-testing] files > (delta_length_byte_array.parquet). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40128: Assignee: Apache Spark > Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in > VectorizedColumnReader > - > > Key: SPARK-40128 > URL: https://issues.apache.org/jira/browse/SPARK-40128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dennis Huo >Assignee: Apache Spark >Priority: Major > Attachments: delta_length_byte_array.parquet > > > Even though https://issues.apache.org/jira/browse/SPARK-36879 added > implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and > DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and > DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with > DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of > DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes). > Even though there apparently aren't many writers of the standalone > DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: > https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6 > and could be more efficient for types of binary/string data that don't take > good advantage of sharing common prefixes for incremental encoding. > The problem and be reproduced by trying to load one of the > [https://github.com/apache/parquet-testing] files > (delta_length_byte_array.parquet). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Huo updated SPARK-40128: --- Attachment: delta_length_byte_array.parquet > Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in > VectorizedColumnReader > - > > Key: SPARK-40128 > URL: https://issues.apache.org/jira/browse/SPARK-40128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dennis Huo >Priority: Major > Attachments: delta_length_byte_array.parquet > > > Even though https://issues.apache.org/jira/browse/SPARK-36879 added > implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and > DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and > DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with > DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of > DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes). > Even though there apparently aren't many writers of the standalone > DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: > https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6 > and could be more efficient for types of binary/string data that don't take > good advantage of sharing common prefixes for incremental encoding. > The problem and be reproduced by trying to load one of the > [https://github.com/apache/parquet-testing] files > (delta_length_byte_array.parquet). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
Dennis Huo created SPARK-40128: -- Summary: Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader Key: SPARK-40128 URL: https://issues.apache.org/jira/browse/SPARK-40128 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Dennis Huo Even though https://issues.apache.org/jira/browse/SPARK-36879 added implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes). Even though there apparently aren't many writers of the standalone DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6 and could be more efficient for types of binary/string data that don't take good advantage of sharing common prefixes for incremental encoding. The problem can be reproduced by trying to load one of the [https://github.com/apache/parquet-testing] files (delta_length_byte_array.parquet). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
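For context on the encoding discussed above, its layout can be sketched in plain Python. This is a simplified illustration only, not Spark's or parquet-mr's implementation: the real Parquet encoding stores the lengths block with DELTA_BINARY_PACKED, while here a plain list of integers stands in for it.

```python
# Simplified sketch of the DELTA_LENGTH_BYTE_ARRAY layout: all value
# lengths are stored up front, followed by the concatenated value bytes.
# (The real encoding packs the lengths with DELTA_BINARY_PACKED; plain
# ints are used here only to show the structure.)
def encode(values):
    lengths = [len(v) for v in values]   # length block comes first
    data = b"".join(values)              # then the concatenated bytes
    return lengths, data

def decode(lengths, data):
    out, offset = [], 0
    for n in lengths:
        out.append(data[offset:offset + n])
        offset += n
    return out

lengths, data = encode([b"Hello", b"World", b"Foobar"])
assert lengths == [5, 5, 6]
assert data == b"HelloWorldFoobar"
assert decode(lengths, data) == [b"Hello", b"World", b"Foobar"]
```

Because suffixes are stored whole rather than prefix-shared, this layout can beat DELTA_BYTE_ARRAY for data with little common-prefix structure, which is the efficiency point the description makes.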
[jira] [Created] (SPARK-40127) FaultToleranceTest should in test dir
Yang Jie created SPARK-40127: Summary: FaultToleranceTest should in test dir Key: SPARK-40127 URL: https://issues.apache.org/jira/browse/SPARK-40127 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Yang Jie FaultToleranceTest is in the core module's src dir and it is not tested by GA -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39799) DataSourceV2: View catalog interface
[ https://issues.apache.org/jira/browse/SPARK-39799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580873#comment-17580873 ] Apache Spark commented on SPARK-39799: -- User 'jzhuge' has created a pull request for this issue: https://github.com/apache/spark/pull/37556 > DataSourceV2: View catalog interface > > > Key: SPARK-39799 > URL: https://issues.apache.org/jira/browse/SPARK-39799 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > The view catalog interfaces. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39799) DataSourceV2: View catalog interface
[ https://issues.apache.org/jira/browse/SPARK-39799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39799: Assignee: Apache Spark > DataSourceV2: View catalog interface > > > Key: SPARK-39799 > URL: https://issues.apache.org/jira/browse/SPARK-39799 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: John Zhuge >Assignee: Apache Spark >Priority: Major > > The view catalog interfaces. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39799) DataSourceV2: View catalog interface
[ https://issues.apache.org/jira/browse/SPARK-39799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39799: Assignee: (was: Apache Spark) > DataSourceV2: View catalog interface > > > Key: SPARK-39799 > URL: https://issues.apache.org/jira/browse/SPARK-39799 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > The view catalog interfaces. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39799) DataSourceV2: View catalog interface
[ https://issues.apache.org/jira/browse/SPARK-39799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580872#comment-17580872 ] Apache Spark commented on SPARK-39799: -- User 'jzhuge' has created a pull request for this issue: https://github.com/apache/spark/pull/37556 > DataSourceV2: View catalog interface > > > Key: SPARK-39799 > URL: https://issues.apache.org/jira/browse/SPARK-39799 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > The view catalog interfaces. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40126) Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical vulnerability
[ https://issues.apache.org/jira/browse/SPARK-40126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Tan updated SPARK-40126: -- Description: Dear Spark Team, Whilst security scanning the docker image: docker.io/apache/spark:v3.3.0, I discovered the following vulnerability/scan results within the image : Type: VULNERABILITY Name: DSA-5169-1 CVSS Score v3: 9.8 Severity: critical The advice from [https://www.debian.org/security/2022/dsa-5169] suggests to upgrade the version of openssl to 1.1.1n-0+deb11u3 Steps to reproduce: Install trivy [https://aquasecurity.github.io/trivy/v0.18.3/installation/] trivy image docker.io/apache/spark:v3.3.0 was: Dear Spark Team, Whilst security scanning the docker image: docker.io/apache/spark:v3.3.0, I discovered the following vulnerability/scan results within the image : Type: VULNERABILITY Name: DSA-5169-1 CVSS Score v3: 9.8 Severity: critical The advice from [https://www.debian.org/security/2022/dsa-5169] suggests to upgrade the version of openssl to 1.1.1n-0+deb11u3 > Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical > vulnerability > > > Key: SPARK-40126 > URL: https://issues.apache.org/jira/browse/SPARK-40126 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.3.0 >Reporter: Jason Tan >Priority: Major > > Dear Spark Team, > Whilst security scanning the docker image: docker.io/apache/spark:v3.3.0, I > discovered the following vulnerability/scan results within the image : > Type: VULNERABILITY > Name: DSA-5169-1 > CVSS Score v3: 9.8 > Severity: critical > The advice from [https://www.debian.org/security/2022/dsa-5169] suggests to > upgrade the version of openssl to 1.1.1n-0+deb11u3 > Steps to reproduce: > Install trivy [https://aquasecurity.github.io/trivy/v0.18.3/installation/] > trivy image docker.io/apache/spark:v3.3.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org 
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40126) Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical vulnerability
[ https://issues.apache.org/jira/browse/SPARK-40126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Tan updated SPARK-40126: -- Summary: Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical vulnerability (was: Security scanning spark v3.3.0 results in DSA-5169-1 critical vulnerability) > Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical > vulnerability > > > Key: SPARK-40126 > URL: https://issues.apache.org/jira/browse/SPARK-40126 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.3.0 >Reporter: Jason Tan >Priority: Major > > Dear Spark Team, > Whilst security scanning the docker image: docker.io/apache/spark:v3.3.0, I > discovered the following vulnerability/scan results within the image : > Type: VULNERABILITY > Name: DSA-5169-1 > CVSS Score v3: 9.8 > Severity: critical > The advice from [https://www.debian.org/security/2022/dsa-5169] suggests to > upgrade the version of openssl to 1.1.1n-0+deb11u3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40126) Security scanning spark v3.3.0 results in DSA-5169-1 critical vulnerability
Jason Tan created SPARK-40126: - Summary: Security scanning spark v3.3.0 results in DSA-5169-1 critical vulnerability Key: SPARK-40126 URL: https://issues.apache.org/jira/browse/SPARK-40126 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 3.3.0 Reporter: Jason Tan Dear Spark Team, Whilst security scanning the docker image: docker.io/apache/spark:v3.3.0, I discovered the following vulnerability/scan results within the image: Type: VULNERABILITY Name: DSA-5169-1 CVSS Score v3: 9.8 Severity: critical The advice from [https://www.debian.org/security/2022/dsa-5169] suggests upgrading the version of openssl to 1.1.1n-0+deb11u3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40119) Add reason for cancelJobGroup
[ https://issues.apache.org/jira/browse/SPARK-40119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40119: Assignee: (was: Apache Spark) > Add reason for cancelJobGroup > -- > > Key: SPARK-40119 > URL: https://issues.apache.org/jira/browse/SPARK-40119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Santosh Pingale >Priority: Minor > > Currently, `cancelJob` supports passing the reason for failure. We use > `cancelJobGroup` in a few cases of async actions. It would be great to pass > reason of cancellation to the job group. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40119) Add reason for cancelJobGroup
[ https://issues.apache.org/jira/browse/SPARK-40119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40119: Assignee: Apache Spark > Add reason for cancelJobGroup > -- > > Key: SPARK-40119 > URL: https://issues.apache.org/jira/browse/SPARK-40119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Santosh Pingale >Assignee: Apache Spark >Priority: Minor > > Currently, `cancelJob` supports passing the reason for failure. We use > `cancelJobGroup` in a few cases of async actions. It would be great to pass > reason of cancellation to the job group. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40119) Add reason for cancelJobGroup
[ https://issues.apache.org/jira/browse/SPARK-40119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580835#comment-17580835 ] Apache Spark commented on SPARK-40119: -- User 'santosh-d3vpl3x' has created a pull request for this issue: https://github.com/apache/spark/pull/37555 > Add reason for cancelJobGroup > -- > > Key: SPARK-40119 > URL: https://issues.apache.org/jira/browse/SPARK-40119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Santosh Pingale >Priority: Minor > > Currently, `cancelJob` supports passing the reason for failure. We use > `cancelJobGroup` in a few cases of async actions. It would be great to pass > reason of cancellation to the job group. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39623) partitionng by datestamp leads to wrong query on backend?
[ https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco resolved SPARK-39623. Resolution: Not A Problem > partitionng by datestamp leads to wrong query on backend? > - > > Key: SPARK-39623 > URL: https://issues.apache.org/jira/browse/SPARK-39623 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dmitry >Priority: Major > > Hello, > I am new to Apache spark, so please bear with me. I would like to report what > seems to me a bug, but may be I am just not understanding something. > My goal is to run data analysis on a spark cluster. Data is stored in > PostgreSQL DB. Tables contained timestamped entries (timestamp with time > zone). > The code look like: > {code:python} > from pyspark.sql import SparkSession > spark = SparkSession \ > .builder \ > .appName("foo") \ > .config("spark.jars", "/opt/postgresql-42.4.0.jar") \ > .getOrCreate() > df = spark.read \ > .format("jdbc") \ > .option("url", "jdbc:postgresql://example.org:5432/postgres") \ > .option("dbtable", "billing") \ > .option("user", "user") \ > .option("driver", "org.postgresql.Driver") \ > .option("numPartitions", "4") \ > .option("partitionColumn", "datestamp") \ > .option("lowerBound", "2022-01-01 00:00:00") \ > .option("upperBound", "2022-06-26 23:59:59") \ > .option("fetchsize", 100) \ > .load() > t0 = time.time() > print("Number of entries is => ", df.count(), " Time to execute ", > time.time()-t0) > ... > {code} > datestamp is timestamp with time zone. > I see this query on DB backend: > {code:java} > SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or > "datestamp" is null > {code} > The table is huge and entries go way back before > 2022-01-02 11:59:59. So what ends up happening - all workers but one > complete and one remaining continues to process that query which, to me, > looks like it wants to get all the data before 2022-01-02 11:59:59. 
Which is > not what I intended. > I remedied this by changing to: > {code:python} > .option("dbtable", "(select * from billinginfo where datestamp > '2022-01-01 00:00:00') as foo") \ > {code} > And that seems to have solved the issue. But this seems kludgy. Am I doing > something wrong or is there a bug in the way partitioning queries are > generated? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
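The behavior reported above follows from how Spark derives one WHERE clause per partition from partitionColumn/lowerBound/numPartitions/upperBound. Below is a simplified pure-Python sketch of that boundary logic (modeled on, but not copied from, Spark's JDBCRelation.columnPartition; integers stand in for timestamps): note that the first partition's predicate is open-ended below, so every row older than lowerBound lands in that one partition, which is why one task keeps running long after the others finish.

```python
# Hypothetical sketch of JDBC read partitioning: split [lower, upper)
# into num_partitions ranges and emit one WHERE clause per range.
def partition_predicates(col, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions
    bounds = [lower + i * stride for i in range(1, num_partitions)]
    preds = []
    for i in range(num_partitions):
        if i == 0:
            # Open-ended below: also sweeps in all rows earlier than
            # lowerBound, plus NULLs -- the skew described in the report.
            preds.append(f'"{col}" < {bounds[0]} OR "{col}" IS NULL')
        elif i == num_partitions - 1:
            # Open-ended above.
            preds.append(f'"{col}" >= {bounds[-1]}')
        else:
            preds.append(f'"{col}" >= {bounds[i - 1]} AND "{col}" < {bounds[i]}')
    return preds

preds = partition_predicates("datestamp", 0, 400, 4)
assert preds[0] == '"datestamp" < 100 OR "datestamp" IS NULL'
assert preds[3] == '"datestamp" >= 300'
```

So lowerBound/upperBound only shape how the range is sliced; they are not filters, which is why the subquery workaround (an explicit datestamp filter in dbtable) behaves differently.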
[jira] [Resolved] (SPARK-40114) Arrow 9.0.0 support with SparkR
[ https://issues.apache.org/jira/browse/SPARK-40114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40114. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37553 [https://github.com/apache/spark/pull/37553] > Arrow 9.0.0 support with SparkR > --- > > Key: SPARK-40114 > URL: https://issues.apache.org/jira/browse/SPARK-40114 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > {code} > == Failed > == > -- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization > - > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:103:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:133:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. testthat::expect_true(all(collect(ret) == rdf)) >at test_sparkSQL_arrow.R:143:2 > 5. SparkR::collect(ret) > 6. SparkR (local) .local(x, ...) > 10. SparkR:::readRaw(conn) > 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 4. 
Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization > - > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:184:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:217:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. testthat::expect_true(all(collect(ret) == rdf)) >at test_sparkSQL_arrow.R:229:2 > 5. SparkR::collect(ret) > 6. SparkR (local) .local(x, ...) > 10. SparkR:::readRaw(conn) > 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow > optimiz > `count(...)` threw an error with unexpected message. > Expected match: "expected IntegerType, IntegerType, got IntegerType, > StringType" > Actual message: "org.apache.spark.SparkException: Job aborted due to stage > failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task > 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): > org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced > errors: The tzdb package is not installed. 
Timezones will not be available to > Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : > write_arrow has been removed\nCalls: -> writeRaw -> writeInt -> > writeBin -> \nExecution halted\n\r\n\tat > org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat > > org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat > > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat > > org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat > > org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat > > org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat >
[jira] [Assigned] (SPARK-40114) Arrow 9.0.0 support with SparkR
[ https://issues.apache.org/jira/browse/SPARK-40114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40114: - Assignee: Hyukjin Kwon > Arrow 9.0.0 support with SparkR > --- > > Key: SPARK-40114 > URL: https://issues.apache.org/jira/browse/SPARK-40114 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > {code} > == Failed > == > -- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization > - > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:103:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:133:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. testthat::expect_true(all(collect(ret) == rdf)) >at test_sparkSQL_arrow.R:143:2 > 5. SparkR::collect(ret) > 6. SparkR (local) .local(x, ...) > 10. SparkR:::readRaw(conn) > 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization > - > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:184:2 > 2. SparkR::collect(ret) > 3. 
SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:217:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. testthat::expect_true(all(collect(ret) == rdf)) >at test_sparkSQL_arrow.R:229:2 > 5. SparkR::collect(ret) > 6. SparkR (local) .local(x, ...) > 10. SparkR:::readRaw(conn) > 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow > optimiz > `count(...)` threw an error with unexpected message. > Expected match: "expected IntegerType, IntegerType, got IntegerType, > StringType" > Actual message: "org.apache.spark.SparkException: Job aborted due to stage > failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task > 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): > org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced > errors: The tzdb package is not installed. 
Timezones will not be available to > Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : > write_arrow has been removed\nCalls: -> writeRaw -> writeInt -> > writeBin -> \nExecution halted\n\r\n\tat > org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat > > org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat > > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat > > org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat > > org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat > > org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.hashAgg_doAggregateWithoutKey_0$(Unknown >
[jira] [Commented] (SPARK-40125) Add separate infra image for lint job
[ https://issues.apache.org/jira/browse/SPARK-40125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580806#comment-17580806 ] Apache Spark commented on SPARK-40125: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37550 > Add separate infra image for lint job > - > > Key: SPARK-40125 > URL: https://issues.apache.org/jira/browse/SPARK-40125 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > To aovid the issue like [#37243 > (comment)|https://github.com/apache/spark/pull/37243#issuecomment-1191422150] > , we had some initial discussion, we'd better move infra image into lint > image to make lint deps static and upgrade manually. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40125) Add separate infra image for lint job
[ https://issues.apache.org/jira/browse/SPARK-40125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40125: Assignee: (was: Apache Spark) > Add separate infra image for lint job > - > > Key: SPARK-40125 > URL: https://issues.apache.org/jira/browse/SPARK-40125 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > To aovid the issue like [#37243 > (comment)|https://github.com/apache/spark/pull/37243#issuecomment-1191422150] > , we had some initial discussion, we'd better move infra image into lint > image to make lint deps static and upgrade manually. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40125) Add separate infra image for lint job
[ https://issues.apache.org/jira/browse/SPARK-40125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580805#comment-17580805 ] Apache Spark commented on SPARK-40125: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37550 > Add separate infra image for lint job > - > > Key: SPARK-40125 > URL: https://issues.apache.org/jira/browse/SPARK-40125 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > To aovid the issue like [#37243 > (comment)|https://github.com/apache/spark/pull/37243#issuecomment-1191422150] > , we had some initial discussion, we'd better move infra image into lint > image to make lint deps static and upgrade manually. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40125) Add separate infra image for lint job
[ https://issues.apache.org/jira/browse/SPARK-40125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40125: Assignee: Apache Spark > Add separate infra image for lint job > - > > Key: SPARK-40125 > URL: https://issues.apache.org/jira/browse/SPARK-40125 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > > To aovid the issue like [#37243 > (comment)|https://github.com/apache/spark/pull/37243#issuecomment-1191422150] > , we had some initial discussion, we'd better move infra image into lint > image to make lint deps static and upgrade manually. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40125) Add separate infra image for lint job
Yikun Jiang created SPARK-40125: --- Summary: Add separate infra image for lint job Key: SPARK-40125 URL: https://issues.apache.org/jira/browse/SPARK-40125 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.4.0 Reporter: Yikun Jiang To avoid issues like [#37243 (comment)|https://github.com/apache/spark/pull/37243#issuecomment-1191422150], we had some initial discussion; we'd better move the infra image into a lint image to keep the lint deps static and upgrade them manually. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
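The idea in SPARK-40125 — keeping lint dependencies static so an upstream tool release cannot break CI, with upgrades done only by a deliberate manual edit — can be sketched as a dedicated image with pinned tool versions. Everything below (the `Dockerfile.lint` name, the base image, and the tool list with its versions) is a hypothetical illustration, not Spark's actual infra configuration:

```shell
# Generate a hypothetical lint-image Dockerfile with pinned tool versions.
# Upgrades then happen only through deliberate edits to this file.
cat > Dockerfile.lint <<'EOF'
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y python3-pip
# Pinned lint dependencies: a new upstream release cannot change CI behavior.
RUN pip3 install flake8==3.9.0 mypy==0.920 black==21.12b0
EOF

# The lint job would then build and run against this image, e.g.:
#   docker build -f Dockerfile.lint -t spark-lint-infra .
grep '==' Dockerfile.lint
```

The pinned `==` constraints are the whole point: the image only changes when someone edits the file and rebuilds.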
[jira] [Commented] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests
[ https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580791#comment-17580791 ] Apache Spark commented on SPARK-40124: -- User 'mskapilks' has created a pull request for this issue: https://github.com/apache/spark/pull/37554 > Update TPCDS v1.4 q32 for Plan Stability tests > -- > > Key: SPARK-40124 > URL: https://issues.apache.org/jira/browse/SPARK-40124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kapil Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests
[ https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40124: Assignee: Apache Spark > Update TPCDS v1.4 q32 for Plan Stability tests > -- > > Key: SPARK-40124 > URL: https://issues.apache.org/jira/browse/SPARK-40124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kapil Singh >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests
[ https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40124: Assignee: (was: Apache Spark) > Update TPCDS v1.4 q32 for Plan Stability tests > -- > > Key: SPARK-40124 > URL: https://issues.apache.org/jira/browse/SPARK-40124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kapil Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests
[ https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kapil Singh updated SPARK-40124: Summary: Update TPCDS v1.4 q32 for Plan Stability tests (was: Update TPCDS v1.4 query32) > Update TPCDS v1.4 q32 for Plan Stability tests > -- > > Key: SPARK-40124 > URL: https://issues.apache.org/jira/browse/SPARK-40124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kapil Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40124) Update TPCDS v1.4 query32
Kapil Singh created SPARK-40124: --- Summary: Update TPCDS v1.4 query32 Key: SPARK-40124 URL: https://issues.apache.org/jira/browse/SPARK-40124 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Kapil Singh -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
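Updating a TPCDS query for the plan-stability tests means the corresponding golden files must be regenerated. As a sketch, Spark's SQL test suites conventionally regenerate golden files when run with the `SPARK_GENERATE_GOLDEN_FILES` environment variable set; the exact suite glob below is illustrative, so verify both against your checkout before relying on them:

```shell
# Build the regeneration command as documented in Spark's SQL test suites.
# The suite pattern "*PlanStability*Suite" is an assumed glob for the
# plan-stability suites; adjust to the suite names in your branch.
regen_cmd='SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStability*Suite"'
echo "$regen_cmd"
```

After running it in a Spark checkout, the regenerated plan files show up as ordinary git diffs to be committed with the query change.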
[jira] [Updated] (SPARK-40123) Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar
[ https://issues.apache.org/jira/browse/SPARK-40123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] manohar updated SPARK-40123: Flags: Patch Labels: security-issue (was: ) > Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar > > > Key: SPARK-40123 > URL: https://issues.apache.org/jira/browse/SPARK-40123 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 3.3.0 >Reporter: manohar >Priority: Major > Labels: security-issue > Fix For: 3.3.1 > > > Hello Team, > We are facing this vulnerability in our Spark 3.3.3 installation. Can we > please upgrade the version of mesos to address this vulnerability? > ||Package||cve||cvss||severity||pkg_version||fixed_in_pkg||pkg_path|| > |1|org.apache.mesos_mesos|CVE-2018-11793|7|high|1.4.0|1.7.1, 1.6.2, 1.5.2, > 1.4.3|/opt/domino/spark/python/build/lib/pyspark/jars/mesos-1.4.0-shaded-protobuf.jar| > In our source code I found that the dependent version of the mesos jar is 1.4.3 > user@ThinkPad-E14-02:~/Downloads/spark-master$ grep -ir mesos- * > core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala: * > TaskSchedulerImpl. We assume a Mesos-like model where the application gets > resource offers as > *dev/deps/spark-deps-hadoop-2-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar > dev/deps/spark-deps-hadoop-3-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40123) Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar
manohar created SPARK-40123: --- Summary: Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar Key: SPARK-40123 URL: https://issues.apache.org/jira/browse/SPARK-40123 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 3.3.0 Reporter: manohar Fix For: 3.3.1 Hello Team, We are facing this vulnerability in our Spark 3.3.3 installation. Can we please upgrade the version of mesos to address this vulnerability? ||Package||cve||cvss||severity||pkg_version||fixed_in_pkg||pkg_path|| |1|org.apache.mesos_mesos|CVE-2018-11793|7|high|1.4.0|1.7.1, 1.6.2, 1.5.2, 1.4.3|/opt/domino/spark/python/build/lib/pyspark/jars/mesos-1.4.0-shaded-protobuf.jar| In our source code I found that the dependent version of the mesos jar is 1.4.3 user@ThinkPad-E14-02:~/Downloads/spark-master$ grep -ir mesos- * core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala: * TaskSchedulerImpl. We assume a Mesos-like model where the application gets resource offers as *dev/deps/spark-deps-hadoop-2-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar dev/deps/spark-deps-hadoop-3-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
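The scan the reporter describes — checking which mesos artifact a Spark installation actually bundles — can be sketched like this. The directory below is created and populated with a dummy jar purely for illustration; in practice you would point `SPARK_JARS_DIR` at a real installation's `jars/` directory (per the report, mesos releases before 1.4.3/1.5.2/1.6.2/1.7.1 are affected by CVE-2018-11793):

```shell
# Hypothetical audit of bundled mesos jars in a Spark jars directory.
# Here we simulate a vulnerable installation so the check has something to find.
SPARK_JARS_DIR="${SPARK_JARS_DIR:-./demo-jars}"
mkdir -p "$SPARK_JARS_DIR"
touch "$SPARK_JARS_DIR/mesos-1.4.0-shaded-protobuf.jar"   # simulated vulnerable jar

# List any mesos artifacts; compare their versions against the fixed releases.
found=$(ls "$SPARK_JARS_DIR" | grep 'mesos-' || true)
echo "mesos jars present: $found"
```

Any hit older than the fixed releases would then be the jar to upgrade or replace.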
[jira] [Updated] (SPARK-40087) Support multiple Column drop in R
[ https://issues.apache.org/jira/browse/SPARK-40087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santosh Pingale updated SPARK-40087: Description: This is a follow-up to SPARK-39895. That PR previously attempted to adjust the R implementation as well to match signatures, but that part was removed and we focused only on getting the Python implementation to behave correctly. *{{The change supports the following operations:}}* {{df <- select(read.json(jsonPath), "name", "age")}} {{df$age2 <- df$age}} {{df1 <- drop(df, df$age, df$name)}} {{expect_equal(columns(df1), c("age2"))}} {{df1 <- drop(df, df$age, column("random"))}} {{expect_equal(columns(df1), c("name", "age2"))}} {{df1 <- drop(df, df$age, df$name)}} {{expect_equal(columns(df1), c("age2"))}} was: This is a follow-up to SPARK-39895. That PR previously attempted to adjust the R implementation as well to match signatures, but that part was removed and we focused only on getting the Python implementation to behave correctly. *{{The change supports the following operations:}}* {{df <- select(read.json(jsonPath), "name", "age")}} {{df$age2 <- df$age}} {{df1 <- drop(df, df$age, df$name)}} {{expect_equal(columns(df1), c("age2"))}} {{df1 <- drop(df, list(df$age, column("random")))}} {{expect_equal(columns(df1), c("name", "age2"))}} {{df1 <- drop(df, list(df$age, df$name))}} {{expect_equal(columns(df1), c("age2"))}} > Support multiple Column drop in R > - > > Key: SPARK-40087 > URL: https://issues.apache.org/jira/browse/SPARK-40087 > Project: Spark > Issue Type: New Feature > Components: R >Affects Versions: 3.3.0 >Reporter: Santosh Pingale >Priority: Minor > > This is a follow-up to SPARK-39895. That PR previously attempted to adjust > the R implementation as well to match signatures, but that part was removed > and we focused only on getting the Python implementation to behave correctly.
> *{{The change supports the following operations:}}* > {{df <- select(read.json(jsonPath), "name", "age")}} > {{df$age2 <- df$age}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > {{df1 <- drop(df, df$age, column("random"))}} > {{expect_equal(columns(df1), c("name", "age2"))}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580762#comment-17580762 ] Maksim Grinman edited comment on SPARK-38330 at 8/17/22 12:28 PM: -- Thanks for the response. I did try compiling myself from the Spark github repo at the v3.3.0 tagged commit (with the hadoop-aws jar added in the pom) and generating the python wheel to see what's in the python wheel, and none of the jars have cos in the name:
{code:java}
[110] → ls -l python/build/bdist.linux-x86_64/wheel/pyspark/jars
total 921296
-rw-r--r-- 1 maks staff 227K Aug 12 13:21 JLargeArrays-1.5.jar
-rw-r--r-- 1 maks staff 1.1M Aug 12 13:21 JTransforms-3.1.jar
-rw-r--r-- 1 maks staff 418K Aug 12 13:18 RoaringBitmap-0.9.25.jar
-rw-r--r-- 1 maks staff 68K Oct 4 2021 activation-1.1.1.jar
-rw-r--r-- 1 maks staff 179K Aug 12 13:25 aircompressor-0.21.jar
-rw-r--r-- 1 maks staff 1.1M Aug 12 13:21 algebra_2.12-2.0.1.jar
-rw-r--r-- 1 maks staff 19K Aug 12 13:25 annotations-17.0.0.jar
-rw-r--r-- 1 maks staff 330K Aug 12 13:23 antlr4-runtime-4.8.jar
-rw-r--r-- 1 maks staff 26K Aug 12 13:19 aopalliance-repackaged-2.6.1.jar
-rw-r--r-- 1 maks staff 76K Aug 12 13:28 arpack-2.2.1.jar
-rw-r--r-- 1 maks staff 1.1M Oct 4 2021 arpack_combined_all-0.1.jar
-rw-r--r-- 1 maks staff 107K Aug 12 13:23 arrow-format-7.0.0.jar
-rw-r--r-- 1 maks staff 106K Aug 12 13:23 arrow-memory-core-7.0.0.jar
-rw-r--r-- 1 maks staff 38K Aug 12 13:23 arrow-memory-netty-7.0.0.jar
-rw-r--r-- 1 maks staff 1.8M Aug 12 13:23 arrow-vector-7.0.0.jar
-rw-r--r-- 1 maks staff 20K Aug 12 13:19 audience-annotations-0.5.0.jar
-rw-r--r-- 1 maks staff 580K Aug 12 13:19 avro-1.11.0.jar
-rw-r--r-- 1 maks staff 181K Aug 12 13:19 avro-ipc-1.11.0.jar
-rw-r--r-- 1 maks staff 184K Aug 12 13:19 avro-mapred-1.11.0.jar
-rw-r--r-- 1 maks staff 216M Aug 12 13:19 aws-java-sdk-bundle-1.11.1026.jar
-rw-r--r-- 1 maks staff 194K Aug 12 13:21 blas-2.2.1.jar
-rw-r--r-- 1 maks staff 73K Aug 12 13:21 breeze-macros_2.12-1.2.jar
-rw-r--r-- 1 maks staff 13M Aug 12 13:21 breeze_2.12-1.2.jar
-rw-r--r-- 1 maks staff 3.2M Aug 12 13:21 cats-kernel_2.12-2.1.1.jar
-rw-r--r-- 1 maks staff 57K Aug 12 13:18 chill-java-0.10.0.jar
-rw-r--r-- 1 maks staff 207K Aug 12 13:18 chill_2.12-0.10.0.jar
-rw-r--r-- 1 maks staff 346K Aug 12 13:18 commons-codec-1.15.jar
-rw-r--r-- 1 maks staff 575K Oct 4 2021 commons-collections-3.2.2.jar
-rw-r--r-- 1 maks staff 734K Aug 12 13:19 commons-collections4-4.4.jar
-rw-r--r-- 1 maks staff 70K Aug 12 13:23 commons-compiler-3.0.16.jar
-rw-r--r-- 1 maks staff 994K Aug 12 13:19 commons-compress-1.21.jar
-rw-r--r-- 1 maks staff 162K Aug 12 13:18 commons-crypto-1.1.0.jar
-rw-r--r-- 1 maks staff 319K Aug 12 13:18 commons-io-2.11.0.jar
-rw-r--r-- 1 maks staff 278K Jan 15 2021 commons-lang-2.6.jar
-rw-r--r-- 1 maks staff 574K Aug 12 13:18 commons-lang3-3.12.0.jar
-rw-r--r-- 1 maks staff 61K Oct 4 2021 commons-logging-1.1.3.jar
-rw-r--r-- 1 maks staff 2.1M Aug 12 13:19 commons-math3-3.6.1.jar
-rw-r--r-- 1 maks staff 211K Aug 12 13:18 commons-text-1.9.jar
-rw-r--r-- 1 maks staff 80K Aug 12 13:19 compress-lzf-1.1.jar
-rw-r--r-- 1 maks staff 161K Oct 4 2021 core-1.1.2.jar
-rw-r--r-- 1 maks staff 2.3M Aug 12 13:19 curator-client-2.13.0.jar
-rw-r--r-- 1 maks staff 197K Aug 12 13:19 curator-framework-2.13.0.jar
-rw-r--r-- 1 maks staff 277K Aug 12 13:19 curator-recipes-2.13.0.jar
-rw-r--r-- 1 maks staff 63K Aug 12 13:23 flatbuffers-java-1.12.0.jar
-rw-r--r-- 1 maks staff 235K Aug 12 13:18 gson-2.8.6.jar
-rw-r--r-- 1 maks staff 2.1M Oct 4 2021 guava-14.0.1.jar
-rw-r--r-- 1 maks staff 940K Aug 12 13:19 hadoop-aws-3.3.2.jar
-rw-r--r-- 1 maks staff 19M Aug 12 13:19 hadoop-client-api-3.3.2.jar
-rw-r--r-- 1 maks staff 29M Aug 12 13:19 hadoop-client-runtime-3.3.2.jar
-rw-r--r-- 1 maks staff 3.2M Aug 12 14:02 hadoop-shaded-guava-1.1.1.jar
-rw-r--r-- 1 maks staff 55K Aug 12 14:02 hadoop-yarn-server-web-proxy-3.3.2.jar
-rw-r--r-- 1 maks staff 231K Aug 12 13:25 hive-storage-api-2.7.2.jar
-rw-r--r-- 1 maks staff 196K Aug 12 13:19 hk2-api-2.6.1.jar
-rw-r--r-- 1 maks staff 199K Aug 12 13:19 hk2-locator-2.6.1.jar
-rw-r--r-- 1 maks staff 129K Aug 12 13:19 hk2-utils-2.6.1.jar
-rw-r--r-- 1 maks staff 27K Aug 12 13:28 istack-commons-runtime-3.0.8.jar
-rw-r--r-- 1 maks staff 1.3M Aug 12 13:19 ivy-2.5.0.jar
-rw-r--r-- 1 maks staff 74K Aug 12 13:18 jackson-annotations-2.13.3.jar
-rw-r--r-- 1 maks staff 366K Aug 12 13:18 jackson-core-2.13.3.jar
-rw-r--r-- 1 maks staff 1.5M Aug 12 13:18 jackson-databind-2.13.3.jar
-rw-r--r-- 1 maks staff 448K Aug 12 13:19 jackson-module-scala_2.12-2.13.3.jar
-rw-r--r-- 1 maks staff 24K Aug 12 13:19 jakarta.annotation-api-1.3.5.jar
-rw-r--r-- 1 maks staff 18K Aug 12 13:19 jakarta.inject-2.6.1.jar
-rw-r--r-- 1 maks staff 81K Aug 12 13:19 jakarta.servlet-api-4.0.3.jar
-rw-r--r-- 1 maks staff 90K Aug 12 13:19 jakarta.validation-api-2.0.2.jar
-rw-r--r-- 1
[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580762#comment-17580762 ] Maksim Grinman commented on SPARK-38330: Thanks for the response. I did try compiling myself and generating the python wheel in Python 3.3 to see what's in the python wheel and none of them have cos in the name: ``` [110] → ls -l python/build/bdist.linux-x86_64/wheel/pyspark/jars total 921296 -rw-r--r-- 1 maks staff 227K Aug 12 13:21 JLargeArrays-1.5.jar -rw-r--r-- 1 maks staff 1.1M Aug 12 13:21 JTransforms-3.1.jar -rw-r--r-- 1 maks staff 418K Aug 12 13:18 RoaringBitmap-0.9.25.jar -rw-r--r-- 1 maks staff 68K Oct 4 2021 activation-1.1.1.jar -rw-r--r-- 1 maks staff 179K Aug 12 13:25 aircompressor-0.21.jar -rw-r--r-- 1 maks staff 1.1M Aug 12 13:21 algebra_2.12-2.0.1.jar -rw-r--r-- 1 maks staff 19K Aug 12 13:25 annotations-17.0.0.jar -rw-r--r-- 1 maks staff 330K Aug 12 13:23 antlr4-runtime-4.8.jar -rw-r--r-- 1 maks staff 26K Aug 12 13:19 aopalliance-repackaged-2.6.1.jar -rw-r--r-- 1 maks staff 76K Aug 12 13:28 arpack-2.2.1.jar -rw-r--r-- 1 maks staff 1.1M Oct 4 2021 arpack_combined_all-0.1.jar -rw-r--r-- 1 maks staff 107K Aug 12 13:23 arrow-format-7.0.0.jar -rw-r--r-- 1 maks staff 106K Aug 12 13:23 arrow-memory-core-7.0.0.jar -rw-r--r-- 1 maks staff 38K Aug 12 13:23 arrow-memory-netty-7.0.0.jar -rw-r--r-- 1 maks staff 1.8M Aug 12 13:23 arrow-vector-7.0.0.jar -rw-r--r-- 1 maks staff 20K Aug 12 13:19 audience-annotations-0.5.0.jar -rw-r--r-- 1 maks staff 580K Aug 12 13:19 avro-1.11.0.jar -rw-r--r-- 1 maks staff 181K Aug 12 13:19 avro-ipc-1.11.0.jar -rw-r--r-- 1 maks staff 184K Aug 12 13:19 avro-mapred-1.11.0.jar -rw-r--r-- 1 maks staff 216M Aug 12 13:19 aws-java-sdk-bundle-1.11.1026.jar -rw-r--r-- 1 maks staff 194K Aug 12 13:21 blas-2.2.1.jar -rw-r--r-- 1 maks staff 73K Aug 12 13:21 breeze-macros_2.12-1.2.jar -rw-r--r-- 1 maks staff 13M Aug 12 13:21 breeze_2.12-1.2.jar -rw-r--r-- 1 maks staff 3.2M Aug 12 13:21 
cats-kernel_2.12-2.1.1.jar -rw-r--r-- 1 maks staff 57K Aug 12 13:18 chill-java-0.10.0.jar -rw-r--r-- 1 maks staff 207K Aug 12 13:18 chill_2.12-0.10.0.jar -rw-r--r-- 1 maks staff 346K Aug 12 13:18 commons-codec-1.15.jar -rw-r--r-- 1 maks staff 575K Oct 4 2021 commons-collections-3.2.2.jar -rw-r--r-- 1 maks staff 734K Aug 12 13:19 commons-collections4-4.4.jar -rw-r--r-- 1 maks staff 70K Aug 12 13:23 commons-compiler-3.0.16.jar -rw-r--r-- 1 maks staff 994K Aug 12 13:19 commons-compress-1.21.jar -rw-r--r-- 1 maks staff 162K Aug 12 13:18 commons-crypto-1.1.0.jar -rw-r--r-- 1 maks staff 319K Aug 12 13:18 commons-io-2.11.0.jar -rw-r--r-- 1 maks staff 278K Jan 15 2021 commons-lang-2.6.jar -rw-r--r-- 1 maks staff 574K Aug 12 13:18 commons-lang3-3.12.0.jar -rw-r--r-- 1 maks staff 61K Oct 4 2021 commons-logging-1.1.3.jar -rw-r--r-- 1 maks staff 2.1M Aug 12 13:19 commons-math3-3.6.1.jar -rw-r--r-- 1 maks staff 211K Aug 12 13:18 commons-text-1.9.jar -rw-r--r-- 1 maks staff 80K Aug 12 13:19 compress-lzf-1.1.jar -rw-r--r-- 1 maks staff 161K Oct 4 2021 core-1.1.2.jar -rw-r--r-- 1 maks staff 2.3M Aug 12 13:19 curator-client-2.13.0.jar -rw-r--r-- 1 maks staff 197K Aug 12 13:19 curator-framework-2.13.0.jar -rw-r--r-- 1 maks staff 277K Aug 12 13:19 curator-recipes-2.13.0.jar -rw-r--r-- 1 maks staff 63K Aug 12 13:23 flatbuffers-java-1.12.0.jar -rw-r--r-- 1 maks staff 235K Aug 12 13:18 gson-2.8.6.jar -rw-r--r-- 1 maks staff 2.1M Oct 4 2021 guava-14.0.1.jar -rw-r--r-- 1 maks staff 940K Aug 12 13:19 hadoop-aws-3.3.2.jar -rw-r--r-- 1 maks staff 19M Aug 12 13:19 hadoop-client-api-3.3.2.jar -rw-r--r-- 1 maks staff 29M Aug 12 13:19 hadoop-client-runtime-3.3.2.jar -rw-r--r-- 1 maks staff 3.2M Aug 12 14:02 hadoop-shaded-guava-1.1.1.jar -rw-r--r-- 1 maks staff 55K Aug 12 14:02 hadoop-yarn-server-web-proxy-3.3.2.jar -rw-r--r-- 1 maks staff 231K Aug 12 13:25 hive-storage-api-2.7.2.jar -rw-r--r-- 1 maks staff 196K Aug 12 13:19 hk2-api-2.6.1.jar -rw-r--r-- 1 maks staff 199K Aug 12 13:19 
hk2-locator-2.6.1.jar -rw-r--r-- 1 maks staff 129K Aug 12 13:19 hk2-utils-2.6.1.jar -rw-r--r-- 1 maks staff 27K Aug 12 13:28 istack-commons-runtime-3.0.8.jar -rw-r--r-- 1 maks staff 1.3M Aug 12 13:19 ivy-2.5.0.jar -rw-r--r-- 1 maks staff 74K Aug 12 13:18 jackson-annotations-2.13.3.jar -rw-r--r-- 1 maks staff 366K Aug 12 13:18 jackson-core-2.13.3.jar -rw-r--r-- 1 maks staff 1.5M Aug 12 13:18 jackson-databind-2.13.3.jar -rw-r--r-- 1 maks staff 448K Aug 12 13:19 jackson-module-scala_2.12-2.13.3.jar -rw-r--r-- 1 maks staff 24K Aug 12 13:19 jakarta.annotation-api-1.3.5.jar -rw-r--r-- 1 maks staff 18K Aug 12 13:19 jakarta.inject-2.6.1.jar -rw-r--r-- 1 maks staff 81K Aug 12
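The check the commenter performed by eye can be scripted. This is a sketch of that same grep; the jars path below is the one quoted in the comment and will differ for other build trees, and the fallback message is our own wording:

```shell
# Look through the wheel's bundled jars for any cos-related connector jar.
# JARS_DIR mirrors the path from the comment above; adjust to your build tree.
JARS_DIR="python/build/bdist.linux-x86_64/wheel/pyspark/jars"
result=$(ls "$JARS_DIR" 2>/dev/null | grep -i 'cos' || echo "no cos-related jars bundled")
echo "$result"
```

Running this against the listing above would print the fallback message, matching the commenter's conclusion that no cos jar is bundled.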
[jira] [Updated] (SPARK-40122) py4j-0.10.9.5 often produces "Connection reset by peer" in Spark 3.3.0
[ https://issues.apache.org/jira/browse/SPARK-40122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ihor Bobak updated SPARK-40122: --- Description: Without any visible reason I am getting this error in my Jupyter notebook (see stacktrace below) with pyspark kernel. Often it occurs even if no Spark operations are made, e.g. when I am working with multiprocessing Pool for a local piece of code that should parallelize on the cores of the driver, with no spark transformations/actions done in that jupyter cell. INFO:py4j.clientserver:Error while sending or receiving. Traceback (most recent call last): File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command self.socket.sendall(command.encode("utf-8")) ConnectionResetError: [Errno 104] Connection reset by peer INFO:py4j.clientserver:Closing down clientserver connection INFO:root:Exception while sending command. Traceback (most recent call last): File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command self.socket.sendall(command.encode("utf-8")) ConnectionResetError: [Errno 104] Connection reset by peer During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1038, in send_command response = connection.send_command(command) File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 506, in send_command raise Py4JNetworkError( py4j.protocol.Py4JNetworkError: Error while sending INFO:py4j.clientserver:Closing down clientserver connection was: Without any visible reason I am getting this error in my Jupyter notebook (see stacktrace below) with pyspark kernel. Often it occurs even if no Spark operations are made, e.g. 
when I am working with multiprocessing Pool for a local piece of code that should parallelize on the cores of the driver, with no spark transformations/actions done in that jupyter cell. {{INFO:py4j.clientserver:Error while sending or receiving. Traceback (most recent call last): File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command self.socket.sendall(command.encode("utf-8")) ConnectionResetError: [Errno 104] Connection reset by peer INFO:py4j.clientserver:Closing down clientserver connection INFO:root:Exception while sending command. Traceback (most recent call last): File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command self.socket.sendall(command.encode("utf-8")) ConnectionResetError: [Errno 104] Connection reset by peer During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1038, in send_command response = connection.send_command(command) File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 506, in send_command raise Py4JNetworkError( py4j.protocol.Py4JNetworkError: Error while sending INFO:py4j.clientserver:Closing down clientserver connection }} > py4j-0.10.9.5 often produces "Connection reset by peer" in Spark 3.3.0 > --- > > Key: SPARK-40122 > URL: https://issues.apache.org/jira/browse/SPARK-40122 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Ihor Bobak >Priority: Major > > Without any visible reason I am getting this error in my Jupyter notebook > (see stacktrace below) with pyspark kernel. Often it occurs even if no Spark > operations are made, e.g. 
when I am working with multiprocessing Pool for a > local piece of code that should parallelize on the cores of the driver, with > no spark transformations/actions done in that jupyter cell. > INFO:py4j.clientserver:Error while sending or receiving. > Traceback (most recent call last): > File > "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", > line 503, in send_command > self.socket.sendall(command.encode("utf-8")) > ConnectionResetError: [Errno 104] Connection reset by peer > INFO:py4j.clientserver:Closing down clientserver connection > INFO:root:Exception while sending command. > Traceback (most recent call last): > File > "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", > line 503, in send_command > self.socket.sendall(command.encode("utf-8")) > ConnectionResetError: [Errno 104]
[jira] [Created] (SPARK-40122) py4j-0.10.9.5 often produces "Connection reset by peer" in Spark 3.3.0
Ihor Bobak created SPARK-40122:
--
Summary: py4j-0.10.9.5 often produces "Connection reset by peer" in Spark 3.3.0
Key: SPARK-40122
URL: https://issues.apache.org/jira/browse/SPARK-40122
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.3.0
Reporter: Ihor Bobak

Without any visible reason I am getting this error in my Jupyter notebook (see stacktrace below) with pyspark kernel. Often it occurs even if no Spark operations are made, e.g. when I am working with multiprocessing Pool for a local piece of code that should parallelize on the cores of the driver, with no spark transformations/actions done in that jupyter cell.

{{INFO:py4j.clientserver:Error while sending or receiving.
Traceback (most recent call last):
  File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command
    self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer
INFO:py4j.clientserver:Closing down clientserver connection
INFO:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command
    self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 506, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending
INFO:py4j.clientserver:Closing down clientserver connection}}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
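[Editorial note] The reporter observes the error while using a local multiprocessing Pool inside the pyspark driver. One commonly suggested mitigation — an assumption on my part, not something confirmed in this ticket — is to use the "spawn" start method, so that pool workers start as fresh interpreters instead of forking the driver process and inheriting its open py4j socket. A minimal, Spark-free sketch:

```python
import multiprocessing as mp


def square(x):
    # Plain CPU-bound work; no Spark involvement.
    return x * x


def run_pool():
    # "spawn" starts fresh interpreters rather than forking the current
    # process, so the pool workers never inherit the driver's py4j socket
    # file descriptor (a forked child closing that descriptor is one way
    # the gateway connection can be reset).
    ctx = mp.get_context("spawn")
    with ctx.Pool(4) as pool:
        return pool.map(square, range(8))


if __name__ == "__main__":
    print(run_pool())  # → [0, 1, 4, 9, 16, 25, 36, 49]
```

Whether this addresses the reporter's exact failure would need verification against py4j 0.10.9.x; it is offered only as a low-risk experiment.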
[jira] [Commented] (SPARK-40114) Arrow 9.0.0 support with SparkR
[ https://issues.apache.org/jira/browse/SPARK-40114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580731#comment-17580731 ] Apache Spark commented on SPARK-40114: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37553 > Arrow 9.0.0 support with SparkR > --- > > Key: SPARK-40114 > URL: https://issues.apache.org/jira/browse/SPARK-40114 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > == Failed > == > -- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization > - > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:103:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:133:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. testthat::expect_true(all(collect(ret) == rdf)) >at test_sparkSQL_arrow.R:143:2 > 5. SparkR::collect(ret) > 6. SparkR (local) .local(x, ...) > 10. SparkR:::readRaw(conn) > 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization > - > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. 
SparkR::collect(ret) > at test_sparkSQL_arrow.R:184:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:217:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. testthat::expect_true(all(collect(ret) == rdf)) >at test_sparkSQL_arrow.R:229:2 > 5. SparkR::collect(ret) > 6. SparkR (local) .local(x, ...) > 10. SparkR:::readRaw(conn) > 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow > optimiz > `count(...)` threw an error with unexpected message. > Expected match: "expected IntegerType, IntegerType, got IntegerType, > StringType" > Actual message: "org.apache.spark.SparkException: Job aborted due to stage > failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task > 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): > org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced > errors: The tzdb package is not installed. 
Timezones will not be available to > Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : > write_arrow has been removed\nCalls: -> writeRaw -> writeInt -> > writeBin -> \nExecution halted\n\r\n\tat > org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat > > org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat > > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat > > org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat > > org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat > > org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat >
[jira] [Assigned] (SPARK-40114) Arrow 9.0.0 support with SparkR
[ https://issues.apache.org/jira/browse/SPARK-40114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40114: Assignee: (was: Apache Spark) > Arrow 9.0.0 support with SparkR > --- > > Key: SPARK-40114 > URL: https://issues.apache.org/jira/browse/SPARK-40114 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > == Failed > == > -- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization > - > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:103:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:133:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. testthat::expect_true(all(collect(ret) == rdf)) >at test_sparkSQL_arrow.R:143:2 > 5. SparkR::collect(ret) > 6. SparkR (local) .local(x, ...) > 10. SparkR:::readRaw(conn) > 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization > - > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:184:2 > 2. SparkR::collect(ret) > 3. 
SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:217:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. testthat::expect_true(all(collect(ret) == rdf)) >at test_sparkSQL_arrow.R:229:2 > 5. SparkR::collect(ret) > 6. SparkR (local) .local(x, ...) > 10. SparkR:::readRaw(conn) > 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow > optimiz > `count(...)` threw an error with unexpected message. > Expected match: "expected IntegerType, IntegerType, got IntegerType, > StringType" > Actual message: "org.apache.spark.SparkException: Job aborted due to stage > failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task > 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): > org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced > errors: The tzdb package is not installed. 
Timezones will not be available to > Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : > write_arrow has been removed\nCalls: -> writeRaw -> writeInt -> > writeBin -> \nExecution halted\n\r\n\tat > org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat > > org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat > > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat > > org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat > > org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat > > org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.hashAgg_doAggregateWithoutKey_0$(Unknown > Source)\r\n\tat >
[jira] [Assigned] (SPARK-40114) Arrow 9.0.0 support with SparkR
[ https://issues.apache.org/jira/browse/SPARK-40114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40114: Assignee: Apache Spark > Arrow 9.0.0 support with SparkR > --- > > Key: SPARK-40114 > URL: https://issues.apache.org/jira/browse/SPARK-40114 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > {code} > == Failed > == > -- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization > - > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:103:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:133:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. testthat::expect_true(all(collect(ret) == rdf)) >at test_sparkSQL_arrow.R:143:2 > 5. SparkR::collect(ret) > 6. SparkR (local) .local(x, ...) > 10. SparkR:::readRaw(conn) > 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization > - > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:184:2 > 2. SparkR::collect(ret) > 3. 
SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. SparkR::collect(ret) > at test_sparkSQL_arrow.R:217:2 > 2. SparkR::collect(ret) > 3. SparkR (local) .local(x, ...) > 7. SparkR:::readRaw(conn) > 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type > sp > Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid > 'n' argument > Backtrace: > 1. testthat::expect_true(all(collect(ret) == rdf)) >at test_sparkSQL_arrow.R:229:2 > 5. SparkR::collect(ret) > 6. SparkR (local) .local(x, ...) > 10. SparkR:::readRaw(conn) > 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big") > -- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow > optimiz > `count(...)` threw an error with unexpected message. > Expected match: "expected IntegerType, IntegerType, got IntegerType, > StringType" > Actual message: "org.apache.spark.SparkException: Job aborted due to stage > failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task > 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): > org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced > errors: The tzdb package is not installed. 
Timezones will not be available to > Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : > write_arrow has been removed\nCalls: -> writeRaw -> writeInt -> > writeBin -> \nExecution halted\n\r\n\tat > org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat > > org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat > > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat > > org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat > > org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat > > org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.hashAgg_doAggregateWithoutKey_0$(Unknown >
[jira] [Assigned] (SPARK-40121) Initialize projection used for Python UDF
[ https://issues.apache.org/jira/browse/SPARK-40121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40121: Assignee: Apache Spark > Initialize projection used for Python UDF > - > > Key: SPARK-40121 > URL: https://issues.apache.org/jira/browse/SPARK-40121 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > {code} > >>> from pyspark.sql.functions import udf, rand > >>> spark.range(10).select(udf(lambda x: x)(rand())).show() > {code} > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > at > scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161) > at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176) > at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40121) Initialize projection used for Python UDF
[ https://issues.apache.org/jira/browse/SPARK-40121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40121: Assignee: (was: Apache Spark) > Initialize projection used for Python UDF > - > > Key: SPARK-40121 > URL: https://issues.apache.org/jira/browse/SPARK-40121 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > >>> from pyspark.sql.functions import udf, rand > >>> spark.range(10).select(udf(lambda x: x)(rand())).show() > {code} > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > at > scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161) > at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176) > at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40121) Initialize projection used for Python UDF
[ https://issues.apache.org/jira/browse/SPARK-40121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580714#comment-17580714 ] Apache Spark commented on SPARK-40121: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37552 > Initialize projection used for Python UDF > - > > Key: SPARK-40121 > URL: https://issues.apache.org/jira/browse/SPARK-40121 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > >>> from pyspark.sql.functions import udf, rand > >>> spark.range(10).select(udf(lambda x: x)(rand())).show() > {code} > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > at > scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161) > at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176) > at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38954) Implement sharing of cloud credentials among driver and executors
[ https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580711#comment-17580711 ]

Steve Loughran commented on SPARK-38954:
--

any plans to put the PR up? i'm curious about what you've done.

The hadoop s3a delegation tokens can be used to collect credentials and encryption secrets at spark launch and pass them to workers, though there's no mechanism to update tokens during the life of a session. you might want to look at this code, and experiment with it.

if you are doing your own provider, do update credentials at least 30s before they expire, and add some sync blocks so that 30 threads don't all try and do it independently.

> Implement sharing of cloud credentials among driver and executors
> -
>
> Key: SPARK-38954
> URL: https://issues.apache.org/jira/browse/SPARK-38954
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Parth Chandra
> Priority: Major
>
> Currently Spark uses external implementations (e.g. hadoop-aws) to access cloud services like S3. In order to access the actual service, these implementations use credentials provider implementations that obtain credentials to allow access to the cloud service.
> These credentials are typically session credentials, which means that they expire after a fixed time. Sometimes this expiry can be only an hour, and for a Spark job that runs for many hours (or a Spark streaming job that runs continuously) the credentials have to be renewed periodically.
> In many organizations, the process of getting credentials may be multi-step. The organization has an identity provider service that provides authentication for the user, while the cloud service provider provides authorization for the roles the user has access to. Once the user is authenticated and her role verified, the credentials are generated for a new session.
> In a large setup with hundreds of Spark jobs and thousands of executors, each executor is then spending a lot of time getting credentials, and this may put unnecessary load on the backend authentication services.
> To alleviate this, we can use Spark's architecture to obtain the credentials once in the driver and push the credentials to the executors. In addition, the driver can check the expiry of the credentials and push updated credentials to the executors. This is relatively easy to do since the RPC mechanism to implement this is already in place and is used similarly for Kerberos delegation tokens.
>
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
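[Editorial note] Steve's advice above — refresh ahead of expiry, with synchronization so that concurrent threads don't each hit the auth backend — can be sketched in a few lines. This is an illustration only, not code from the (unposted) PR; `Credentials`, the `fetch` callable, and the 30-second margin are stand-ins:

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class Credentials:
    token: str
    expires_at: float  # epoch seconds


class RefreshingCredentialsProvider:
    """Hand back cached credentials; refresh ahead of expiry, one thread at a time."""

    REFRESH_MARGIN_S = 30.0  # renew at least 30s before expiry

    def __init__(self, fetch):
        self._fetch = fetch              # callable returning fresh Credentials
        self._lock = threading.Lock()
        self._creds = None

    def _stale(self, creds):
        return creds is None or creds.expires_at - time.time() <= self.REFRESH_MARGIN_S

    def get(self):
        creds = self._creds
        if not self._stale(creds):
            return creds                 # fast path, no locking
        with self._lock:
            # Re-check under the lock so 30 waiting threads don't all
            # refresh independently against the backend auth service.
            if self._stale(self._creds):
                self._creds = self._fetch()
            return self._creds
```

A driver-side provider along these lines could fetch once and push the result to executors over the existing RPC channel, as the ticket proposes.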
[jira] [Created] (SPARK-40121) Initialize projection used for Python UDF
Hyukjin Kwon created SPARK-40121:
--
Summary: Initialize projection used for Python UDF
Key: SPARK-40121
URL: https://issues.apache.org/jira/browse/SPARK-40121
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 3.2.2, 3.3.0, 3.1.3, 3.4.0
Reporter: Hyukjin Kwon

{code}
>>> from pyspark.sql.functions import udf, rand
>>> spark.range(10).select(udf(lambda x: x)(rand())).show()
{code}
{code}
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161)
	at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
	at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213)
{code}
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38445) Are hadoop committers used in Structured Streaming?
[ https://issues.apache.org/jira/browse/SPARK-38445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580710#comment-17580710 ] Steve Loughran commented on SPARK-38445: SPARK-40039 might address this > Are hadoop committers used in Structured Streaming? > --- > > Key: SPARK-38445 > URL: https://issues.apache.org/jira/browse/SPARK-38445 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Martin Andersson >Priority: Major > Labels: structured-streaming > > At the company I work at we're using Spark Structured Streaming to sink > messages on kafka to HDFS. We're in the late stages of migrating this > component to instead sink messages to AWS S3, and in connection with that we > hit upon a couple of issues regarding hadoop committers. > I've come to understand that the default "file" committer (documented > [here|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#Switching_to_an_S3A_Committer]) > is unsafe to use in S3, which is why [this page in the spark > documentation|https://spark.apache.org/docs/3.2.1/cloud-integration.html] > recommends using the "directory" (i.e. staging) committer, and in later > versions of hadoop they also recommend to use the "magic" committer. > However, it's not clear whether spark structured streaming even use > committers. There's no "_SUCCESS" file in destination (as compared to normal > spark jobs), and the documentation regarding committers used in streaming is > non-existent. > Can anyone please shed some light on this? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
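[Editorial note] For context on the "switching to an S3A committer" step the question references: selecting a committer is a configuration change along these lines. This is a sketch based on the cloud-integration page cited above, not a confirmed answer to whether Structured Streaming uses these committers; property names and values should be checked against your exact Spark/Hadoop versions.

```
# Select the S3A committer ("directory", "partitioned", or "magic")
spark.hadoop.fs.s3a.committer.name           magic
spark.hadoop.fs.s3a.committer.magic.enabled  true
# Bind Spark's commit protocol to Hadoop's PathOutputCommitter machinery
spark.sql.sources.commitProtocolClass        org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class     org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```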
[jira] [Assigned] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition
[ https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40105: --- Assignee: XiDuo You > Improve repartition in ReplaceCTERefWithRepartition > --- > > Key: SPARK-40105 > URL: https://issues.apache.org/jira/browse/SPARK-40105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Fix For: 3.4.0 > > > If a CTE cannot be inlined, ReplaceCTERefWithRepartition will add a repartition > to force a shuffle so that the references can reuse the shuffle exchange. > The added repartition should be optimized by AQE for better performance. > If the user has already specified a rebalance, ReplaceCTERefWithRepartition > should skip adding the repartition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580707#comment-17580707 ] Steve Loughran edited comment on SPARK-38330 at 8/17/22 9:46 AM: - bq. Is there a way to work-around this issue while waiting for a version of Spark which uses hadoop 3.3.4 (Spark 3.4?) remove all jars with cos in the title from your classpath note, emr is unaffected by this. so are cloudera products, primarily because they never backported the cos module. this is why it didn't show up in those tests. was (Author: ste...@apache.org): bq. Is there a way to work-around this issue while waiting for a version of Spark which uses hadoop 3.3.4 (Spark 3.4?) remove all jars with cos in the title from your classpath > Certificate doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > -- > > Key: SPARK-38330 > URL: https://issues.apache.org/jira/browse/SPARK-38330 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 3.2.1 > Environment: Spark 3.2.1 built with `hadoop-cloud` flag. > Direct access to s3 using default file committer. > JDK8. > >Reporter: André F. 
>Priority: Major > > Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1, > lead us to the current exception while reading files on s3: > {code:java} > org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on > s3a:///.parquet: com.amazonaws.SdkClientException: Unable to > execute HTTP request: Certificate for doesn't match > any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: > Unable to execute HTTP request: Certificate for doesn't match any of > the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code} > > {code:java} > Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for > doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507) > at > 
com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384) > at > com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) > at > com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) > at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) > at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) > at > com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) > at >
[jira] [Resolved] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition
[ https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40105. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37537 [https://github.com/apache/spark/pull/37537] > Improve repartition in ReplaceCTERefWithRepartition > --- > > Key: SPARK-40105 > URL: https://issues.apache.org/jira/browse/SPARK-40105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Minor > Fix For: 3.4.0 > > > If cte can not inlined, the ReplaceCTERefWithRepartition will add repartition > to force a shuffle so that the reference can reuse shuffle exchange. > The added repartition should be optimized by AQE for better performance. > If the user has specified a rebalance, the ReplaceCTERefWithRepartition > should skip add repartition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
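To illustrate the skip condition described above, here is a toy sketch (hypothetical `Plan` class, not Spark's Catalyst API): wrap a non-inlined CTE definition in a repartition so its references can reuse the shuffle exchange, unless the user's plan already rebalances or repartitions at the top.

```python
# Toy plan-tree sketch of the rule's skip condition. The node names and
# Plan class are hypothetical stand-ins for Catalyst logical plan nodes.
class Plan:
    def __init__(self, name, child=None):
        self.name, self.child = name, child

def replace_cte_with_repartition(cte_def: Plan) -> Plan:
    # Skip the extra shuffle when one already sits at the top of the CTE plan,
    # e.g. the user specified a rebalance themselves.
    if cte_def.name in ("Repartition", "RebalancePartitions"):
        return cte_def
    return Plan("Repartition", cte_def)

plain = Plan("Project", Plan("Scan"))
rebalanced = Plan("RebalancePartitions", Plan("Scan"))
print(replace_cte_with_repartition(plain).name)       # Repartition
print(replace_cte_with_repartition(rebalanced).name)  # RebalancePartitions
```

In the real rule the added `Repartition` can then be tuned by AQE, which is the performance improvement the issue describes.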
[jira] [Comment Edited] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580707#comment-17580707 ] Steve Loughran edited comment on SPARK-38330 at 8/17/22 9:45 AM: - bq. Is there a way to work-around this issue while waiting for a version of Spark which uses hadoop 3.3.4 (Spark 3.4?) remove all jars with cos in the title from your classpath was (Author: ste...@apache.org): remove all jars with cos in the title from your classpath > Certificate doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > -- > > Key: SPARK-38330 > URL: https://issues.apache.org/jira/browse/SPARK-38330 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 3.2.1 > Environment: Spark 3.2.1 built with `hadoop-cloud` flag. > Direct access to s3 using default file committer. > JDK8. > >Reporter: André F. >Priority: Major > > Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1, > lead us to the current exception while reading files on s3: > {code:java} > org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on > s3a:///.parquet: com.amazonaws.SdkClientException: Unable to > execute HTTP request: Certificate for doesn't match > any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: > Unable to execute HTTP request: Certificate for doesn't match any of > the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54) > at > 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code} > > {code:java} > Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for > doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384) > at > com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) > at > com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) > at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) > at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) > at > 
com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) > at > com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) > at > com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) > at > com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) > at >
[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580707#comment-17580707 ] Steve Loughran commented on SPARK-38330: remove all jars with cos in the title from your classpath > Certificate doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > -- > > Key: SPARK-38330 > URL: https://issues.apache.org/jira/browse/SPARK-38330 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 3.2.1 > Environment: Spark 3.2.1 built with `hadoop-cloud` flag. > Direct access to s3 using default file committer. > JDK8. > >Reporter: André F. >Priority: Major > > Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1, > lead us to the current exception while reading files on s3: > {code:java} > org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on > s3a:///.parquet: com.amazonaws.SdkClientException: Unable to > execute HTTP request: Certificate for doesn't match > any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: > Unable to execute HTTP request: Certificate for doesn't match any of > the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) > at > 
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code} > > {code:java} > Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for > doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384) > at > com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) > at > com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) > at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) > at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) > at > com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) > at > 
com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) > at > com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) > at > com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333) > at >
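The `SSLPeerUnverifiedException` above is hostname verification failing: a single `*` in a certificate SAN matches exactly one DNS label, so a bucket name containing dots (one common trigger for this error, though the hostname is redacted in the report) produces a virtual-hosted-style host that `*.s3.amazonaws.com` cannot cover. A minimal sketch of that matching rule (hypothetical helper, not the AWS SDK's implementation):

```python
def san_matches(hostname: str, san: str) -> bool:
    """Check a hostname against a certificate SAN entry.

    A leading '*' wildcard matches exactly one DNS label (per RFC 6125),
    so 'my.bucket.s3.amazonaws.com' does NOT match '*.s3.amazonaws.com'.
    """
    if san.startswith("*."):
        suffix = san[1:]  # ".s3.amazonaws.com"
        if not hostname.endswith(suffix):
            return False
        label = hostname[: -len(suffix)]
        return bool(label) and "." not in label  # wildcard spans one label only
    return hostname == san

# A bucket name without dots resolves to a host the wildcard covers...
print(san_matches("mybucket.s3.amazonaws.com", "*.s3.amazonaws.com"))   # True
# ...but a dotted bucket name adds an extra label and fails verification.
print(san_matches("my.bucket.s3.amazonaws.com", "*.s3.amazonaws.com"))  # False
```

The bucket names here are invented for illustration; the affected bucket in the report is redacted.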
[jira] [Created] (SPARK-40120) Make pyspark.sql.readwriter examples self-contained
Hyukjin Kwon created SPARK-40120: Summary: Make pyspark.sql.readwriter examples self-contained Key: SPARK-40120 URL: https://issues.apache.org/jira/browse/SPARK-40120 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40050) Eliminate the Sort if there is a LocalLimit between Join and Sort
[ https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40050: Summary: Eliminate the Sort if there is a LocalLimit between Join and Sort (was: Eliminate sort if parent is local limit) > Eliminate the Sort if there is a LocalLimit between Join and Sort > - > > Key: SPARK-40050 > URL: https://issues.apache.org/jira/browse/SPARK-40050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > It seems we can remove Sort operator: > {code:scala} > val projectPlan = testRelation.select($"a", $"b") > val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc) > val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan) > val projectPlanB = testRelationB.select($"d") > val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", > $"d") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40119) Add reason for cancelJobGroup
Santosh Pingale created SPARK-40119: --- Summary: Add reason for cancelJobGroup Key: SPARK-40119 URL: https://issues.apache.org/jira/browse/SPARK-40119 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Santosh Pingale Currently, `cancelJob` supports passing the reason for failure. We use `cancelJobGroup` in a few cases involving async actions. It would be great to pass the reason for cancellation to the job group as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
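A minimal in-memory sketch (plain Python, not Spark's scheduler; all names are hypothetical) of what propagating a cancellation reason from a job group down to its member jobs could look like, mirroring the reason parameter that `cancelJob` already accepts:

```python
# Toy stand-in for a scheduler that tracks jobs by group and records a
# per-job cancellation reason when the whole group is cancelled.
class Scheduler:
    def __init__(self):
        self.jobs = {}       # job_id -> group_id
        self.cancelled = {}  # job_id -> reason

    def submit(self, job_id, group_id):
        self.jobs[job_id] = group_id

    def cancel_job_group(self, group_id, reason=None):
        # The proposal: thread 'reason' through to every job in the group,
        # just as cancelJob(jobId, reason) does for a single job.
        for job_id, gid in self.jobs.items():
            if gid == group_id:
                self.cancelled[job_id] = reason

s = Scheduler()
s.submit(1, "etl"); s.submit(2, "etl"); s.submit(3, "adhoc")
s.cancel_job_group("etl", reason="upstream data missing")
print(s.cancelled)  # {1: 'upstream data missing', 2: 'upstream data missing'}
```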
[jira] [Assigned] (SPARK-40050) Eliminate sort if parent is local limit
[ https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40050: Assignee: Apache Spark > Eliminate sort if parent is local limit > --- > > Key: SPARK-40050 > URL: https://issues.apache.org/jira/browse/SPARK-40050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > It seems we can remove Sort operator: > {code:scala} > val projectPlan = testRelation.select($"a", $"b") > val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc) > val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan) > val projectPlanB = testRelationB.select($"d") > val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", > $"d") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40050) Eliminate sort if parent is local limit
[ https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580661#comment-17580661 ] Apache Spark commented on SPARK-40050: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/37519 > Eliminate sort if parent is local limit > --- > > Key: SPARK-40050 > URL: https://issues.apache.org/jira/browse/SPARK-40050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > It seems we can remove Sort operator: > {code:scala} > val projectPlan = testRelation.select($"a", $"b") > val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc) > val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan) > val projectPlanB = testRelationB.select($"d") > val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", > $"d") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40050) Eliminate sort if parent is local limit
[ https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40050: Assignee: (was: Apache Spark) > Eliminate sort if parent is local limit > --- > > Key: SPARK-40050 > URL: https://issues.apache.org/jira/browse/SPARK-40050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > It seems we can remove Sort operator: > {code:scala} > val projectPlan = testRelation.select($"a", $"b") > val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc) > val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan) > val projectPlanB = testRelationB.select($"d") > val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", > $"d") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org