[spark] branch master updated (406455d -> a85c51f)
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 406455d Revert "[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series" add a85c51f [SPARK-37354][K8S][TESTS] Make the Java version installed on the container image used by the K8s integration tests with SBT configurable No new revisions were added by this update. Summary of changes: project/SparkBuild.scala | 1 + 1 file changed, 1 insertion(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: Revert "[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 406455d Revert "[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series" 406455d is described below commit 406455d79f787486f9e6fab1dce0d9a2645b8d14 Author: Hyukjin Kwon AuthorDate: Mon Nov 22 11:33:05 2021 +0900 Revert "[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series" This reverts commit 4529dba42769f13ab8cfbb9798a5f82eaaf17b34. --- python/pyspark/pandas/data_type_ops/num_ops.py | 23 + .../pandas/tests/data_type_ops/test_num_ops.py | 56 ++ .../pandas/tests/data_type_ops/testing_utils.py| 50 --- 3 files changed, 80 insertions(+), 49 deletions(-) diff --git a/python/pyspark/pandas/data_type_ops/num_ops.py b/python/pyspark/pandas/data_type_ops/num_ops.py index e08d6e9..3e74664 100644 --- a/python/pyspark/pandas/data_type_ops/num_ops.py +++ b/python/pyspark/pandas/data_type_ops/num_ops.py @@ -55,7 +55,7 @@ def _non_fractional_astype( elif isinstance(spark_type, BooleanType): return _as_bool_type(index_ops, dtype) elif isinstance(spark_type, StringType): -return _as_string_type(index_ops, dtype, null_str="NaN") +return _as_string_type(index_ops, dtype, null_str=str(np.nan)) else: return _as_other_type(index_ops, dtype, spark_type) @@ -447,29 +447,10 @@ class DecimalOps(FractionalOps): return index_ops.copy() def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> IndexOpsLike: +# TODO(SPARK-36230): check index_ops.hasnans after fixing SPARK-36230 dtype, spark_type = pandas_on_spark_type(dtype) -if is_integer_dtype(dtype) and not isinstance(dtype, extension_dtypes): -if index_ops.hasnans: -raise ValueError( -"Cannot convert %s with missing values to integer" % self.pretty_name -) return _non_fractional_astype(index_ops, dtype, spark_type) -def rpow(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex: -_sanitize_list_like(right) -if not isinstance(right, numbers.Number): -raise TypeError("Exponentiation can not be applied to given types.") - -def rpow_func(left: Column, right: Any) -> Column: -return ( -F.when(left.isNull(), np.nan) -.when(SF.lit(right == 1), right) -.otherwise(Column.__rpow__(left, right)) -) - -right = transform_boolean_operand_to_numeric(right) -return column_op(rpow_func)(left, right) - class IntegralExtensionOps(IntegralOps): """ diff --git a/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py b/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py index f4b36f9..4d1fb23 100644 --- a/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py +++ b/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py @@ -172,13 +172,11 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils): pdf, psdf = self.pdf, self.psdf for col in self.numeric_df_cols: pser, psser = pdf[col], psdf[col] -if col in ["float", "float_w_nan"]: +if col == "float": self.assert_eq(pser ** pser, psser ** psser) self.assert_eq(pser ** pser.astype(bool), psser ** psser.astype(bool)) self.assert_eq(pser ** True, psser ** True) self.assert_eq(pser ** False, psser ** False) -self.assert_eq(pser ** 1, psser ** 1) -self.assert_eq(pser ** 0, psser ** 0) for n_col in self.non_numeric_df_cols: if n_col == "bool": @@ -186,6 +184,18 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils): else: self.assertRaises(TypeError, lambda: psser ** psdf[n_col]) +# TODO(SPARK-36031): Merge test_pow_with_nan into test_pow +def test_pow_with_float_nan(self): +for col in self.numeric_w_nan_df_cols: +if col == "float_w_nan": +pser, psser = self.numeric_w_nan_pdf[col], self.numeric_w_nan_psdf[col] +self.assert_eq(pser ** pser, psser ** psser) +self.assert_eq(pser ** pser.astype(bool), psser ** psser.astype(bool)) +self.assert_eq(pser ** True, psser ** True) +self.assert_eq(pser ** False, psser ** False) +self.assert_eq(pser ** 1, psser ** 1) +self.assert_eq(pser ** 0, psser ** 0) + def test_radd(self): pdf, psdf = self.pdf, self.psdf for col in self.numeric_df_cols: @@ -334,36 +344,40 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils):
[spark] branch branch-3.0 updated: [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 7ba340c [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` 7ba340c is described below commit 7ba340c952fec83f59b6d0111c849ed8afbe99f1 Author: yangjie01 AuthorDate: Sun Nov 21 19:11:40 2021 -0600 [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` ### What changes were proposed in this pull request? `YarnShuffleIntegrationSuite`, `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` will failed when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`, the fail reason is `java.lang.NoClassDefFoundError: breeze/linalg/Matrix`. The above UTS can succeed when using `hadoop-2.7` profile without `assembly/target/scala-%s/jars` because `KryoSerializer.loadableSparkClasses` can workaroud when `Utils.isTesting` is true, but `Utils.isTesting` is false when using `hadoop-3.2` profile. After investigated, I found that when `hadoop-2.7` profile is used, `SPARK_TESTING` will be propagated to AM and Executor, but when `hadoop-3.2` profile is used, `SPARK_TESTING` will not be propagated to AM and Executor. In order to ensure the consistent behavior of using `hadoop-2.7` and ``hadoop-3.2``, this pr change to manually propagate `SPARK_TESTING` environment variable if it exists to ensure `Utils.isTesting` is true in above test scenario. ### Why are the changes needed? Ensure `YarnShuffleIntegrationSuite` releated UTs can succeed when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test `YarnShuffleIntegrationSuite`. `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` can be verified in the same way. Please ensure that the `assembly/target/scala-%s/jars` directory does not exist before executing the test command, we can clean up the whole project by executing follow command or clone a new local code repo. 1. run with `hadoop-3.2` profile ``` mvn clean install -Phadoop-3.2 -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite ``` **Before** ``` YarnShuffleIntegrationSuite: - external shuffle service *** FAILED *** FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:227) Run completed in 48 seconds, 137 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0 *** 1 TEST FAILED *** ``` Error stack as follows: ``` 21/11/20 23:00:09.682 main ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:216) at org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537) at scala.collection.immutable.List.flatMap(List.scala:366) at org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535) at org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502) at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226) at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102) at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48) at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109) at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346) at org.apache.spark.serializer.KryoSerializationStream.(KryoSerializer.scala:266) at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432) at org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76) at
[spark] branch branch-3.1 updated: [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new a1851cb [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` a1851cb is described below commit a1851cb8087eb3acd6b1d894148babe54eb3aa53 Author: yangjie01 AuthorDate: Sun Nov 21 19:11:40 2021 -0600 [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` ### What changes were proposed in this pull request? `YarnShuffleIntegrationSuite`, `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` will failed when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`, the fail reason is `java.lang.NoClassDefFoundError: breeze/linalg/Matrix`. The above UTS can succeed when using `hadoop-2.7` profile without `assembly/target/scala-%s/jars` because `KryoSerializer.loadableSparkClasses` can workaroud when `Utils.isTesting` is true, but `Utils.isTesting` is false when using `hadoop-3.2` profile. After investigated, I found that when `hadoop-2.7` profile is used, `SPARK_TESTING` will be propagated to AM and Executor, but when `hadoop-3.2` profile is used, `SPARK_TESTING` will not be propagated to AM and Executor. In order to ensure the consistent behavior of using `hadoop-2.7` and ``hadoop-3.2``, this pr change to manually propagate `SPARK_TESTING` environment variable if it exists to ensure `Utils.isTesting` is true in above test scenario. ### Why are the changes needed? Ensure `YarnShuffleIntegrationSuite` releated UTs can succeed when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test `YarnShuffleIntegrationSuite`. `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` can be verified in the same way. Please ensure that the `assembly/target/scala-%s/jars` directory does not exist before executing the test command, we can clean up the whole project by executing follow command or clone a new local code repo. 1. run with `hadoop-3.2` profile ``` mvn clean install -Phadoop-3.2 -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite ``` **Before** ``` YarnShuffleIntegrationSuite: - external shuffle service *** FAILED *** FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:227) Run completed in 48 seconds, 137 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0 *** 1 TEST FAILED *** ``` Error stack as follows: ``` 21/11/20 23:00:09.682 main ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:216) at org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537) at scala.collection.immutable.List.flatMap(List.scala:366) at org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535) at org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502) at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226) at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102) at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48) at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109) at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346) at org.apache.spark.serializer.KryoSerializationStream.(KryoSerializer.scala:266) at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432) at org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76) at
[spark] branch branch-3.2 updated: [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new b27be1f [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` b27be1f is described below commit b27be1fe956f4ce7bef6a0d96e7b4402c4998887 Author: yangjie01 AuthorDate: Sun Nov 21 19:11:40 2021 -0600 [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` ### What changes were proposed in this pull request? `YarnShuffleIntegrationSuite`, `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` will failed when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`, the fail reason is `java.lang.NoClassDefFoundError: breeze/linalg/Matrix`. The above UTS can succeed when using `hadoop-2.7` profile without `assembly/target/scala-%s/jars` because `KryoSerializer.loadableSparkClasses` can workaroud when `Utils.isTesting` is true, but `Utils.isTesting` is false when using `hadoop-3.2` profile. After investigated, I found that when `hadoop-2.7` profile is used, `SPARK_TESTING` will be propagated to AM and Executor, but when `hadoop-3.2` profile is used, `SPARK_TESTING` will not be propagated to AM and Executor. In order to ensure the consistent behavior of using `hadoop-2.7` and ``hadoop-3.2``, this pr change to manually propagate `SPARK_TESTING` environment variable if it exists to ensure `Utils.isTesting` is true in above test scenario. ### Why are the changes needed? Ensure `YarnShuffleIntegrationSuite` releated UTs can succeed when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test `YarnShuffleIntegrationSuite`. `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` can be verified in the same way. Please ensure that the `assembly/target/scala-%s/jars` directory does not exist before executing the test command, we can clean up the whole project by executing follow command or clone a new local code repo. 1. run with `hadoop-3.2` profile ``` mvn clean install -Phadoop-3.2 -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite ``` **Before** ``` YarnShuffleIntegrationSuite: - external shuffle service *** FAILED *** FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:227) Run completed in 48 seconds, 137 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0 *** 1 TEST FAILED *** ``` Error stack as follows: ``` 21/11/20 23:00:09.682 main ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:216) at org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537) at scala.collection.immutable.List.flatMap(List.scala:366) at org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535) at org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502) at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226) at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102) at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48) at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109) at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346) at org.apache.spark.serializer.KryoSerializationStream.(KryoSerializer.scala:266) at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432) at org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76) at
[spark] branch master updated: [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a7b3fc7 [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` a7b3fc7 is described below commit a7b3fc7cef4c5df0254b945fe9f6815b072b31dd Author: yangjie01 AuthorDate: Sun Nov 21 19:11:40 2021 -0600 [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` ### What changes were proposed in this pull request? `YarnShuffleIntegrationSuite`, `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` will failed when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`, the fail reason is `java.lang.NoClassDefFoundError: breeze/linalg/Matrix`. The above UTS can succeed when using `hadoop-2.7` profile without `assembly/target/scala-%s/jars` because `KryoSerializer.loadableSparkClasses` can workaroud when `Utils.isTesting` is true, but `Utils.isTesting` is false when using `hadoop-3.2` profile. After investigated, I found that when `hadoop-2.7` profile is used, `SPARK_TESTING` will be propagated to AM and Executor, but when `hadoop-3.2` profile is used, `SPARK_TESTING` will not be propagated to AM and Executor. In order to ensure the consistent behavior of using `hadoop-2.7` and ``hadoop-3.2``, this pr change to manually propagate `SPARK_TESTING` environment variable if it exists to ensure `Utils.isTesting` is true in above test scenario. ### Why are the changes needed? Ensure `YarnShuffleIntegrationSuite` releated UTs can succeed when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test `YarnShuffleIntegrationSuite`. `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` can be verified in the same way. Please ensure that the `assembly/target/scala-%s/jars` directory does not exist before executing the test command, we can clean up the whole project by executing follow command or clone a new local code repo. 1. run with `hadoop-3.2` profile ``` mvn clean install -Phadoop-3.2 -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite ``` **Before** ``` YarnShuffleIntegrationSuite: - external shuffle service *** FAILED *** FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:227) Run completed in 48 seconds, 137 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0 *** 1 TEST FAILED *** ``` Error stack as follows: ``` 21/11/20 23:00:09.682 main ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:216) at org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537) at scala.collection.immutable.List.flatMap(List.scala:366) at org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535) at org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502) at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226) at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102) at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48) at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109) at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346) at org.apache.spark.serializer.KryoSerializationStream.(KryoSerializer.scala:266) at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432) at org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76) at
[spark] branch master updated: [SPARK-37104][PYTHON] Make RDD and DStream covariant
This is an automated email from the ASF dual-hosted git repository. zero323 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ef4f254 [SPARK-37104][PYTHON] Make RDD and DStream covariant ef4f254 is described below commit ef4f2546c58ef5fe67be7047f9aa2a793519fd54 Author: zero323 AuthorDate: Sun Nov 21 16:25:57 2021 +0100 [SPARK-37104][PYTHON] Make RDD and DStream covariant ### What changes were proposed in this pull request? This PR changes changes `RDD[~T]` and `DStream[~T]` to `RDD[+T]` and `DStream[+T]` respectively. ### Why are the changes needed? To improve usability of the current annotations and simplify further development of type hints. Let's take simple `RDD` to `DataFrame` as an example. Currently, the following code will not type check ```python from pyspark import SparkContext from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() sc = spark.sparkContext rdd = sc.parallelize([(1, 2)]) reveal_type(rdd) spark.createDataFrame(rdd) ``` with ``` main.py:8: note: Revealed type is "pyspark.rdd.RDD[Tuple[builtins.int, builtins.int]]" main.py:10: error: Argument 1 to "createDataFrame" of "SparkSession" has incompatible type "RDD[Tuple[int, int]]"; expected "Union[RDD[Tuple[Any, ...]], Iterable[Tuple[Any, ...]]]" Found 1 error in 1 file (checked 1 source file) ``` To type check, `rdd` would have to be annotated with specific type, matching the signature of the `createDataFrame` method: ```python rdd: RDD[Tuple[Any, ...]] = sc.parallelize([(1, 2)]) ``` Alternatively, one could inline definition: ```python spark.createDataFrame(sc.parallelize([(1, 2)])) ``` Similarly, with `pyspark.mllib`: ```python from pyspark import SparkContext from pyspark.sql import SparkSession from pyspark.mllib.clustering import KMeans from pyspark.mllib.linalg import SparseVector, Vectors spark = SparkSession.builder.getOrCreate() sc = spark.sparkContext rdd = sc.parallelize([ Vectors.sparse(10, [1, 3, 5], [1, 1, 1]), Vectors.sparse(10, [2, 4, 6], [1, 1, 1]), ]) KMeans.train(rdd, 2) ``` we'd get ``` main.py:14: error: Argument 1 to "train" of "KMeans" has incompatible type "RDD[SparseVector]"; expected "RDD[Union[ndarray[Any, Any], Vector, List[float], Tuple[float, ...]]]" Found 1 error in 1 file (checked 1 source file) ``` but this time, we'd need much more complex annotation (inlining would work as well): ```python rdd: RDD[Union[ndarray[Any, Any], Vector, List[float], Tuple[float, ...]]] = sc.parallelize([ Vectors.sparse(10, [1, 3, 5], [1, 1, 1]), Vectors.sparse(10, [2, 4, 6], [1, 1, 1]), ]) ``` This happens because - RDD is invariant in terms of stored type. - mypy doesn't look forward to infer types of objects depending on the usage context (similarly to Scala console / spark-shell, but unlike standalone Scala compiler, which allows us to have [examples like this](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala)) It not only makes things verbose, but also fragile and dependent on details of implementation. In the first example, where we have top level `Union`, we can just use `RDD[...]` and ignore other members. In the second case, where `Union` is a type parameter we have to match all its components (it could be simpler if we didn't use `RDD[VectorLike]` but defined something like `RDD[ndarray] | RDD[Vector] | RDD[List[float]] | RDD[Tuple[float, ...]]]`, which should make it closer to the first case, though not semantically equivalent to the current signature). Theoretically, we could partially address this with different definitions of aliases, like using type bounds (see discussion under #34354), but it doesn't scale well and requires same steps to be taken by every library that depends on PySpark. See also related discussion about Scala counterpart ‒ SPARK-1296 ### Does this PR introduce _any_ user-facing change? Type hints only. Users will be able to use both subclasses of `RDD` / `DStream` in certain contexts, without explicit annotations or casts (both examples will pass type checker in their original form). ### How was this patch tested? Existing tests and not released data tests (SPARK-36989). Closes #34374 from zero323/SPARK-37104. Authored-by: zero323 Signed-off-by: zero323 --- python/pyspark/_typing.pyi | 4 +- python/pyspark/rdd.pyi | 107 ++-