[spark] branch master updated (406455d -> a85c51f)

2021-11-21 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 406455d  Revert "[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series"
 add a85c51f  [SPARK-37354][K8S][TESTS] Make the Java version installed on the container image used by the SBT K8s integration tests configurable

No new revisions were added by this update.

Summary of changes:
 project/SparkBuild.scala | 1 +
 1 file changed, 1 insertion(+)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: Revert "[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series"

2021-11-21 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 406455d  Revert "[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series"
406455d is described below

commit 406455d79f787486f9e6fab1dce0d9a2645b8d14
Author: Hyukjin Kwon 
AuthorDate: Mon Nov 22 11:33:05 2021 +0900

Revert "[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series"

This reverts commit 4529dba42769f13ab8cfbb9798a5f82eaaf17b34.
---
 python/pyspark/pandas/data_type_ops/num_ops.py | 23 +
 .../pandas/tests/data_type_ops/test_num_ops.py | 56 ++
 .../pandas/tests/data_type_ops/testing_utils.py| 50 ---
 3 files changed, 80 insertions(+), 49 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/num_ops.py b/python/pyspark/pandas/data_type_ops/num_ops.py
index e08d6e9..3e74664 100644
--- a/python/pyspark/pandas/data_type_ops/num_ops.py
+++ b/python/pyspark/pandas/data_type_ops/num_ops.py
@@ -55,7 +55,7 @@ def _non_fractional_astype(
     elif isinstance(spark_type, BooleanType):
         return _as_bool_type(index_ops, dtype)
     elif isinstance(spark_type, StringType):
-        return _as_string_type(index_ops, dtype, null_str="NaN")
+        return _as_string_type(index_ops, dtype, null_str=str(np.nan))
     else:
         return _as_other_type(index_ops, dtype, spark_type)

@@ -447,29 +447,10 @@ class DecimalOps(FractionalOps):
         return index_ops.copy()

     def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> IndexOpsLike:
+        # TODO(SPARK-36230): check index_ops.hasnans after fixing SPARK-36230
         dtype, spark_type = pandas_on_spark_type(dtype)
-        if is_integer_dtype(dtype) and not isinstance(dtype, extension_dtypes):
-            if index_ops.hasnans:
-                raise ValueError(
-                    "Cannot convert %s with missing values to integer" % self.pretty_name
-                )
         return _non_fractional_astype(index_ops, dtype, spark_type)

-    def rpow(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-        _sanitize_list_like(right)
-        if not isinstance(right, numbers.Number):
-            raise TypeError("Exponentiation can not be applied to given types.")
-
-        def rpow_func(left: Column, right: Any) -> Column:
-            return (
-                F.when(left.isNull(), np.nan)
-                .when(SF.lit(right == 1), right)
-                .otherwise(Column.__rpow__(left, right))
-            )
-
-        right = transform_boolean_operand_to_numeric(right)
-        return column_op(rpow_func)(left, right)
-
 
 class IntegralExtensionOps(IntegralOps):
 """
diff --git a/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py b/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
index f4b36f9..4d1fb23 100644
--- a/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
+++ b/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
@@ -172,13 +172,11 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils):
         pdf, psdf = self.pdf, self.psdf
         for col in self.numeric_df_cols:
             pser, psser = pdf[col], psdf[col]
-            if col in ["float", "float_w_nan"]:
+            if col == "float":
                 self.assert_eq(pser ** pser, psser ** psser)
                 self.assert_eq(pser ** pser.astype(bool), psser ** psser.astype(bool))
                 self.assert_eq(pser ** True, psser ** True)
                 self.assert_eq(pser ** False, psser ** False)
-                self.assert_eq(pser ** 1, psser ** 1)
-                self.assert_eq(pser ** 0, psser ** 0)

         for n_col in self.non_numeric_df_cols:
             if n_col == "bool":
@@ -186,6 +184,18 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils):
             else:
                 self.assertRaises(TypeError, lambda: psser ** psdf[n_col])

+    # TODO(SPARK-36031): Merge test_pow_with_nan into test_pow
+    def test_pow_with_float_nan(self):
+        for col in self.numeric_w_nan_df_cols:
+            if col == "float_w_nan":
+                pser, psser = self.numeric_w_nan_pdf[col], self.numeric_w_nan_psdf[col]
+                self.assert_eq(pser ** pser, psser ** psser)
+                self.assert_eq(pser ** pser.astype(bool), psser ** psser.astype(bool))
+                self.assert_eq(pser ** True, psser ** True)
+                self.assert_eq(pser ** False, psser ** False)
+                self.assert_eq(pser ** 1, psser ** 1)
+                self.assert_eq(pser ** 0, psser ** 0)
+
     def test_radd(self):
         pdf, psdf = self.pdf, self.psdf
         for col in self.numeric_df_cols:
@@ -334,36 +344,40 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils):

[spark] branch branch-3.0 updated: [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`

2021-11-21 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 7ba340c  [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`
7ba340c is described below

commit 7ba340c952fec83f59b6d0111c849ed8afbe99f1
Author: yangjie01 
AuthorDate: Sun Nov 21 19:11:40 2021 -0600

[SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`

### What changes were proposed in this pull request?
`YarnShuffleIntegrationSuite`, `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` fail when using the `hadoop-3.2` profile without `assembly/target/scala-%s/jars`; the failure is `java.lang.NoClassDefFoundError: breeze/linalg/Matrix`.

The above UTs can succeed when using the `hadoop-2.7` profile without `assembly/target/scala-%s/jars` because `KryoSerializer.loadableSparkClasses` can work around the missing classes when `Utils.isTesting` is true, but `Utils.isTesting` is false when the `hadoop-3.2` profile is used.

After investigating, I found that when the `hadoop-2.7` profile is used, `SPARK_TESTING` is propagated to the AM and Executors, but when the `hadoop-3.2` profile is used, it is not.

To ensure consistent behavior between `hadoop-2.7` and `hadoop-3.2`, this PR manually propagates the `SPARK_TESTING` environment variable, if it is set, so that `Utils.isTesting` is true in the above test scenario.

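The patch itself lives in the YARN module and is not included in this digest. As a rough sketch of the same idea from the submitting side, one could forward the variable through public Spark configuration (the `spark.yarn.appMasterEnv.*` and `spark.executorEnv.*` keys are documented Spark settings; the app name and overall structure below are purely illustrative):

```python
import os

from pyspark.sql import SparkSession

# Sketch only: if SPARK_TESTING is set in the submitting environment, forward it
# to the YARN ApplicationMaster and to every executor so that Utils.isTesting
# also evaluates to true on the cluster side.
builder = SparkSession.builder.master("yarn").appName("spark-testing-propagation-sketch")

testing = os.environ.get("SPARK_TESTING")
if testing is not None:
    # spark.yarn.appMasterEnv.<VAR> sets an environment variable on the AM process;
    # spark.executorEnv.<VAR> sets it on each executor.
    builder = (
        builder.config("spark.yarn.appMasterEnv.SPARK_TESTING", testing)
        .config("spark.executorEnv.SPARK_TESTING", testing)
    )

spark = builder.getOrCreate()
```
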
### Why are the changes needed?
Ensure the `YarnShuffleIntegrationSuite`-related UTs can succeed when using the `hadoop-3.2` profile without `assembly/target/scala-%s/jars`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Actions checks.

- Manually test `YarnShuffleIntegrationSuite`; `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` can be verified in the same way.

Please ensure that the `assembly/target/scala-%s/jars` directory does not exist before executing the test command; you can clean up the whole project by running the command below, or clone a fresh local repo.

1. Run with the `hadoop-3.2` profile

```
mvn clean install -Phadoop-3.2 -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite
```

**Before**

```
YarnShuffleIntegrationSuite:
- external shuffle service *** FAILED ***
  FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:227)
Run completed in 48 seconds, 137 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
```

The error stack is as follows:

```
21/11/20 23:00:09.682 main ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
    at org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537)
    at scala.collection.immutable.List.flatMap(List.scala:366)
    at org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535)
    at org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502)
    at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226)
    at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
    at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
    at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
    at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
    at org.apache.spark.serializer.KryoSerializationStream.<init>(KryoSerializer.scala:266)
    at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432)
    at org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76)
at 

[spark] branch branch-3.1 updated: [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`

2021-11-21 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new a1851cb  [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`
a1851cb is described below

commit a1851cb8087eb3acd6b1d894148babe54eb3aa53
Author: yangjie01 
AuthorDate: Sun Nov 21 19:11:40 2021 -0600

[SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`

### What changes were proposed in this pull request?
`YarnShuffleIntegrationSuite`, `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` fail when using the `hadoop-3.2` profile without `assembly/target/scala-%s/jars`; the failure is `java.lang.NoClassDefFoundError: breeze/linalg/Matrix`.

The above UTs can succeed when using the `hadoop-2.7` profile without `assembly/target/scala-%s/jars` because `KryoSerializer.loadableSparkClasses` can work around the missing classes when `Utils.isTesting` is true, but `Utils.isTesting` is false when the `hadoop-3.2` profile is used.

After investigating, I found that when the `hadoop-2.7` profile is used, `SPARK_TESTING` is propagated to the AM and Executors, but when the `hadoop-3.2` profile is used, it is not.

To ensure consistent behavior between `hadoop-2.7` and `hadoop-3.2`, this PR manually propagates the `SPARK_TESTING` environment variable, if it is set, so that `Utils.isTesting` is true in the above test scenario.

### Why are the changes needed?
Ensure the `YarnShuffleIntegrationSuite`-related UTs can succeed when using the `hadoop-3.2` profile without `assembly/target/scala-%s/jars`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Actions checks.

- Manually test `YarnShuffleIntegrationSuite`; `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` can be verified in the same way.

Please ensure that the `assembly/target/scala-%s/jars` directory does not exist before executing the test command; you can clean up the whole project by running the command below, or clone a fresh local repo.

1. Run with the `hadoop-3.2` profile

```
mvn clean install -Phadoop-3.2 -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite
```

**Before**

```
YarnShuffleIntegrationSuite:
- external shuffle service *** FAILED ***
  FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:227)
Run completed in 48 seconds, 137 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
```

The error stack is as follows:

```
21/11/20 23:00:09.682 main ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
    at org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537)
    at scala.collection.immutable.List.flatMap(List.scala:366)
    at org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535)
    at org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502)
    at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226)
    at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
    at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
    at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
    at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
    at org.apache.spark.serializer.KryoSerializationStream.<init>(KryoSerializer.scala:266)
    at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432)
    at org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76)
at 

[spark] branch branch-3.2 updated: [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`

2021-11-21 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new b27be1f  [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`
b27be1f is described below

commit b27be1fe956f4ce7bef6a0d96e7b4402c4998887
Author: yangjie01 
AuthorDate: Sun Nov 21 19:11:40 2021 -0600

[SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`

### What changes were proposed in this pull request?
`YarnShuffleIntegrationSuite`, `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` fail when using the `hadoop-3.2` profile without `assembly/target/scala-%s/jars`; the failure is `java.lang.NoClassDefFoundError: breeze/linalg/Matrix`.

The above UTs can succeed when using the `hadoop-2.7` profile without `assembly/target/scala-%s/jars` because `KryoSerializer.loadableSparkClasses` can work around the missing classes when `Utils.isTesting` is true, but `Utils.isTesting` is false when the `hadoop-3.2` profile is used.

After investigating, I found that when the `hadoop-2.7` profile is used, `SPARK_TESTING` is propagated to the AM and Executors, but when the `hadoop-3.2` profile is used, it is not.

To ensure consistent behavior between `hadoop-2.7` and `hadoop-3.2`, this PR manually propagates the `SPARK_TESTING` environment variable, if it is set, so that `Utils.isTesting` is true in the above test scenario.

### Why are the changes needed?
Ensure the `YarnShuffleIntegrationSuite`-related UTs can succeed when using the `hadoop-3.2` profile without `assembly/target/scala-%s/jars`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Actions checks.

- Manually test `YarnShuffleIntegrationSuite`; `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` can be verified in the same way.

Please ensure that the `assembly/target/scala-%s/jars` directory does not exist before executing the test command; you can clean up the whole project by running the command below, or clone a fresh local repo.

1. Run with the `hadoop-3.2` profile

```
mvn clean install -Phadoop-3.2 -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite
```

**Before**

```
YarnShuffleIntegrationSuite:
- external shuffle service *** FAILED ***
  FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:227)
Run completed in 48 seconds, 137 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
```

The error stack is as follows:

```
21/11/20 23:00:09.682 main ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
    at org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537)
    at scala.collection.immutable.List.flatMap(List.scala:366)
    at org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535)
    at org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502)
    at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226)
    at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
    at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
    at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
    at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
    at org.apache.spark.serializer.KryoSerializationStream.<init>(KryoSerializer.scala:266)
    at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432)
    at org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76)
at 

[spark] branch master updated: [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`

2021-11-21 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a7b3fc7  [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`
a7b3fc7 is described below

commit a7b3fc7cef4c5df0254b945fe9f6815b072b31dd
Author: yangjie01 
AuthorDate: Sun Nov 21 19:11:40 2021 -0600

[SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` related UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`

### What changes were proposed in this pull request?
`YarnShuffleIntegrationSuite`, `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` fail when using the `hadoop-3.2` profile without `assembly/target/scala-%s/jars`; the failure is `java.lang.NoClassDefFoundError: breeze/linalg/Matrix`.

The above UTs can succeed when using the `hadoop-2.7` profile without `assembly/target/scala-%s/jars` because `KryoSerializer.loadableSparkClasses` can work around the missing classes when `Utils.isTesting` is true, but `Utils.isTesting` is false when the `hadoop-3.2` profile is used.

After investigating, I found that when the `hadoop-2.7` profile is used, `SPARK_TESTING` is propagated to the AM and Executors, but when the `hadoop-3.2` profile is used, it is not.

To ensure consistent behavior between `hadoop-2.7` and `hadoop-3.2`, this PR manually propagates the `SPARK_TESTING` environment variable, if it is set, so that `Utils.isTesting` is true in the above test scenario.

### Why are the changes needed?
Ensure the `YarnShuffleIntegrationSuite`-related UTs can succeed when using the `hadoop-3.2` profile without `assembly/target/scala-%s/jars`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Actions checks.

- Manually test `YarnShuffleIntegrationSuite`; `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` can be verified in the same way.

Please ensure that the `assembly/target/scala-%s/jars` directory does not exist before executing the test command; you can clean up the whole project by running the command below, or clone a fresh local repo.

1. Run with the `hadoop-3.2` profile

```
mvn clean install -Phadoop-3.2 -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite
```

**Before**

```
YarnShuffleIntegrationSuite:
- external shuffle service *** FAILED ***
  FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:227)
Run completed in 48 seconds, 137 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
```

The error stack is as follows:

```
21/11/20 23:00:09.682 main ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
    at org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537)
    at scala.collection.immutable.List.flatMap(List.scala:366)
    at org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535)
    at org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502)
    at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226)
    at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
    at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
    at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
    at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
    at org.apache.spark.serializer.KryoSerializationStream.<init>(KryoSerializer.scala:266)
    at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432)
    at org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76)
at 

[spark] branch master updated: [SPARK-37104][PYTHON] Make RDD and DStream covariant

2021-11-21 Thread zero323
This is an automated email from the ASF dual-hosted git repository.

zero323 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ef4f254  [SPARK-37104][PYTHON] Make RDD and DStream covariant
ef4f254 is described below

commit ef4f2546c58ef5fe67be7047f9aa2a793519fd54
Author: zero323 
AuthorDate: Sun Nov 21 16:25:57 2021 +0100

[SPARK-37104][PYTHON] Make RDD and DStream covariant

### What changes were proposed in this pull request?

This PR changes `RDD[~T]` and `DStream[~T]` to `RDD[+T]` and `DStream[+T]`, respectively.

### Why are the changes needed?

To improve the usability of the current annotations and to simplify further development of the type hints. Let's take a simple `RDD`-to-`DataFrame` conversion as an example. Currently, the following code will not type check

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(1, 2)])
reveal_type(rdd)

spark.createDataFrame(rdd)
```

with

```
main.py:8: note: Revealed type is "pyspark.rdd.RDD[Tuple[builtins.int, builtins.int]]"
main.py:10: error: Argument 1 to "createDataFrame" of "SparkSession" has incompatible type "RDD[Tuple[int, int]]"; expected "Union[RDD[Tuple[Any, ...]], Iterable[Tuple[Any, ...]]]"
Found 1 error in 1 file (checked 1 source file)
```

To type check, `rdd` would have to be annotated with a specific type matching the signature of the `createDataFrame` method:

```python
rdd: RDD[Tuple[Any, ...]] = sc.parallelize([(1, 2)])
```

Alternatively, one could inline the definition:

```python
spark.createDataFrame(sc.parallelize([(1, 2)]))
```

Similarly, with `pyspark.mllib`:

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import SparseVector, Vectors

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([
Vectors.sparse(10, [1, 3, 5], [1, 1, 1]),
Vectors.sparse(10, [2, 4, 6], [1, 1, 1]),
])

KMeans.train(rdd, 2)
```

we'd get

```
main.py:14: error: Argument 1 to "train" of "KMeans" has incompatible type "RDD[SparseVector]"; expected "RDD[Union[ndarray[Any, Any], Vector, List[float], Tuple[float, ...]]]"
Found 1 error in 1 file (checked 1 source file)
```

but this time, we'd need a much more complex annotation (inlining would work as well):

```python
rdd: RDD[Union[ndarray[Any, Any], Vector, List[float], Tuple[float, ...]]] = sc.parallelize([
Vectors.sparse(10, [1, 3, 5], [1, 1, 1]),
Vectors.sparse(10, [2, 4, 6], [1, 1, 1]),
])
```

This happens because

- RDD is invariant in terms of stored type.
- mypy doesn't look forward to infer types of objects depending on the 
usage context (similarly to Scala console / spark-shell, but unlike standalone 
Scala compiler, which allows us to have [examples like 
this](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala))

This not only makes things verbose, but also fragile and dependent on implementation details. In the first example, where we have a top-level `Union`, we can just use `RDD[...]` and ignore the other members.

In the second case, where `Union` is a type parameter, we have to match all of its components (it could be simpler if we didn't use `RDD[VectorLike]` but defined something like `RDD[ndarray] | RDD[Vector] | RDD[List[float]] | RDD[Tuple[float, ...]]`, which would make it closer to the first case, though not semantically equivalent to the current signature).

Theoretically, we could partially address this with different definitions of the aliases, for example using type bounds (see the discussion under #34354), but it doesn't scale well and requires the same steps to be taken by every library that depends on PySpark.

See also the related discussion about the Scala counterpart ‒ SPARK-1296.

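For reference, here is a minimal, self-contained sketch of what an invariant versus a covariant type variable means to mypy; the class and function names below are illustrative stand-ins, not the actual PySpark stub definitions:

```python
from typing import Generic, TypeVar

T = TypeVar("T")                        # invariant, as in the old RDD[~T] stubs
T_co = TypeVar("T_co", covariant=True)  # covariant, as in the new RDD[+T] stubs

class InvariantRDD(Generic[T]):         # illustrative stand-in only
    ...

class CovariantRDD(Generic[T_co]):      # illustrative stand-in only
    ...

def consume_invariant(rdd: InvariantRDD[object]) -> None: ...
def consume_covariant(rdd: CovariantRDD[object]) -> None: ...

ints_invariant: InvariantRDD[int] = InvariantRDD()
ints_covariant: CovariantRDD[int] = CovariantRDD()

consume_invariant(ints_invariant)  # mypy error: InvariantRDD[int] is not InvariantRDD[object]
consume_covariant(ints_covariant)  # OK: covariance makes CovariantRDD[int] a subtype of CovariantRDD[object]
```
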
### Does this PR introduce _any_ user-facing change?

Type hints only.

Users will be able to use subclasses of both `RDD` and `DStream` in certain contexts without explicit annotations or casts (both examples above will pass the type checker in their original form).

### How was this patch tested?

Existing tests, plus the not-yet-released data tests (SPARK-36989).

Closes #34374 from zero323/SPARK-37104.

Authored-by: zero323 
Signed-off-by: zero323 
---
 python/pyspark/_typing.pyi   |   4 +-
 python/pyspark/rdd.pyi   | 107 ++-