[spark] branch master updated: [SPARK-32567][SQL][FOLLOWUP] Fix CompileException when code generate for FULL OUTER shuffled hash join

2021-11-14 Thread viirya
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2011ab5  [SPARK-32567][SQL][FOLLOWUP] Fix CompileException when code 
generate for FULL OUTER shuffled hash join
2011ab5 is described below

commit 2011ab5cfb9a878714763296149c6b6f3b46c584
Author: RoryQi <1242949...@qq.com>
AuthorDate: Sun Nov 14 20:54:54 2021 -0800

[SPARK-32567][SQL][FOLLOWUP] Fix CompileException when code generate for 
FULL OUTER shuffled hash join

### What changes were proposed in this pull request?
Add a `throws java.io.IOException` clause to the generated `consumeFullOuterJoinRow` function in `ShuffledHashJoinExec`.

### Why are the changes needed?
PR #3 added code-gen for full outer shuffled hash join. Without this patch, when DataFrames are `full outer` shuffled hash joined and the result is then aggregated, the query throws a `CompileException`.
For example:
```scala
val df1 = spark.range(5).select($"id".as("k1"))
val df2 = spark.range(10).select($"id".as("k2"))
df1.join(df2.hint("SHUFFLE_HASH"), $"k1" === $"k2", 
"full_outer").count()
```
The query throws an exception like the following:
```
23:28:19.079 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 175, Column 1: Thrown exception of type "java.io.IOException" is neither 
caught by a "try...catch" block nor declared in the "throws" clause of the 
declaring function
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
175, Column 1: Thrown exception of type "java.io.IOException" is neither caught 
by a "try...catch" block nor declared in the "throws" clause of the declaring 
function
at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021)
at 
org.codehaus.janino.UnitCompiler.checkThrownException(UnitCompiler.java:9801)
at 
org.codehaus.janino.UnitCompiler.checkThrownExceptions(UnitCompiler.java:9720)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:9163)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5055)
```
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new UT.

Closes #34589 from jerqi/SPARK-32567.

Authored-by: RoryQi <1242949...@qq.com>
Signed-off-by: Liang-Chi Hsieh 
---
 .../org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala | 2 +-
 .../scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala   | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
index 7136229..9e81a6f 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
@@ -376,7 +376,7 @@ case class ShuffledHashJoinExec(
 val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
 ctx.addNewFunction(consumeFullOuterJoinRow,
   s"""
- |private void $consumeFullOuterJoinRow() {
+ |private void $consumeFullOuterJoinRow() throws java.io.IOException {
  |  ${metricTerm(ctx, "numOutputRows")}.add(1);
  |  ${consume(ctx, resultVars)}
  |}
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
index 7da813c..f483971 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
@@ -183,6 +183,7 @@ class WholeStageCodegenSuite extends QueryTest with 
SharedSparkSession
 }.size === 1)
 checkAnswer(joinUniqueDF, Seq(Row(0, 0), Row(1, 1), Row(2, 2), Row(3, 3), 
Row(4, 4),
   Row(null, 5), Row(null, 6), Row(null, 7), Row(null, 8), Row(null, 9)))
+assert(joinUniqueDF.count() === 10)
 
 // test one join with non-unique key from build side
 val joinNonUniqueDF = df1.join(df2.hint("SHUFFLE_HASH"), $"k1" === $"k2" % 
3, "full_outer")




[spark] branch master updated (bb9e1d9 -> f43d8b5)

2021-11-14 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from bb9e1d9  [SPARK-37319][K8S] Support K8s image building with Java 17
 add f43d8b5  [SPARK-36533][SS][FOLLOWUP] Address Trigger.AvailableNow in 
PySpark in SS guide doc

No new revisions were added by this update.

Summary of changes:
 docs/structured-streaming-programming-guide.md | 6 ++
 1 file changed, 6 insertions(+)




[spark] branch master updated (edbc7cf -> bb9e1d9)

2021-11-14 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from edbc7cf  [SPARK-36533][SS][FOLLOWUP] Support Trigger.AvailableNow in 
PySpark
 add bb9e1d9  [SPARK-37319][K8S] Support K8s image building with Java 17

No new revisions were added by this update.

Summary of changes:
 bin/docker-image-tool.sh  | 11 ---
 .../main/dockerfiles/spark/{Dockerfile => Dockerfile.java17}  |  9 -
 2 files changed, 12 insertions(+), 8 deletions(-)
 copy 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark/{Dockerfile => 
Dockerfile.java17} (92%)




[spark] branch master updated: [SPARK-36533][SS][FOLLOWUP] Support Trigger.AvailableNow in PySpark

2021-11-14 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new edbc7cf  [SPARK-36533][SS][FOLLOWUP] Support Trigger.AvailableNow in 
PySpark
edbc7cf is described below

commit edbc7cf9e00233b35c057c357bf1c6b99f2ba59b
Author: Jungtaek Lim 
AuthorDate: Mon Nov 15 08:59:07 2021 +0900

[SPARK-36533][SS][FOLLOWUP] Support Trigger.AvailableNow in PySpark

### What changes were proposed in this pull request?

This PR proposes to add Trigger.AvailableNow in PySpark on top of #33763.

### Why are the changes needed?

We missed adding Trigger.AvailableNow in PySpark in #33763.

### Does this PR introduce _any_ user-facing change?

Yes, Trigger.AvailableNow will be available in PySpark as well.
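
A rough usage sketch of the new option (the input path below is hypothetical; with `availableNow=True` the query processes all data available at start time, possibly in multiple batches, and then terminates):

```python
# Hypothetical input directory; illustrative only.
query = (
    spark.readStream.format("text")
    .load("/tmp/streaming-input")
    .writeStream.format("console")
    .trigger(availableNow=True)  # process everything available now, then stop
    .start()
)
query.awaitTermination()
```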

### How was this patch tested?

Added simple validation in PySpark doc. Manually tested as below:

```
>>> spark.readStream.format("text").load("/WorkArea/ScalaProjects/spark-apache/dist/inputs").writeStream.format("console").trigger(once=True).start()

-------------------------------------------
Batch: 0
-------------------------------------------
+-----+
|value|
+-----+
|    a|
|    b|
|    c|
|    d|
|    e|
+-----+

>>> spark.readStream.format("text").load("/WorkArea/ScalaProjects/spark-apache/dist/inputs").writeStream.format("console").trigger(availableNow=True).start()

>>> -------------------------------------------
Batch: 0
-------------------------------------------
+-----+
|value|
+-----+
|    a|
|    b|
|    c|
|    d|
|    e|
+-----+

>>> spark.readStream.format("text").option("maxfilespertrigger", "2").load("/WorkArea/ScalaProjects/spark-apache/dist/inputs").writeStream.format("console").trigger(availableNow=True).start()

>>> -------------------------------------------
Batch: 0
-------------------------------------------
+-----+
|value|
+-----+
|    a|
|    b|
+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+-----+
|value|
+-----+
|    c|
|    d|
+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+
|value|
+-----+
|    e|
+-----+

>>>
```

Closes #34592 from HeartSaVioR/SPARK-36533-FOLLOWUP-pyspark.

Authored-by: Jungtaek Lim 
Signed-off-by: Jungtaek Lim 
---
 python/pyspark/sql/streaming.py | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index b2d06f2..53a098c 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -1005,12 +1005,17 @@ class DataStreamWriter(object):
 def trigger(self, *, continuous: str) -> "DataStreamWriter":
 ...
 
+@overload
+def trigger(self, *, availableNow: bool) -> "DataStreamWriter":
+...
+
 def trigger(
 self,
 *,
 processingTime: Optional[str] = None,
 once: Optional[bool] = None,
 continuous: Optional[str] = None,
+availableNow: Optional[bool] = None,
 ) -> "DataStreamWriter":
 """Set the trigger for the stream query. If this is not set it will 
run the query as fast
 as possible, which is equivalent to setting the trigger to 
``processingTime='0 seconds'``.
@@ -1030,6 +1035,9 @@ class DataStreamWriter(object):
 a time interval as a string, e.g. '5 seconds', '1 minute'.
 Set a trigger that runs a continuous query with a given checkpoint
 interval. Only one trigger can be set.
+availableNow : bool, optional
+if set to True, set a trigger that processes all available data in 
multiple
+batches then terminates the query. Only one trigger can be set.
 
 Notes
 -
@@ -1043,12 +1051,14 @@ class DataStreamWriter(object):
 >>> writer = sdf.writeStream.trigger(once=True)
 >>> # trigger the query for execution every 5 seconds
 >>> writer = sdf.writeStream.trigger(continuous='5 seconds')
+>>> # trigger the query for reading all available data with multiple 
batches
+>>> writer = sdf.writeStream.trigger(availableNow=True)
 """
-params = [processingTime, once, continuous]
+params = [processingTime, once, continuous, availableNow]
 
-if params.count(None) == 3:
+if params.count(None) == 4:
 raise ValueError("No trigger provided")
-elif params.count(None) < 2:
+elif params.count(None) < 3:
 raise ValueError("Mul

[spark] branch master updated: [SPARK-37228][SQL][PYTHON] Implement DataFrame.mapInArrow in Python

2021-11-14 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 775e05f  [SPARK-37228][SQL][PYTHON] Implement DataFrame.mapInArrow in 
Python
775e05f is described below

commit 775e05f2c3c31fc203cfe4b36df301555ce73ca4
Author: Hyukjin Kwon 
AuthorDate: Mon Nov 15 08:51:23 2021 +0900

[SPARK-37228][SQL][PYTHON] Implement DataFrame.mapInArrow in Python

### What changes were proposed in this pull request?

This PR proposes to implement `DataFrame.mapInArrow`, which allows users to apply a function to an iterator of PyArrow record batches, such as:

```python
def do_something(iterator):
    for arrow_batch in iterator:
        # do something with `pyarrow.RecordBatch` and create a new `pyarrow.RecordBatch`.
        # ...
        yield arrow_batch

df.mapInArrow(do_something, df.schema).show()
```

The general idea is simple: it shares the same codebase as `DataFrame.mapInPandas` except for the pandas conversion logic.

This PR also piggy-backs:
- Removes the check in `spark.udf.register` on `SQL_MAP_PANDAS_ITER_UDF`. This type is only used internally for `DataFrame.mapInPandas`, and it cannot be registered as a SQL UDF.
- Removes the type hints for `pandas_udf` that are used for internal purposes such as `SQL_MAP_PANDAS_ITER_UDF` and `SQL_COGROUPED_MAP_PANDAS_UDF`. Neither can be used with `pandas_udf` as a SQL expression, and they should be hidden from end users.

Note that documentation will be done in another PR.

### Why are the changes needed?

For usability and technical reasons; both are elaborated in more detail in SPARK-37227.
Please also see the discussions at 
https://github.com/apache/spark/pull/26783.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new API:

```python
import pyarrow as pa

df = spark.createDataFrame(
    [(1, "foo"), (2, None), (3, "bar"), (4, "bar")], "a int, b string")

def func(iterator):
    for batch in iterator:
        # `batch` is pyarrow.RecordBatch.
        yield batch

df.mapInArrow(func, df.schema).collect()
```
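
A slightly more concrete sketch (not part of this PR; it assumes `pyarrow.compute` and that `Array.filter` / `RecordBatch.from_arrays` are available in the installed pyarrow version) showing a function that transforms the batches instead of passing them through:

```python
import pyarrow as pa
import pyarrow.compute as pc

df = spark.createDataFrame(
    [(1, "foo"), (2, None), (3, "bar"), (4, "bar")], "a int, b string")

def filter_batches(iterator):
    for batch in iterator:
        # Keep only rows whose first column ("a") is greater than 1, entirely in Arrow.
        mask = pc.greater(batch.column(0), 1)
        yield pa.RecordBatch.from_arrays(
            [column.filter(mask) for column in batch.columns],
            schema=batch.schema)

df.mapInArrow(filter_batches, df.schema).show()
```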

### How was this patch tested?

Manually tested, and unit tests were added.

Closes #34505 from HyukjinKwon/SPARK-37228.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../org/apache/spark/api/python/PythonRunner.scala |   2 +
 dev/sparktestsupport/modules.py|   1 +
 python/pyspark/rdd.py  |   1 +
 python/pyspark/rdd.pyi |   2 +
 python/pyspark/sql/pandas/_typing/__init__.pyi |   9 +-
 python/pyspark/sql/pandas/functions.py |   8 +-
 python/pyspark/sql/pandas/functions.pyi|  42 ---
 python/pyspark/sql/pandas/group_ops.py |   4 +-
 python/pyspark/sql/pandas/map_ops.py   |  68 +-
 python/pyspark/sql/pandas/serializers.py   |  45 +++
 python/pyspark/sql/tests/test_arrow_map.py | 138 +
 python/pyspark/sql/udf.py  |  12 +-
 python/pyspark/worker.py   |  30 +++--
 .../catalyst/analysis/DeduplicateRelations.scala   |   6 +
 .../plans/logical/pythonLogicalOperators.scala |  15 +++
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  14 +++
 .../spark/sql/execution/SparkStrategies.scala  |   2 +
 ...{MapInPandasExec.scala => MapInBatchExec.scala} |  26 ++--
 .../sql/execution/python/MapInPandasExec.scala |  69 +--
 .../execution/python/PythonMapInArrowExec.scala|  38 ++
 20 files changed, 385 insertions(+), 147 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
index bbe55cb..6a4871b 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
@@ -52,6 +52,7 @@ private[spark] object PythonEvalType {
   val SQL_SCALAR_PANDAS_ITER_UDF = 204
   val SQL_MAP_PANDAS_ITER_UDF = 205
   val SQL_COGROUPED_MAP_PANDAS_UDF = 206
+  val SQL_MAP_ARROW_ITER_UDF = 207
 
   def toString(pythonEvalType: Int): String = pythonEvalType match {
 case NON_UDF => "NON_UDF"
@@ -63,6 +64,7 @@ private[spark] object PythonEvalType {
 case SQL_SCALAR_PANDAS_ITER_UDF => "SQL_SCALAR_PANDAS_ITER_UDF"
 case SQL_MAP_PANDAS_ITER_UDF => "SQL_MAP_PANDAS_ITER_UDF"
 case SQL_COGROUPED_MAP_PANDAS_UDF => "SQL_COGROUPED_MAP_PANDAS_UDF"
+case SQL_MAP_ARROW_ITER_UDF => "SQL_MAP_ARROW_ITER_UDF"
   }
 }
 
diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index b87218e..7d3ebb0 100644
--- 

[spark] branch branch-3.1 updated: [SPARK-37323][INFRA][3.1] Pin `docutils` to 0.17.x

2021-11-14 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 608c19c  [SPARK-37323][INFRA][3.1] Pin `docutils` to 0.17.x
608c19c is described below

commit 608c19c1e3cd18061c129ed5eb3d5e5bd54d6d13
Author: Dongjoon Hyun 
AuthorDate: Mon Nov 15 08:48:49 2021 +0900

[SPARK-37323][INFRA][3.1] Pin `docutils` to 0.17.x

### What changes were proposed in this pull request?

This PR aims to pin `docutils` to `0.17.x` to recover the branch-3.1 GitHub Actions linter job.

### Why are the changes needed?

`docutils` 0.18 was released on October 26 and causes a Python linter failure in branch-3.1.
- https://pypi.org/project/docutils/#history
- https://github.com/apache/spark/commits/branch-3.1

```
Exception occurred:
  File 
"/__t/Python/3.6.15/x64/lib/python3.6/site-packages/docutils/writers/html5_polyglot/__init__.py",
 line 445, in section_title_tags
if (ids and self.settings.section_self_link
AttributeError: 'Values' object has no attribute 'section_self_link'
The full traceback has been saved in /tmp/sphinx-err-y2ttd83t.log, if you 
want to report the issue to the developers.
Please also report this if it was a user error, so that a better error 
message can be provided next time.
A bug report can be filed in the tracker at 
. Thanks!
make: *** [Makefile:20: html] Error 2
Error: Process completed with exit code 2.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the linter job on this PR.

Closes #34591 from dongjoon-hyun/SPARK-37323.

Authored-by: Dongjoon Hyun 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index a81e457..024c763 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -337,7 +337,7 @@ jobs:
 #   See also https://github.com/sphinx-doc/sphinx/issues/7551.
 # Jinja2 3.0.0+ causes error when building with Sphinx.
 #   See also https://issues.apache.org/jira/browse/SPARK-35375.
-python3.6 -m pip install flake8 'sphinx<3.1.0' numpy 
pydata_sphinx_theme ipython nbsphinx mypy numpydoc 'jinja2<3.0.0'
+python3.6 -m pip install flake8 'sphinx<3.1.0' numpy 
pydata_sphinx_theme ipython nbsphinx mypy numpydoc 'jinja2<3.0.0' 
'docutils<0.18'
 - name: Install R linter dependencies and SparkR
   run: |
 apt-get install -y libcurl4-openssl-dev libgit2-dev libssl-dev 
libxml2-dev




[spark] branch master updated: [SPARK-37289][SQL] Remove `partitionSchemaOption` function

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 950422f  [SPARK-37289][SQL] Remove `partitionSchemaOption` function
950422f is described below

commit 950422f271e72124ac80b35ea319c01953066585
Author: tenglei <1037910...@qq.com>
AuthorDate: Sun Nov 14 12:48:30 2021 -0800

[SPARK-37289][SQL] Remove `partitionSchemaOption` function

### What changes were proposed in this pull request?

This PR removes the unnecessary function `partitionSchemaOption` from HadoopFsRelation.scala.

### Why are the changes needed?

`partitionSchemaOption` is unnecessary in HadoopFsRelation and only makes the logic more complicated; we can simply use `partitionSchema.isEmpty` or `partitionSchema.nonEmpty` instead.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Passes existing tests.

Closes #34582 from tenglei/master.

Authored-by: tenglei <1037910...@qq.com>
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/sql/execution/DataSourceScanExec.scala | 4 ++--
 .../org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala | 3 ---
 .../spark/sql/execution/datasources/PruneFileSourcePartitions.scala   | 2 +-
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
index 6c04839..5ad1dc8 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
@@ -463,7 +463,7 @@ case class FileSourceScanExec(
   driverMetrics("staticFilesNum") = filesNum
   driverMetrics("staticFilesSize") = filesSize
 }
-if (relation.partitionSchemaOption.isDefined) {
+if (relation.partitionSchema.nonEmpty) {
   driverMetrics("numPartitions") = partitions.length
 }
   }
@@ -482,7 +482,7 @@ case class FileSourceScanExec(
   None
 }
   } ++ {
-if (relation.partitionSchemaOption.isDefined) {
+if (relation.partitionSchema.nonEmpty) {
   Map(
 "numPartitions" -> SQLMetrics.createMetric(sparkContext, "number of 
partitions read"),
 "pruningTime" ->
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
index 4ed8943..fd18240 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
@@ -57,9 +57,6 @@ case class HadoopFsRelation(
 PartitioningUtils.mergeDataAndPartitionSchema(dataSchema,
   partitionSchema, sparkSession.sessionState.conf.caseSensitiveAnalysis)
 
-  def partitionSchemaOption: Option[StructType] =
-if (partitionSchema.isEmpty) None else Some(partitionSchema)
-
   override def toString: String = {
 fileFormat match {
   case source: DataSourceRegister => source.shortName()
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
index 2e8e542..be70e18 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
@@ -62,7 +62,7 @@ private[sql] object PruneFileSourcePartitions
 _,
 _,
 _))
-if filters.nonEmpty && fsRelation.partitionSchemaOption.isDefined =>
+if filters.nonEmpty && fsRelation.partitionSchema.nonEmpty =>
   val normalizedFilters = DataSourceStrategy.normalizeExprs(
 filters.filter(f => f.deterministic && 
!SubqueryExpression.hasSubquery(f)),
 logicalRelation.output)




[spark] branch branch-3.1 updated: [SPARK-37322][TESTS] `run_scala_tests` should respect test module order

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 3d1427fb [SPARK-37322][TESTS] `run_scala_tests` should respect test 
module order
3d1427fb is described below

commit 3d1427fb16c315f2dd9bf8914f3be1be56477583
Author: William Hyun 
AuthorDate: Sun Nov 14 12:32:36 2021 -0800

[SPARK-37322][TESTS] `run_scala_tests` should respect test module order

### What changes were proposed in this pull request?
This PR aims to make `run_scala_tests` respect test module order

### Why are the changes needed?
Currently the execution order is random.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually, by running the following and checking that the catalyst module runs first.

```
$ SKIP_MIMA=1 SKIP_UNIDOC=1 ./dev/run-tests --parallelism 1 --modules 
"catalyst,hive-thriftserver"
```

Closes #34590 from williamhyun/order.

Authored-by: William Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit c2221a8e3d95ce22d76208c705179c5954318567)
Signed-off-by: Dongjoon Hyun 
---
 dev/run-tests.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/dev/run-tests.py b/dev/run-tests.py
index 37a15a7..7f7ecba 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -445,7 +445,8 @@ def run_scala_tests(build_tool, extra_profiles, 
test_modules, excluded_tags, inc
 `determine_test_suites` function"""
 set_title_and_block("Running Spark unit tests", "BLOCK_SPARK_UNIT_TESTS")
 
-test_modules = set(test_modules)
+# Remove duplicates while keeping the test module order
+test_modules = list(dict.fromkeys(test_modules))
 
 test_profiles = extra_profiles + \
 list(set(itertools.chain.from_iterable(m.build_profile_flags for m in 
test_modules)))




[spark] branch branch-3.2 updated: [SPARK-37322][TESTS] `run_scala_tests` should respect test module order

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 21d901b  [SPARK-37322][TESTS] `run_scala_tests` should respect test 
module order
21d901b is described below

commit 21d901bb362ac735bed34a0b9731a5d2c6aa20ae
Author: William Hyun 
AuthorDate: Sun Nov 14 12:32:36 2021 -0800

[SPARK-37322][TESTS] `run_scala_tests` should respect test module order

### What changes were proposed in this pull request?
This PR aims to make `run_scala_tests` respect test module order

### Why are the changes needed?
Currently the execution order is random.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually, by running the following and checking that the catalyst module runs first.

```
$ SKIP_MIMA=1 SKIP_UNIDOC=1 ./dev/run-tests --parallelism 1 --modules 
"catalyst,hive-thriftserver"
```

Closes #34590 from williamhyun/order.

Authored-by: William Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit c2221a8e3d95ce22d76208c705179c5954318567)
Signed-off-by: Dongjoon Hyun 
---
 dev/run-tests.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/dev/run-tests.py b/dev/run-tests.py
index c54042c..545905b 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -469,7 +469,8 @@ def run_scala_tests(build_tool, extra_profiles, 
test_modules, excluded_tags, inc
 `determine_test_suites` function"""
 set_title_and_block("Running Spark unit tests", "BLOCK_SPARK_UNIT_TESTS")
 
-test_modules = set(test_modules)
+# Remove duplicates while keeping the test module order
+test_modules = list(dict.fromkeys(test_modules))
 
 test_profiles = extra_profiles + \
 list(set(itertools.chain.from_iterable(m.build_profile_flags for m in 
test_modules)))




[spark] branch master updated: [SPARK-37322][TESTS] `run_scala_tests` should respect test module order

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c2221a8  [SPARK-37322][TESTS] `run_scala_tests` should respect test 
module order
c2221a8 is described below

commit c2221a8e3d95ce22d76208c705179c5954318567
Author: William Hyun 
AuthorDate: Sun Nov 14 12:32:36 2021 -0800

[SPARK-37322][TESTS] `run_scala_tests` should respect test module order

### What changes were proposed in this pull request?
This PR aims to make `run_scala_tests` respect test module order

### Why are the changes needed?
Currently the execution order is random.
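
For reference, a minimal sketch of the deduplication idiom the patch switches to (module names here are illustrative): unlike `set`, `dict.fromkeys` keeps the first-seen order.

```python
modules = ["catalyst", "hive-thriftserver", "catalyst"]

# set() removes duplicates but does not preserve the original order.
unordered = set(modules)

# dict.fromkeys() removes duplicates while keeping first-seen order (Python 3.7+).
ordered = list(dict.fromkeys(modules))
print(ordered)  # ['catalyst', 'hive-thriftserver']
```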

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually, by running the following and checking that the catalyst module runs first.

```
$ SKIP_MIMA=1 SKIP_UNIDOC=1 ./dev/run-tests --parallelism 1 --modules 
"catalyst,hive-thriftserver"
```

Closes #34590 from williamhyun/order.

Authored-by: William Hyun 
Signed-off-by: Dongjoon Hyun 
---
 dev/run-tests.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/dev/run-tests.py b/dev/run-tests.py
index 0f0759b..55c65ed 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -469,7 +469,8 @@ def run_scala_tests(build_tool, extra_profiles, 
test_modules, excluded_tags, inc
 `determine_test_suites` function"""
 set_title_and_block("Running Spark unit tests", "BLOCK_SPARK_UNIT_TESTS")
 
-test_modules = set(test_modules)
+# Remove duplicates while keeping the test module order
+test_modules = list(dict.fromkeys(test_modules))
 
 test_profiles = extra_profiles + \
 list(set(itertools.chain.from_iterable(m.build_profile_flags for m in 
test_modules)))




[spark] branch branch-3.2 updated: [SPARK-37317][MLLIB][TESTS] Reduce weights in GaussianMixtureSuite

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 45aee7a  [SPARK-37317][MLLIB][TESTS] Reduce weights in 
GaussianMixtureSuite
45aee7a is described below

commit 45aee7ad599f15006396f042cd847e3e068ab9b6
Author: Dongjoon Hyun 
AuthorDate: Sun Nov 14 08:41:48 2021 -0800

[SPARK-37317][MLLIB][TESTS] Reduce weights in GaussianMixtureSuite

### What changes were proposed in this pull request?

This PR aims to reduce the test weight `100` to `90` to improve the 
robustness of test case `GMM support instance weighting`.
```scala
test("GMM support instance weighting") {
  val gm1 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed)
  val gm2 = new 
GaussianMixture().setK(k).setMaxIter(20).setSeed(seed).setWeightCol("weight")
- Seq(1.0, 10.0, 100.0).foreach { w =>
+ Seq(1.0, 10.0, 90.0).foreach { w =>
```

### Why are the changes needed?

As mentioned in https://github.com/apache/spark/pull/26735#discussion_r352551174, the model's weights change as the instance weights grow, and the final weight `100` seems to be high enough to cause failures on some JVMs. This is observed in Java 17 M1 native mode.

```
$ java -version
openjdk version "17" 2021-09-14 LTS
OpenJDK Runtime Environment Zulu17.28+13-CA (build 17+35-LTS)
OpenJDK 64-Bit Server VM Zulu17.28+13-CA (build 17+35-LTS, mixed mode, 
sharing)
```

**BEFORE**
```
$ build/sbt "mllib/test"
...
[info] - GMM support instance weighting *** FAILED *** (1 second, 722 
milliseconds)
[info]   Expected 0.10476714410584752 and 1.209081654091291E-14 to be 
within 0.001 using absolute tolerance. (TestingUtils.scala:88)
...
[info] *** 1 TEST FAILED ***
[error] Failed: Total 1760, Failed 1, Errors 0, Passed 1759, Ignored 7
[error] Failed tests:
[error] org.apache.spark.ml.clustering.GaussianMixtureSuite
[error] (mllib / Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 625 s (10:25), completed Nov 13, 2021, 6:21:13 PM
```

**AFTER**
```
[info] Total number of tests run: 1638
[info] Suites: completed 205, aborted 0
[info] Tests: succeeded 1638, failed 0, canceled 0, ignored 7, pending 0
[info] All tests passed.
[info] Passed: Total 1760, Failed 0, Errors 0, Passed 1760, Ignored 7
[success] Total time: 568 s (09:28), completed Nov 13, 2021, 6:09:16 PM
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #34584 from dongjoon-hyun/SPARK-37317.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit a1f2ae0a637629796db95643ea7443a1c33bad41)
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala
index 0eae23d..36ed322 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala
@@ -268,7 +268,7 @@ class GaussianMixtureSuite extends MLTest with 
DefaultReadWriteTest {
 val gm1 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed)
 val gm2 = new 
GaussianMixture().setK(k).setMaxIter(20).setSeed(seed).setWeightCol("weight")
 
-Seq(1.0, 10.0, 100.0).foreach { w =>
+Seq(1.0, 10.0, 90.0).foreach { w =>
   val gmm1 = gm1.fit(dataset)
   val ds2 = dataset.select(col("features"), lit(w).as("weight"))
   val gmm2 = gm2.fit(ds2)




[spark] branch branch-3.1 updated: [SPARK-37320][K8S][TESTS] Delete py_container_checks.zip after the test in DepsTestsSuite finishes

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 97a8b10  [SPARK-37320][K8S][TESTS] Delete py_container_checks.zip 
after the test in DepsTestsSuite finishes
97a8b10 is described below

commit 97a8b102076c85f9f4bfa3f6462a20e6ca3523a3
Author: Kousuke Saruta 
AuthorDate: Sun Nov 14 08:45:20 2021 -0800

[SPARK-37320][K8S][TESTS] Delete py_container_checks.zip after the test in 
DepsTestsSuite finishes

### What changes were proposed in this pull request?

This PR fixes an issue where `py_container_checks.zip` still remains in `resource-managers/kubernetes/integration-tests/tests/` even after the test `Launcher python client dependencies using a zip file` in `DepsTestsSuite` finishes.

### Why are the changes needed?

To keep the repository clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed that the zip file will be removed after the test finishes with 
the following command using MiniKube.
```
PVC_TESTS_HOST_PATH=/path PVC_TESTS_VM_PATH=/path build/mvn 
-Dspark.kubernetes.test.namespace=default -Pkubernetes 
-Pkubernetes-integration-tests -pl 
resource-managers/kubernetes/integration-tests integration-test
```

Closes #34588 from sarutak/remove-zip-k8s.

Authored-by: Kousuke Saruta 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit d8cee85e0264fe879f9d1eeec7541a8e94ff83f6)
Signed-off-by: Dongjoon Hyun 
---
 .../k8s/integrationtest/DepsTestsSuite.scala   | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
index 9701f03..59770f2 100644
--- 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
+++ 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
@@ -16,7 +16,9 @@
  */
 package org.apache.spark.deploy.k8s.integrationtest
 
+import java.io.File
 import java.net.URL
+import java.nio.file.Files
 
 import scala.collection.JavaConverters._
 
@@ -223,14 +225,18 @@ private[spark] trait DepsTestsSuite { k8sSuite: 
KubernetesSuite =>
 val pySparkFiles = Utils.getTestFileAbsolutePath("pyfiles.py", 
sparkHomeDir)
 val inDepsFile = Utils.getTestFileAbsolutePath("py_container_checks.py", 
sparkHomeDir)
 val outDepsFile = s"${inDepsFile.substring(0, 
inDepsFile.lastIndexOf("."))}.zip"
-Utils.createZipFile(inDepsFile, outDepsFile)
-testPython(
-  pySparkFiles,
-  Seq(
-"Python runtime version check is: True",
-"Python environment version check is: True",
-"Python runtime version check for executor is: True"),
-  Some(outDepsFile))
+try {
+  Utils.createZipFile(inDepsFile, outDepsFile)
+  testPython(
+pySparkFiles,
+Seq(
+  "Python runtime version check is: True",
+  "Python environment version check is: True",
+  "Python runtime version check for executor is: True"),
+Some(outDepsFile))
+} finally {
+  Files.delete(new File(outDepsFile).toPath)
+}
   }
 
   private def testPython(




[spark] branch branch-3.2 updated: [SPARK-37320][K8S][TESTS] Delete py_container_checks.zip after the test in DepsTestsSuite finishes

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 96c17d3  [SPARK-37320][K8S][TESTS] Delete py_container_checks.zip 
after the test in DepsTestsSuite finishes
96c17d3 is described below

commit 96c17d31ff85d2e2be5ddaf51359421a6ac027a0
Author: Kousuke Saruta 
AuthorDate: Sun Nov 14 08:45:20 2021 -0800

[SPARK-37320][K8S][TESTS] Delete py_container_checks.zip after the test in 
DepsTestsSuite finishes

### What changes were proposed in this pull request?

This PR fixes an issue where `py_container_checks.zip` still remains in `resource-managers/kubernetes/integration-tests/tests/` even after the test `Launcher python client dependencies using a zip file` in `DepsTestsSuite` finishes.

### Why are the changes needed?

To keep the repository clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed that the zip file will be removed after the test finishes with 
the following command using MiniKube.
```
PVC_TESTS_HOST_PATH=/path PVC_TESTS_VM_PATH=/path build/mvn 
-Dspark.kubernetes.test.namespace=default -Pkubernetes 
-Pkubernetes-integration-tests -pl 
resource-managers/kubernetes/integration-tests integration-test
```

Closes #34588 from sarutak/remove-zip-k8s.

Authored-by: Kousuke Saruta 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit d8cee85e0264fe879f9d1eeec7541a8e94ff83f6)
Signed-off-by: Dongjoon Hyun 
---
 .../k8s/integrationtest/DepsTestsSuite.scala   | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
index 812fd21..3f3c4ef 100644
--- 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
+++ 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
@@ -16,7 +16,9 @@
  */
 package org.apache.spark.deploy.k8s.integrationtest
 
+import java.io.File
 import java.net.URL
+import java.nio.file.Files
 
 import scala.collection.JavaConverters._
 
@@ -223,14 +225,18 @@ private[spark] trait DepsTestsSuite { k8sSuite: 
KubernetesSuite =>
 val pySparkFiles = Utils.getTestFileAbsolutePath("pyfiles.py", 
sparkHomeDir)
 val inDepsFile = Utils.getTestFileAbsolutePath("py_container_checks.py", 
sparkHomeDir)
 val outDepsFile = s"${inDepsFile.substring(0, 
inDepsFile.lastIndexOf("."))}.zip"
-Utils.createZipFile(inDepsFile, outDepsFile)
-testPython(
-  pySparkFiles,
-  Seq(
-"Python runtime version check is: True",
-"Python environment version check is: True",
-"Python runtime version check for executor is: True"),
-  Some(outDepsFile))
+try {
+  Utils.createZipFile(inDepsFile, outDepsFile)
+  testPython(
+pySparkFiles,
+Seq(
+  "Python runtime version check is: True",
+  "Python environment version check is: True",
+  "Python runtime version check for executor is: True"),
+Some(outDepsFile))
+} finally {
+  Files.delete(new File(outDepsFile).toPath)
+}
   }
 
   private def testPython(




[spark] branch master updated: [SPARK-37320][K8S][TESTS] Delete py_container_checks.zip after the test in DepsTestsSuite finishes

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d8cee85  [SPARK-37320][K8S][TESTS] Delete py_container_checks.zip 
after the test in DepsTestsSuite finishes
d8cee85 is described below

commit d8cee85e0264fe879f9d1eeec7541a8e94ff83f6
Author: Kousuke Saruta 
AuthorDate: Sun Nov 14 08:45:20 2021 -0800

[SPARK-37320][K8S][TESTS] Delete py_container_checks.zip after the test in 
DepsTestsSuite finishes

### What changes were proposed in this pull request?

This PR fixes an issue where `py_container_checks.zip` still remains in `resource-managers/kubernetes/integration-tests/tests/` even after the test `Launcher python client dependencies using a zip file` in `DepsTestsSuite` finishes.

### Why are the changes needed?

To keep the repository clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed that the zip file will be removed after the test finishes with 
the following command using MiniKube.
```
PVC_TESTS_HOST_PATH=/path PVC_TESTS_VM_PATH=/path build/mvn 
-Dspark.kubernetes.test.namespace=default -Pkubernetes 
-Pkubernetes-integration-tests -pl 
resource-managers/kubernetes/integration-tests integration-test
```

Closes #34588 from sarutak/remove-zip-k8s.

Authored-by: Kousuke Saruta 
Signed-off-by: Dongjoon Hyun 
---
 .../k8s/integrationtest/DepsTestsSuite.scala   | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
index 812fd21..3f3c4ef 100644
--- 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
+++ 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
@@ -16,7 +16,9 @@
  */
 package org.apache.spark.deploy.k8s.integrationtest
 
+import java.io.File
 import java.net.URL
+import java.nio.file.Files
 
 import scala.collection.JavaConverters._
 
@@ -223,14 +225,18 @@ private[spark] trait DepsTestsSuite { k8sSuite: 
KubernetesSuite =>
 val pySparkFiles = Utils.getTestFileAbsolutePath("pyfiles.py", 
sparkHomeDir)
 val inDepsFile = Utils.getTestFileAbsolutePath("py_container_checks.py", 
sparkHomeDir)
 val outDepsFile = s"${inDepsFile.substring(0, 
inDepsFile.lastIndexOf("."))}.zip"
-Utils.createZipFile(inDepsFile, outDepsFile)
-testPython(
-  pySparkFiles,
-  Seq(
-"Python runtime version check is: True",
-"Python environment version check is: True",
-"Python runtime version check for executor is: True"),
-  Some(outDepsFile))
+try {
+  Utils.createZipFile(inDepsFile, outDepsFile)
+  testPython(
+pySparkFiles,
+Seq(
+  "Python runtime version check is: True",
+  "Python environment version check is: True",
+  "Python runtime version check for executor is: True"),
+Some(outDepsFile))
+} finally {
+  Files.delete(new File(outDepsFile).toPath)
+}
   }
 
   private def testPython(




[spark] branch master updated: [SPARK-37317][MLLIB][TESTS] Reduce weights in GaussianMixtureSuite

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a1f2ae0  [SPARK-37317][MLLIB][TESTS] Reduce weights in 
GaussianMixtureSuite
a1f2ae0 is described below

commit a1f2ae0a637629796db95643ea7443a1c33bad41
Author: Dongjoon Hyun 
AuthorDate: Sun Nov 14 08:41:48 2021 -0800

[SPARK-37317][MLLIB][TESTS] Reduce weights in GaussianMixtureSuite

### What changes were proposed in this pull request?

This PR aims to reduce the test weight `100` to `90` to improve the 
robustness of test case `GMM support instance weighting`.
```scala
test("GMM support instance weighting") {
  val gm1 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed)
  val gm2 = new 
GaussianMixture().setK(k).setMaxIter(20).setSeed(seed).setWeightCol("weight")
- Seq(1.0, 10.0, 100.0).foreach { w =>
+ Seq(1.0, 10.0, 90.0).foreach { w =>
```

### Why are the changes needed?

As mentioned in https://github.com/apache/spark/pull/26735#discussion_r352551174, the model's weights change as the instance weights grow, and the final weight `100` seems to be high enough to cause failures on some JVMs. This is observed in Java 17 M1 native mode.

```
$ java -version
openjdk version "17" 2021-09-14 LTS
OpenJDK Runtime Environment Zulu17.28+13-CA (build 17+35-LTS)
OpenJDK 64-Bit Server VM Zulu17.28+13-CA (build 17+35-LTS, mixed mode, 
sharing)
```

**BEFORE**
```
$ build/sbt "mllib/test"
...
[info] - GMM support instance weighting *** FAILED *** (1 second, 722 
milliseconds)
[info]   Expected 0.10476714410584752 and 1.209081654091291E-14 to be 
within 0.001 using absolute tolerance. (TestingUtils.scala:88)
...
[info] *** 1 TEST FAILED ***
[error] Failed: Total 1760, Failed 1, Errors 0, Passed 1759, Ignored 7
[error] Failed tests:
[error] org.apache.spark.ml.clustering.GaussianMixtureSuite
[error] (mllib / Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 625 s (10:25), completed Nov 13, 2021, 6:21:13 PM
```

**AFTER**
```
[info] Total number of tests run: 1638
[info] Suites: completed 205, aborted 0
[info] Tests: succeeded 1638, failed 0, canceled 0, ignored 7, pending 0
[info] All tests passed.
[info] Passed: Total 1760, Failed 0, Errors 0, Passed 1760, Ignored 7
[success] Total time: 568 s (09:28), completed Nov 13, 2021, 6:09:16 PM
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #34584 from dongjoon-hyun/SPARK-37317.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala
index 0eae23d..36ed322 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala
@@ -268,7 +268,7 @@ class GaussianMixtureSuite extends MLTest with 
DefaultReadWriteTest {
 val gm1 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed)
 val gm2 = new 
GaussianMixture().setK(k).setMaxIter(20).setSeed(seed).setWeightCol("weight")
 
-Seq(1.0, 10.0, 100.0).foreach { w =>
+Seq(1.0, 10.0, 90.0).foreach { w =>
   val gmm1 = gm1.fit(dataset)
   val ds2 = dataset.select(col("features"), lit(w).as("weight"))
   val gmm2 = gm2.fit(ds2)




[spark] branch branch-3.2 updated: [SPARK-37318][CORE][TESTS] Make `FallbackStorageSuite` robust in terms of DNS

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new e0a0d4c  [SPARK-37318][CORE][TESTS] Make `FallbackStorageSuite` robust 
in terms of DNS
e0a0d4c is described below

commit e0a0d4c34c8bacc9c45993435ffbbc6be333cc6f
Author: Dongjoon Hyun 
AuthorDate: Sun Nov 14 01:02:12 2021 -0800

[SPARK-37318][CORE][TESTS] Make `FallbackStorageSuite` robust in terms of 
DNS

### What changes were proposed in this pull request?

This PR aims to make `FallbackStorageSuite` robust in terms of DNS.

### Why are the changes needed?

The test case expects the hostname not to exist, and in a normal environment it doesn't:
```
$ nslookup remote
Server: 8.8.8.8
Address:8.8.8.8#53

** server can't find remote: NXDOMAIN

$ ping remote
ping: cannot resolve remote: Unknown host
```

However, in some DNS environments all hostnames, including non-existent ones, seem to resolve as if they existed.

```
$ nslookup remote
Server: 172.16.0.1
Address:172.16.0.1#53

Non-authoritative answer:
Name:   remote
Address: 23.217.138.110

$ ping remote
PING remote (23.217.138.110): 56 data bytes
64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms

$ build/sbt "core/testOnly *.FallbackStorageSuite"
...
[info] Run completed in 2 minutes, 31 seconds.
[info] Total number of tests run: 9
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 3, failed 6, canceled 0, ignored 0, pending 0
[info] *** 6 TESTS FAILED ***
[error] Failed tests:
[error] org.apache.spark.storage.FallbackStorageSuite
[error] (core / Test / testOnly) sbt.TestsFailedException: Tests 
unsuccessful
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
$ build/sbt "core/testOnly *.FallbackStorageSuite"
...
[info] Run completed in 3 seconds, 322 milliseconds.
[info] Total number of tests run: 3
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 3, failed 0, canceled 6, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 22 s, completed Nov 13, 2021 7:11:31 PM
```

Closes #34585 from dongjoon-hyun/SPARK-37318.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 41f9df92061ab96ce7729f0e2a107a3569046c58)
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/storage/FallbackStorageSuite.scala | 8 
 1 file changed, 8 insertions(+)

diff --git 
a/core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala 
b/core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala
index 88197b6..7d648c9 100644
--- a/core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala
+++ b/core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala
@@ -17,6 +17,7 @@
 package org.apache.spark.storage
 
 import java.io.{DataOutputStream, File, FileOutputStream, IOException}
+import java.net.{InetAddress, UnknownHostException}
 import java.nio.file.Files
 
 import scala.concurrent.duration._
@@ -41,6 +42,13 @@ import org.apache.spark.util.Utils.tryWithResource
 class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext {
 
   def getSparkConf(initialExecutor: Int = 1, minExecutor: Int = 1): SparkConf 
= {
+// Some DNS always replies for all hostnames including unknown host names
+try {
+  InetAddress.getByName(FallbackStorage.FALLBACK_BLOCK_MANAGER_ID.host)
+  assume(false)
+} catch {
+  case _: UnknownHostException =>
+}
 new SparkConf(false)
   .setAppName(getClass.getName)
   .set(SPARK_MASTER, s"local-cluster[$initialExecutor,1,1024]")




[spark] branch master updated: [SPARK-37318][CORE][TESTS] Make `FallbackStorageSuite` robust in terms of DNS

2021-11-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 41f9df9  [SPARK-37318][CORE][TESTS] Make `FallbackStorageSuite` robust 
in terms of DNS
41f9df9 is described below

commit 41f9df92061ab96ce7729f0e2a107a3569046c58
Author: Dongjoon Hyun 
AuthorDate: Sun Nov 14 01:02:12 2021 -0800

[SPARK-37318][CORE][TESTS] Make `FallbackStorageSuite` robust in terms of 
DNS

### What changes were proposed in this pull request?

This PR aims to make `FallbackStorageSuite` robust in terms of DNS.

### Why are the changes needed?

The test case expects the hostname not to exist, and in a normal environment it doesn't:
```
$ nslookup remote
Server: 8.8.8.8
Address:8.8.8.8#53

** server can't find remote: NXDOMAIN

$ ping remote
ping: cannot resolve remote: Unknown host
```

However, in some DNS environments all hostnames, including non-existent ones, seem to resolve as if they existed.

```
$ nslookup remote
Server: 172.16.0.1
Address:172.16.0.1#53

Non-authoritative answer:
Name:   remote
Address: 23.217.138.110

$ ping remote
PING remote (23.217.138.110): 56 data bytes
64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms

$ build/sbt "core/testOnly *.FallbackStorageSuite"
...
[info] Run completed in 2 minutes, 31 seconds.
[info] Total number of tests run: 9
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 3, failed 6, canceled 0, ignored 0, pending 0
[info] *** 6 TESTS FAILED ***
[error] Failed tests:
[error] org.apache.spark.storage.FallbackStorageSuite
[error] (core / Test / testOnly) sbt.TestsFailedException: Tests 
unsuccessful
```
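
The same environment check, sketched in Python purely for illustration (the hostname `remote` mirrors `FallbackStorage.FALLBACK_BLOCK_MANAGER_ID.host`; the actual fix is the Scala `assume` shown in the diff below):

```python
import socket

def dns_resolves_unknown_hosts(host: str = "remote") -> bool:
    # True when the local DNS resolves even non-existent names,
    # i.e. the environment in which the suite now skips itself.
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False
```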

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
$ build/sbt "core/testOnly *.FallbackStorageSuite"
...
[info] Run completed in 3 seconds, 322 milliseconds.
[info] Total number of tests run: 3
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 3, failed 0, canceled 6, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 22 s, completed Nov 13, 2021 7:11:31 PM
```

Closes #34585 from dongjoon-hyun/SPARK-37318.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/storage/FallbackStorageSuite.scala | 8 
 1 file changed, 8 insertions(+)

diff --git 
a/core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala 
b/core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala
index 88197b6..7d648c9 100644
--- a/core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala
+++ b/core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala
@@ -17,6 +17,7 @@
 package org.apache.spark.storage
 
 import java.io.{DataOutputStream, File, FileOutputStream, IOException}
+import java.net.{InetAddress, UnknownHostException}
 import java.nio.file.Files
 
 import scala.concurrent.duration._
@@ -41,6 +42,13 @@ import org.apache.spark.util.Utils.tryWithResource
 class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext {
 
   def getSparkConf(initialExecutor: Int = 1, minExecutor: Int = 1): SparkConf 
= {
+// Some DNS always replies for all hostnames including unknown host names
+try {
+  InetAddress.getByName(FallbackStorage.FALLBACK_BLOCK_MANAGER_ID.host)
+  assume(false)
+} catch {
+  case _: UnknownHostException =>
+}
 new SparkConf(false)
   .setAppName(getClass.getName)
   .set(SPARK_MASTER, s"local-cluster[$initialExecutor,1,1024]")
