[spark] branch master updated: [SPARK-38153][BUILD] Remove option newlines.topLevelStatements in scalafmt.conf
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 69c2c34  [SPARK-38153][BUILD] Remove option newlines.topLevelStatements in scalafmt.conf

69c2c34 is described below

commit 69c2c34e80a21927c3ca3664c477748ff47a3804
Author: Gengliang Wang
AuthorDate: Tue Feb 8 23:01:24 2022 -0800

    [SPARK-38153][BUILD] Remove option newlines.topLevelStatements in scalafmt.conf

    ### What changes were proposed in this pull request?

    Remove option newlines.topLevelStatements in scalafmt.conf

    ### Why are the changes needed?

    The configuration
    ```
    newlines.topLevelStatements = [before,after]
    ```
    is to add a blank line before the first member or after the last member of the class. This is neither encouraged nor discouraged as per https://github.com/databricks/scala-style-guide#blanklines

    **Without the conf, scalafmt will still add blank lines between consecutive members (or initializers) of a class.**

    As I tried running the script `./dev/scalafmt`, I saw unnecessary blank lines
    ![image](https://user-images.githubusercontent.com/1097932/153122925-3238f15c-312b-4973-8e2d-92978bf5c6ad.png)

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Manual test

    Closes #35455 from gengliangwang/removeToplevelLine.

    Authored-by: Gengliang Wang
    Signed-off-by: Dongjoon Hyun
---
 dev/.scalafmt.conf | 1 -
 1 file changed, 1 deletion(-)

diff --git a/dev/.scalafmt.conf b/dev/.scalafmt.conf
index 9598540..d2196e6 100644
--- a/dev/.scalafmt.conf
+++ b/dev/.scalafmt.conf
@@ -25,4 +25,3 @@ optIn = {
 danglingParentheses = false
 docstrings = JavaDoc
 maxColumn = 98
-newlines.topLevelStatements = [before,after]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (fdda0eb -> 8a559b3)
dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from fdda0eb  [SPARK-38147][BUILD][MLLIB] Upgrade `shapeless` to 2.3.7
 add 8a559b3  [SPARK-38149][BUILD] Upgrade joda-time to 2.10.13

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml                               | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
[spark] branch master updated: [SPARK-38147][BUILD][MLLIB] Upgrade `shapeless` to 2.3.7
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new fdda0eb  [SPARK-38147][BUILD][MLLIB] Upgrade `shapeless` to 2.3.7

fdda0eb is described below

commit fdda0ebc752e6c8d9d71a6a83454c9c072c6d743
Author: Dongjoon Hyun
AuthorDate: Tue Feb 8 20:56:11 2022 -0800

    [SPARK-38147][BUILD][MLLIB] Upgrade `shapeless` to 2.3.7

    ### What changes were proposed in this pull request?

    This PR aims to upgrade `shapeless` to 2.3.7.

    ### Why are the changes needed?

    This will bring the latest bug fixes.
    - https://github.com/milessabin/shapeless/releases/tag/v2.3.7 (Released on May 16, 2021)
    - https://github.com/milessabin/shapeless/releases/tag/v2.3.6 (This is recommended to skip.)
    - https://github.com/milessabin/shapeless/releases/tag/v2.3.5 (This is recommended to skip.)
    - https://github.com/milessabin/shapeless/releases/tag/v2.3.4

    ### Does this PR introduce _any_ user-facing change?

    No. `v2.3.7` is backward binary-compatible with `v2.3.3`.
    - https://github.com/milessabin/shapeless/releases/tag/v2.3.7

    ### How was this patch tested?

    Pass the CIs.

    Closes #35450 from dongjoon-hyun/SPARK-38147.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 dev/deps/spark-deps-hadoop-2-hive-2.3 | 3 +--
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 3 +--
 pom.xml                               | 5 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 b/dev/deps/spark-deps-hadoop-2-hive-2.3
index 8284237..63e3f87 100644
--- a/dev/deps/spark-deps-hadoop-2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2-hive-2.3
@@ -193,7 +193,6 @@ log4j-core/2.17.1//log4j-core-2.17.1.jar
 log4j-slf4j-impl/2.17.1//log4j-slf4j-impl-2.17.1.jar
 logging-interceptor/3.12.12//logging-interceptor-3.12.12.jar
 lz4-java/1.8.0//lz4-java-1.8.0.jar
-macro-compat_2.12/1.1.1//macro-compat_2.12-1.1.1.jar
 mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
 metrics-core/4.2.7//metrics-core-4.2.7.jar
 metrics-graphite/4.2.7//metrics-graphite-4.2.7.jar
@@ -243,7 +242,7 @@ scala-library/2.12.15//scala-library-2.12.15.jar
 scala-parser-combinators_2.12/1.1.2//scala-parser-combinators_2.12-1.1.2.jar
 scala-reflect/2.12.15//scala-reflect-2.12.15.jar
 scala-xml_2.12/1.2.0//scala-xml_2.12-1.2.0.jar
-shapeless_2.12/2.3.3//shapeless_2.12-2.3.3.jar
+shapeless_2.12/2.3.7//shapeless_2.12-2.3.7.jar
 shims/0.9.23//shims-0.9.23.jar
 slf4j-api/1.7.32//slf4j-api-1.7.32.jar
 snakeyaml/1.28//snakeyaml-1.28.jar
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index f169277..88cd560 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -179,7 +179,6 @@ log4j-core/2.17.1//log4j-core-2.17.1.jar
 log4j-slf4j-impl/2.17.1//log4j-slf4j-impl-2.17.1.jar
 logging-interceptor/3.12.12//logging-interceptor-3.12.12.jar
 lz4-java/1.8.0//lz4-java-1.8.0.jar
-macro-compat_2.12/1.1.1//macro-compat_2.12-1.1.1.jar
 mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
 metrics-core/4.2.7//metrics-core-4.2.7.jar
 metrics-graphite/4.2.7//metrics-graphite-4.2.7.jar
@@ -229,7 +228,7 @@ scala-library/2.12.15//scala-library-2.12.15.jar
 scala-parser-combinators_2.12/1.1.2//scala-parser-combinators_2.12-1.1.2.jar
 scala-reflect/2.12.15//scala-reflect-2.12.15.jar
 scala-xml_2.12/1.2.0//scala-xml_2.12-1.2.0.jar
-shapeless_2.12/2.3.3//shapeless_2.12-2.3.3.jar
+shapeless_2.12/2.3.7//shapeless_2.12-2.3.7.jar
 shims/0.9.23//shims-0.9.23.jar
 slf4j-api/1.7.32//slf4j-api-1.7.32.jar
 snakeyaml/1.28//snakeyaml-1.28.jar
diff --git a/pom.xml b/pom.xml
index f8f13fc..496c370 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1027,6 +1027,11 @@
+      com.chuusai
+      shapeless_${scala.binary.version}
+      2.3.7
+
+
       org.json4s
       json4s-jackson_${scala.binary.version}
       3.7.0-M11
[spark] branch master updated (0af4fc8 -> de5e45a)
dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 0af4fc8  [SPARK-37414][PYTHON][ML] Inline type hints for pyspark.ml.tuning
 add de5e45a  [SPARK-38086][SQL] Make ArrowColumnVector Extendable

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/vectorized/ArrowColumnVector.java | 59 +-
 .../sql/vectorized/ArrowColumnVectorSuite.scala | 31 
 2 files changed, 65 insertions(+), 25 deletions(-)
[spark] branch branch-3.1 updated (1aa9ef0 -> 66e73c4)
dongjoon pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 1aa9ef0  [SPARK-38144][CORE] Remove unused `spark.storage.safetyFraction` config
 add 66e73c4  [SPARK-38030][SQL][3.1] Canonicalization should not remove nullability of AttributeReference dataType

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/Canonicalize.scala  |  7 +++
 .../catalyst/expressions/CanonicalizeSuite.scala | 15 +-
 .../adaptive/AdaptiveQueryExecSuite.scala        | 24 --
 3 files changed, 39 insertions(+), 7 deletions(-)
[spark] branch branch-3.2 updated (ba3c8c5 -> d62735d)
dongjoon pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.

from ba3c8c5  [SPARK-38144][CORE] Remove unused `spark.storage.safetyFraction` config
 add d62735d  [SPARK-38030][SQL][3.2] Canonicalization should not remove nullability of AttributeReference dataType

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/expressions/Canonicalize.scala |  7 +++
 .../sql/catalyst/expressions/CanonicalizeSuite.scala  | 15 ++-
 2 files changed, 17 insertions(+), 5 deletions(-)
[spark] branch master updated: [SPARK-37414][PYTHON][ML] Inline type hints for pyspark.ml.tuning
zero323 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0af4fc8  [SPARK-37414][PYTHON][ML] Inline type hints for pyspark.ml.tuning

0af4fc8 is described below

commit 0af4fc8ea837d9d7f7f023e2eb3b9807fb073db1
Author: zero323
AuthorDate: Wed Feb 9 03:16:20 2022 +0100

    [SPARK-37414][PYTHON][ML] Inline type hints for pyspark.ml.tuning

    ### What changes were proposed in this pull request?

    This PR migrates `pyspark.ml.tuning` type annotations from stub file to inline type hints.

    ### Why are the changes needed?

    Part of ongoing migration of type hints.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Existing tests.

    Closes #35406 from zero323/SPARK-37414.

    Authored-by: zero323
    Signed-off-by: zero323
---
 python/pyspark/ml/base.py           |   3 +-
 python/pyspark/ml/param/__init__.py |   2 +-
 python/pyspark/ml/tuning.py         | 476 ++--
 python/pyspark/ml/tuning.pyi        | 228 -
 python/pyspark/ml/util.py           |   7 +-
 5 files changed, 302 insertions(+), 414 deletions(-)

diff --git a/python/pyspark/ml/base.py b/python/pyspark/ml/base.py
index 9e8252d..20540eb 100644
--- a/python/pyspark/ml/base.py
+++ b/python/pyspark/ml/base.py
@@ -24,7 +24,6 @@ from typing import (
     Any,
     Callable,
     Generic,
-    Iterable,
     Iterator,
     List,
     Optional,
@@ -133,7 +132,7 @@ class Estimator(Params, Generic[M], metaclass=ABCMeta):
     def fitMultiple(
         self, dataset: DataFrame, paramMaps: Sequence["ParamMap"]
-    ) -> Iterable[Tuple[int, M]]:
+    ) -> Iterator[Tuple[int, M]]:
         """
         Fits a model to the input dataset for each param map in `paramMaps`.
diff --git a/python/pyspark/ml/param/__init__.py b/python/pyspark/ml/param/__init__.py
index ee2c289..6c223c6 100644
--- a/python/pyspark/ml/param/__init__.py
+++ b/python/pyspark/ml/param/__init__.py
@@ -569,7 +569,7 @@ class Params(Identifiable, metaclass=ABCMeta):
         to._set(**{param.name: paramMap[param]})
         return to

-    def _resetUid(self: "P", newUid: Any) -> "P":
+    def _resetUid(self: P, newUid: Any) -> P:
         """
         Changes the uid of this instance. This updates both
         the stored uid and the parent uid of params and param maps.
diff --git a/python/pyspark/ml/tuning.py b/python/pyspark/ml/tuning.py
index 47805c9..9fae5fe 100644
--- a/python/pyspark/ml/tuning.py
+++ b/python/pyspark/ml/tuning.py
@@ -20,12 +20,28 @@ import sys
 import itertools
 from multiprocessing.pool import ThreadPool

+from typing import (
+    Any,
+    Callable,
+    Dict,
+    Iterable,
+    List,
+    Optional,
+    Sequence,
+    Tuple,
+    Type,
+    Union,
+    cast,
+    overload,
+    TYPE_CHECKING,
+)
+
 import numpy as np

 from pyspark import keyword_only, since, SparkContext, inheritable_thread_target
 from pyspark.ml import Estimator, Transformer, Model
 from pyspark.ml.common import inherit_doc, _py2java, _java2py
-from pyspark.ml.evaluation import Evaluator
+from pyspark.ml.evaluation import Evaluator, JavaEvaluator
 from pyspark.ml.param import Params, Param, TypeConverters
 from pyspark.ml.param.shared import HasCollectSubModels, HasParallelism, HasSeed
 from pyspark.ml.util import (
@@ -43,6 +59,13 @@ from pyspark.ml.wrapper import JavaParams, JavaEstimator, JavaWrapper
 from pyspark.sql.functions import col, lit, rand, UserDefinedFunction
 from pyspark.sql.types import BooleanType

+from pyspark.sql.dataframe import DataFrame
+
+if TYPE_CHECKING:
+    from pyspark.ml._typing import ParamMap
+    from py4j.java_gateway import JavaObject  # type: ignore[import]
+    from py4j.java_collections import JavaArray  # type: ignore[import]
+

 __all__ = [
     "ParamGridBuilder",
     "CrossValidator",
@@ -52,7 +75,14 @@ __all__ = [
 ]


-def _parallelFitTasks(est, train, eva, validation, epm, collectSubModel):
+def _parallelFitTasks(
+    est: Estimator,
+    train: DataFrame,
+    eva: Evaluator,
+    validation: DataFrame,
+    epm: Sequence["ParamMap"],
+    collectSubModel: bool,
+) -> List[Callable[[], Tuple[int, float, Transformer]]]:
     """
     Creates a list of callables which can be called from different threads to fit and evaluate
     an estimator in parallel. Each callable returns an `(index, metric)` pair.
@@ -79,7 +109,7 @@ def _parallelFitTasks(est, train, eva, validation, epm, collectSubModel):
     """
     modelIter = est.fitMultiple(train, epm)

-    def singleTask():
+    def singleTask() -> Tuple[int, float, Transformer]:
         index, model = next(modelIter)
         # TODO: duplicate evaluator to take extra params from input
         # Note: Supporting tuning params in
[spark] branch master updated: [SPARK-37404][PYTHON][ML] Inline type hints for pyspark.ml.evaluation.py
zero323 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2a9416a  [SPARK-37404][PYTHON][ML] Inline type hints for pyspark.ml.evaluation.py

2a9416a is described below

commit 2a9416abd5c8d83901c09a08433bf91e649dc43f
Author: zero323
AuthorDate: Wed Feb 9 01:32:57 2022 +0100

    [SPARK-37404][PYTHON][ML] Inline type hints for pyspark.ml.evaluation.py

    ### What changes were proposed in this pull request?

    This PR migrates `pyspark.ml.evaluation` type annotations from stub file to inline type hints.

    ### Why are the changes needed?

    Part of ongoing migration of type hints.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Existing tests.

    Closes #35403 from zero323/SPARK-37404.

    Authored-by: zero323
    Signed-off-by: zero323
---
 python/pyspark/ml/_typing.pyi                      |   2 +
 python/pyspark/ml/evaluation.py                    | 349 -
 python/pyspark/ml/evaluation.pyi                   | 277 
 python/pyspark/ml/tests/typing/test_evaluation.yml |   2 +
 4 files changed, 207 insertions(+), 423 deletions(-)

diff --git a/python/pyspark/ml/_typing.pyi b/python/pyspark/ml/_typing.pyi
index 7862078..12d831f 100644
--- a/python/pyspark/ml/_typing.pyi
+++ b/python/pyspark/ml/_typing.pyi
@@ -71,6 +71,8 @@ MultilabelClassificationEvaluatorMetricType = Union[
     Literal["microF1Measure"],
 ]
 ClusteringEvaluatorMetricType = Literal["silhouette"]
+ClusteringEvaluatorDistanceMeasureType = Union[Literal["squaredEuclidean"], Literal["cosine"]]
+
 RankingEvaluatorMetricType = Union[
     Literal["meanAveragePrecision"],
     Literal["meanAveragePrecisionAtK"],
diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py
index be63a8f..ff0e5b9 100644
--- a/python/pyspark/ml/evaluation.py
+++ b/python/pyspark/ml/evaluation.py
@@ -18,6 +18,8 @@ import sys
 from abc import abstractmethod, ABCMeta

+from typing import Any, Dict, Optional, TYPE_CHECKING
+
 from pyspark import since, keyword_only
 from pyspark.ml.wrapper import JavaParams
 from pyspark.ml.param import Param, Params, TypeConverters
@@ -31,6 +33,20 @@ from pyspark.ml.param.shared import (
 )
 from pyspark.ml.common import inherit_doc
 from pyspark.ml.util import JavaMLReadable, JavaMLWritable
+from pyspark.sql.dataframe import DataFrame
+
+if TYPE_CHECKING:
+    from pyspark.ml._typing import (
+        ParamMap,
+        BinaryClassificationEvaluatorMetricType,
+        ClusteringEvaluatorDistanceMeasureType,
+        ClusteringEvaluatorMetricType,
+        MulticlassClassificationEvaluatorMetricType,
+        MultilabelClassificationEvaluatorMetricType,
+        RankingEvaluatorMetricType,
+        RegressionEvaluatorMetricType,
+    )
+

 __all__ = [
     "Evaluator",
@@ -54,7 +70,7 @@ class Evaluator(Params, metaclass=ABCMeta):
         pass

     @abstractmethod
-    def _evaluate(self, dataset):
+    def _evaluate(self, dataset: DataFrame) -> float:
         """
         Evaluates the output.
@@ -70,7 +86,7 @@ class Evaluator(Params, metaclass=ABCMeta):
         """
         raise NotImplementedError()

-    def evaluate(self, dataset, params=None):
+    def evaluate(self, dataset: DataFrame, params: Optional["ParamMap"] = None) -> float:
         """
         Evaluates the output with optional parameters.
@@ -99,7 +115,7 @@ class Evaluator(Params, metaclass=ABCMeta):
             raise TypeError("Params must be a param map but got %s." % type(params))

     @since("1.5.0")
-    def isLargerBetter(self):
+    def isLargerBetter(self) -> bool:
         """
         Indicates whether the metric returned by :py:meth:`evaluate` should be maximized
         (True, default) or minimized (False).
@@ -115,7 +131,7 @@ class JavaEvaluator(JavaParams, Evaluator, metaclass=ABCMeta):
     implementations.
     """

-    def _evaluate(self, dataset):
+    def _evaluate(self, dataset: DataFrame) -> float:
         """
         Evaluates the output.
@@ -130,16 +146,23 @@ class JavaEvaluator(JavaParams, Evaluator, metaclass=ABCMeta):
             evaluation metric
         """
         self._transfer_params_to_java()
+        assert self._java_obj is not None
         return self._java_obj.evaluate(dataset._jdf)

-    def isLargerBetter(self):
+    def isLargerBetter(self) -> bool:
         self._transfer_params_to_java()
+        assert self._java_obj is not None
         return self._java_obj.isLargerBetter()


 @inherit_doc
 class BinaryClassificationEvaluator(
-    JavaEvaluator, HasLabelCol, HasRawPredictionCol, HasWeightCol, JavaMLReadable, JavaMLWritable
+    JavaEvaluator,
+    HasLabelCol,
+    HasRawPredictionCol,
+    HasWeightCol,
+
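The `_typing.pyi` change above adds a `Literal`-based alias so that static checkers reject unknown distance-measure strings at the call site. A standalone sketch of the same idea follows; the alias names mirror the stub, but `set_distance_measure` is a hypothetical validation helper, not a pyspark API. (Type checkers treat the stub's `Union[Literal[...], Literal[...]]` as equivalent to the single multi-value `Literal` used here.)

```python
from typing import Literal, get_args

# Illustrative aliases in the style of pyspark.ml._typing:
ClusteringEvaluatorMetricType = Literal["silhouette"]
ClusteringEvaluatorDistanceMeasureType = Literal["squaredEuclidean", "cosine"]


def set_distance_measure(value: ClusteringEvaluatorDistanceMeasureType) -> str:
    # Literal is not enforced at runtime, so validate explicitly here;
    # a static checker such as mypy flags bad values at the call site.
    allowed = get_args(ClusteringEvaluatorDistanceMeasureType)
    if value not in allowed:
        raise ValueError(f"distanceMeasure must be one of {allowed}, got {value!r}")
    return value


print(set_distance_measure("cosine"))  # cosine
```

With this in place, `set_distance_measure("manhattan")` raises a `ValueError` at runtime and is also rejected by mypy before the code ever runs, which is the payoff of encoding the allowed metric strings in the type instead of only in docstrings.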
[spark] branch master updated (cc53a0e -> 8d2e08f)
viirya pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from cc53a0e  [SPARK-38144][CORE] Remove unused `spark.storage.safetyFraction` config
 add 8d2e08f  [SPARK-38069][SQL][SS] Improve the calculation of time window

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)
[spark] branch branch-3.1 updated (0b35c24 -> 1aa9ef0)
dongjoon pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 0b35c24  Preparing development version 3.1.4-SNAPSHOT
 add 1aa9ef0  [SPARK-38144][CORE] Remove unused `spark.storage.safetyFraction` config

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/internal/config/package.scala | 5 -
 1 file changed, 5 deletions(-)
[spark] branch branch-3.2 updated (f58daa7 -> ba3c8c5)
dongjoon pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.

from f58daa7  [SPARK-38142][SQL][TESTS] Move `ArrowColumnVectorSuite` to `org.apache.spark.sql.vectorized`
 add ba3c8c5  [SPARK-38144][CORE] Remove unused `spark.storage.safetyFraction` config

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/internal/config/package.scala | 5 -
 1 file changed, 5 deletions(-)
[spark] branch master updated (43cce92 -> cc53a0e)
dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 43cce92  [SPARK-38124][SQL][SS] Introduce StatefulOpClusteredDistribution and apply to stream-stream join
 add cc53a0e  [SPARK-38144][CORE] Remove unused `spark.storage.safetyFraction` config

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/internal/config/package.scala | 5 -
 1 file changed, 5 deletions(-)
[spark] branch master updated (5b02a34 -> 43cce92)
kabhwan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 5b02a34  [SPARK-38142][SQL][TESTS] Move `ArrowColumnVectorSuite` to `org.apache.spark.sql.vectorized`
 add 43cce92  [SPARK-38124][SQL][SS] Introduce StatefulOpClusteredDistribution and apply to stream-stream join

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/plans/physical/partitioning.scala | 40 ++
 .../streaming/StreamingSymmetricHashJoinExec.scala |  4 +--
 .../spark/sql/streaming/StreamingJoinSuite.scala   |  2 +-
 3 files changed, 43 insertions(+), 3 deletions(-)
[spark] branch branch-3.2 updated (adba516 -> f58daa7)
dongjoon pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.

from adba516  [SPARK-37934][BUILD][3.2] Upgrade Jetty version to 9.4.44
 add f58daa7  [SPARK-38142][SQL][TESTS] Move `ArrowColumnVectorSuite` to `org.apache.spark.sql.vectorized`

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/{execution => }/vectorized/ArrowColumnVectorSuite.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
 rename sql/core/src/test/scala/org/apache/spark/sql/{execution => }/vectorized/ArrowColumnVectorSuite.scala (99%)
[spark] branch master updated (899d3bb -> 5b02a34)
dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 899d3bb  [SPARK-34183][SS] DataSource V2: Required distribution and ordering in micro-batch execution
 add 5b02a34  [SPARK-38142][SQL][TESTS] Move `ArrowColumnVectorSuite` to `org.apache.spark.sql.vectorized`

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/{execution => }/vectorized/ArrowColumnVectorSuite.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
 rename sql/core/src/test/scala/org/apache/spark/sql/{execution => }/vectorized/ArrowColumnVectorSuite.scala (99%)
[spark] branch master updated (c69f08f8 -> 899d3bb)
kabhwan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from c69f08f8  [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle
 add 899d3bb  [SPARK-34183][SS] DataSource V2: Required distribution and ordering in micro-batch execution

No new revisions were added by this update.

Summary of changes:
 .../write/RequiresDistributionAndOrdering.java     |  10 +
 .../spark/sql/errors/QueryCompilationErrors.scala  |   5 +
 .../sql/connector/catalog/InMemoryTable.scala      |   5 +-
 .../sql/execution/datasources/v2/V2Writes.scala    |  46 +++-
 .../execution/streaming/MicroBatchExecution.scala  |  11 +-
 .../sql/execution/streaming/StreamExecution.scala  |  12 +-
 .../streaming/continuous/ContinuousExecution.scala |  33 ++-
 .../sources/WriteToMicroBatchDataSource.scala      |  20 +-
 .../WriteDistributionAndOrderingSuite.scala        | 294 -
 9 files changed, 403 insertions(+), 33 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-37934][BUILD][3.2] Upgrade Jetty version to 9.4.44
sarutak pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new adba516  [SPARK-37934][BUILD][3.2] Upgrade Jetty version to 9.4.44

adba516 is described below

commit adba5165a56bd4e7a71fcad77c568c0cbc2e7f97
Author: Jack Richard Buggins
AuthorDate: Wed Feb 9 02:28:03 2022 +0900

    [SPARK-37934][BUILD][3.2] Upgrade Jetty version to 9.4.44

    ### What changes were proposed in this pull request?

    This pull request provides a minor update to the Jetty version, from `9.4.43.v20210629` to `9.4.44.v20210927`, which is required against branch-3.2 to fully resolve https://issues.apache.org/jira/browse/SPARK-37934

    ### Why are the changes needed?

    As discussed in https://github.com/apache/spark/pull/35338, the DoS vector is available even within a private or restricted network. The result below is the output of a Twistlock scan, which also detects this vulnerability.

    ```
    Source: https://github.com/eclipse/jetty.project/issues/6973
    CVE: PRISMA-2021-0182
    Sev.: medium
    Package Name: org.eclipse.jetty_jetty-server
    Package Ver.: 9.4.43.v20210629
    Status: fixed in 9.4.44
    Description: org.eclipse.jetty_jetty-server package versions before 9.4.44 are
    vulnerable to DoS (Denial of Service). Logback-access calls
    Request.getParameterNames() for request logging. That will force a request body
    read (if it hasn't been read before) per the servlet. This will now consume
    resources to read the request body content, which could easily be malicious
    (in size? in keys? etc), even though the application intentionally didn't read
    the request body.
    ```

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    * Core local
    ```
    $ build/sbt
    > project core
    > test
    ```
    * CI

    Closes #35442 from JackBuggins/branch-3.2.

    Authored-by: Jack Richard Buggins
    Signed-off-by: Kousuke Saruta
---
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pom.xml b/pom.xml
index bc3f925..8af3d6a 100644
--- a/pom.xml
+++ b/pom.xml
@@ -138,7 +138,7 @@
     10.14.2.0
     1.12.2
     1.6.13
-    9.4.43.v20210629
+    9.4.44.v20210927
     4.0.3
     0.10.0
     2.5.0
[spark] branch master updated: [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle
mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c69f08f8  [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

c69f08f8 is described below

commit c69f08f81042c3ecca4b5dfa5511c1217ae88096
Author: Venkata krishnan Sowrirajan
AuthorDate: Tue Feb 8 11:24:15 2022 -0600

    [SPARK-34826][SHUFFLE] Adaptively fetch shuffle mergers for push based shuffle

    ### What changes were proposed in this pull request?

    Currently shuffle mergers are fetched before the start of the ShuffleMapStage. But for initial stages this can be problematic, as shuffle mergers are nothing but unique hosts with shuffle services running, which could be very few based on executors, and this can cause the merge ratio to be low. With this approach, `ShuffleMapTask` queries for merger locations if they are not yet available and, once they are, starts using them for pushing the blocks. Since partitions are mapped uniquely to a merger location, it should be fine to not push for the earlier set of tasks. This should improve the merge ratio even for initial stages.

    ### Why are the changes needed?

    Performance improvement. No new API changes.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Added unit tests; this has also been working in our internal production environment for a while now.

    Closes #34122 from venkata91/SPARK-34826.

    Authored-by: Venkata krishnan Sowrirajan
    Signed-off-by: Mridul Muralidharan gmail.com>
---
 .../main/scala/org/apache/spark/Dependency.scala   |  39 ++--
 .../scala/org/apache/spark/MapOutputTracker.scala  |  88 +++-
 .../org/apache/spark/scheduler/DAGScheduler.scala  |  86 ++--
 .../org/apache/spark/scheduler/StageInfo.scala     |  16 +-
 .../spark/shuffle/ShuffleWriteProcessor.scala      |  14 +-
 .../spark/shuffle/sort/SortShuffleManager.scala    |   2 +-
 .../org/apache/spark/MapOutputTrackerSuite.scala   |  30 ++-
 .../apache/spark/scheduler/DAGSchedulerSuite.scala | 224 +++--
 8 files changed, 440 insertions(+), 59 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/Dependency.scala b/core/src/main/scala/org/apache/spark/Dependency.scala
index 8e348ee..fbb92b4 100644
--- a/core/src/main/scala/org/apache/spark/Dependency.scala
+++ b/core/src/main/scala/org/apache/spark/Dependency.scala
@@ -104,15 +104,17 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](

   private[this] val numPartitions = rdd.partitions.length

-  // By default, shuffle merge is enabled for ShuffleDependency if push based shuffle
+  // By default, shuffle merge is allowed for ShuffleDependency if push based shuffle
   // is enabled
-  private[this] var _shuffleMergeEnabled = canShuffleMergeBeEnabled()
+  private[this] var _shuffleMergeAllowed = canShuffleMergeBeEnabled()

-  private[spark] def setShuffleMergeEnabled(shuffleMergeEnabled: Boolean): Unit = {
-    _shuffleMergeEnabled = shuffleMergeEnabled
+  private[spark] def setShuffleMergeAllowed(shuffleMergeAllowed: Boolean): Unit = {
+    _shuffleMergeAllowed = shuffleMergeAllowed
   }

-  def shuffleMergeEnabled : Boolean = _shuffleMergeEnabled
+  def shuffleMergeEnabled : Boolean = shuffleMergeAllowed && mergerLocs.nonEmpty
+
+  def shuffleMergeAllowed : Boolean = _shuffleMergeAllowed

   /**
    * Stores the location of the list of chosen external shuffle services for handling the
@@ -124,7 +126,7 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    * Stores the information about whether the shuffle merge is finalized for the shuffle map stage
    * associated with this shuffle dependency
    */
-  private[this] var _shuffleMergedFinalized: Boolean = false
+  private[this] var _shuffleMergeFinalized: Boolean = false

   /**
    * shuffleMergeId is used to uniquely identify merging process of shuffle
@@ -135,31 +137,34 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
   def shuffleMergeId: Int = _shuffleMergeId

   def setMergerLocs(mergerLocs: Seq[BlockManagerId]): Unit = {
+    assert(shuffleMergeAllowed)
     this.mergerLocs = mergerLocs
   }

   def getMergerLocs: Seq[BlockManagerId] = mergerLocs

   private[spark] def markShuffleMergeFinalized(): Unit = {
-    _shuffleMergedFinalized = true
+    _shuffleMergeFinalized = true
+  }
+
+  private[spark] def isShuffleMergeFinalizedMarked: Boolean = {
+    _shuffleMergeFinalized
   }

   /**
-   * Returns true if push-based shuffle is disabled for this stage or empty RDD,
-   * or if the shuffle merge for this stage is finalized, i.e. the shuffle merge
-   * results for all partitions are available.
+   * Returns true if push-based shuffle is disabled or if the shuffle merge for
+   *
[spark] branch master updated (3d736d9 -> 6115f58)
sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 3d736d9  [SPARK-37412][PYTHON][ML] Inline typehints for pyspark.ml.stat
 add 6115f58  [MINOR][SQL] Remove redundant array creation in UnsafeRow

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
[spark] branch master updated: [SPARK-37412][PYTHON][ML] Inline typehints for pyspark.ml.stat
This is an automated email from the ASF dual-hosted git repository.

zero323 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3d736d9  [SPARK-37412][PYTHON][ML] Inline typehints for pyspark.ml.stat
3d736d9 is described below

commit 3d736d978cdaf345d81833a6216d41464fa86c69
Author: zero323
AuthorDate: Tue Feb 8 12:46:30 2022 +0100

    [SPARK-37412][PYTHON][ML] Inline typehints for pyspark.ml.stat

    ### What changes were proposed in this pull request?

    This PR migrates `pyspark.ml.stat` type annotations from stub file to inline type hints.
    (second take, after issue resulting in reversion of #35401)

    ### Why are the changes needed?

    Part of ongoing migration of type hints.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Existing tests.

    Closes #35437 from zero323/SPARK-37412.

    Authored-by: zero323
    Signed-off-by: zero323
---
 python/pyspark/ml/stat.py  | 62 ++-
 python/pyspark/ml/stat.pyi | 73 --
 2 files changed, 41 insertions(+), 94 deletions(-)

diff --git a/python/pyspark/ml/stat.py b/python/pyspark/ml/stat.py
index 15bb6ca..3b3588c 100644
--- a/python/pyspark/ml/stat.py
+++ b/python/pyspark/ml/stat.py
@@ -17,12 +17,20 @@
 
 import sys
 
+from typing import Optional, Tuple, TYPE_CHECKING
+
 from pyspark import since, SparkContext
 from pyspark.ml.common import _java2py, _py2java
-from pyspark.ml.wrapper import JavaWrapper, _jvm
+from pyspark.ml.linalg import Matrix, Vector
+from pyspark.ml.wrapper import JavaWrapper, _jvm  # type: ignore[attr-defined]
 from pyspark.sql.column import Column, _to_seq
+from pyspark.sql.dataframe import DataFrame
 from pyspark.sql.functions import lit
 
+if TYPE_CHECKING:
+    from py4j.java_gateway import JavaObject  # type: ignore[import]
+
 
 class ChiSquareTest:
     """
@@ -37,7 +45,9 @@ class ChiSquareTest:
     """
 
     @staticmethod
-    def test(dataset, featuresCol, labelCol, flatten=False):
+    def test(
+        dataset: DataFrame, featuresCol: str, labelCol: str, flatten: bool = False
+    ) -> DataFrame:
         """
         Perform a Pearson's independence test using dataset.
@@ -95,6 +105,8 @@ class ChiSquareTest:
         4.0
         """
         sc = SparkContext._active_spark_context
+        assert sc is not None
+
         javaTestObj = _jvm().org.apache.spark.ml.stat.ChiSquareTest
         args = [_py2java(sc, arg) for arg in (dataset, featuresCol, labelCol, flatten)]
         return _java2py(sc, javaTestObj.test(*args))
@@ -116,7 +128,7 @@ class Correlation:
     """
 
     @staticmethod
-    def corr(dataset, column, method="pearson"):
+    def corr(dataset: DataFrame, column: str, method: str = "pearson") -> DataFrame:
         """
         Compute the correlation matrix with specified method using dataset.
@@ -162,6 +174,8 @@ class Correlation:
                [ 0.4       ,  0.9486... ,  NaN,  1.        ]])
         """
         sc = SparkContext._active_spark_context
+        assert sc is not None
+
         javaCorrObj = _jvm().org.apache.spark.ml.stat.Correlation
         args = [_py2java(sc, arg) for arg in (dataset, column, method)]
         return _java2py(sc, javaCorrObj.corr(*args))
@@ -181,7 +195,7 @@ class KolmogorovSmirnovTest:
     """
 
     @staticmethod
-    def test(dataset, sampleCol, distName, *params):
+    def test(dataset: DataFrame, sampleCol: str, distName: str, *params: float) -> DataFrame:
         """
         Conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution
         equality. Currently supports the normal distribution, taking as parameters the mean and
@@ -228,9 +242,11 @@ class KolmogorovSmirnovTest:
         0.175
         """
         sc = SparkContext._active_spark_context
+        assert sc is not None
+
         javaTestObj = _jvm().org.apache.spark.ml.stat.KolmogorovSmirnovTest
         dataset = _py2java(sc, dataset)
-        params = [float(param) for param in params]
+        params = [float(param) for param in params]  # type: ignore[assignment]
         return _java2py(
             sc, javaTestObj.test(dataset, sampleCol, distName, _jvm().PythonUtils.toSeq(params))
         )
@@ -284,7 +300,7 @@ class Summarizer:
 
     @staticmethod
     @since("2.4.0")
-    def mean(col, weightCol=None):
+    def mean(col: Column, weightCol: Optional[Column] = None) -> Column:
         """
         return a column of mean summary
         """
@@ -292,7 +308,7 @@ class Summarizer:
 
     @staticmethod
     @since("3.0.0")
-    def sum(col, weightCol=None):
+    def sum(col: Column, weightCol: Optional[Column] = None) -> Column:
         """
         return a column of sum summary
         """
@@ -300,7 +316,7 @@ class
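The recurring pattern in this diff is worth noting: `SparkContext._active_spark_context` is typed `Optional`, so each call site adds `assert sc is not None` to narrow it for the type checker before use. A minimal self-contained sketch of the same pattern (hypothetical module, not the real `pyspark.ml.stat`):

```python
from typing import Optional


class Context:
    """Stand-in for an object with an Optional class-level 'active instance'."""

    _active: Optional["Context"] = None

    @classmethod
    def get(cls) -> Optional["Context"]:
        return cls._active


def describe(name: str, ctx: Optional[Context] = None) -> str:
    active = ctx or Context.get()
    # The assert narrows Optional[Context] to Context for mypy,
    # exactly as `assert sc is not None` does in stat.py.
    assert active is not None
    return f"{name} via {type(active).__name__}"


Context._active = Context()
print(describe("ChiSquareTest"))  # ChiSquareTest via Context
```

Without the assert, a strict type checker would reject attribute access on the `Optional` value; with it, the runtime behavior is unchanged unless the context is genuinely missing.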
[spark] branch master updated: [SPARK-38136][INFRA][TESTS] Update GitHub Action test image and PyArrow dependency
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6b62c30  [SPARK-38136][INFRA][TESTS] Update GitHub Action test image and PyArrow dependency
6b62c30 is described below

commit 6b62c30fa686dfca812b19f42e0333a0ddde2791
Author: Dongjoon Hyun
AuthorDate: Tue Feb 8 18:53:12 2022 +0900

    [SPARK-38136][INFRA][TESTS] Update GitHub Action test image and PyArrow dependency

    ### What changes were proposed in this pull request?

    This PR aims to update the `GitHub Action` test docker image to make the test environment
    up-to-date. For example, use `PyArrow 7.0.0` instead of `6.0.0`. In addition, `Python 3.8`'s
    `PyArrow` installation is also updated together to be consistent.
    Please note that this aims to upgrade the test infra instead of Spark itself.

    ### Why are the changes needed?

    | SW      | 20211228  | 20220207  |
    | ------- | --------- | --------- |
    | OpenJDK | 1.8.0_292 | 1.8.0_312 |
    | numpy   | 1.21.4    | 1.22.2    |
    | pandas  | 1.3.4     | 1.3.5     |
    | pyarrow | 6.0.0     | 7.0.0     |
    | scipy   | 1.7.2     | 1.8.0     |

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    - Pass the GitHub Action with the new image.
    - Check the package list in the GitHub Action log, or check the docker image directly.

    Closes #35434 from dongjoon-hyun/SPARK-38136.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Hyukjin Kwon
---
 .github/workflows/build_and_test.yml | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 4529cd9..ae35f50 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -252,7 +252,7 @@ jobs:
     - name: Install Python packages (Python 3.8)
       if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       run: |
-        python3.8 -m pip install 'numpy>=1.20.0' 'pyarrow<7.0.0' pandas scipy xmlrunner
+        python3.8 -m pip install 'numpy>=1.20.0' pyarrow pandas scipy xmlrunner
         python3.8 -m pip list
     # Run the tests.
     - name: Run tests
@@ -287,7 +287,7 @@ jobs:
     name: "Build modules (${{ format('{0}, {1} job', needs.configure-jobs.outputs.branch, needs.configure-jobs.outputs.type) }}): ${{ matrix.modules }}"
     runs-on: ubuntu-20.04
     container:
-      image: dongjoon/apache-spark-github-action-image:20211228
+      image: dongjoon/apache-spark-github-action-image:20220207
     strategy:
       fail-fast: false
       matrix:
@@ -391,7 +391,7 @@ jobs:
     name: "Build modules: sparkr"
     runs-on: ubuntu-20.04
     container:
-      image: dongjoon/apache-spark-github-action-image:20211228
+      image: dongjoon/apache-spark-github-action-image:20220207
     env:
       HADOOP_PROFILE: ${{ needs.configure-jobs.outputs.hadoop }}
       HIVE_PROFILE: hive2.3
@@ -462,7 +462,7 @@ jobs:
       PYSPARK_DRIVER_PYTHON: python3.9
       PYSPARK_PYTHON: python3.9
     container:
-      image: dongjoon/apache-spark-github-action-image:20211228
+      image: dongjoon/apache-spark-github-action-image:20220207
     steps:
     - name: Checkout Spark repository
       uses: actions/checkout@v2
@@ -530,7 +530,7 @@ jobs:
         # Jinja2 3.0.0+ causes error when building with Sphinx.
         # See also https://issues.apache.org/jira/browse/SPARK-35375.
         python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme ipython nbsphinx numpydoc 'jinja2<3.0.0'
-        python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 'pyarrow<7.0.0' pandas 'plotly>=4.8'
+        python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly>=4.8'
         apt-get update -y
         apt-get install -y ruby ruby-dev
         Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2'), repos='https://cloud.r-project.org/')"
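The only behavioral change in the pip lines above is dropping the `'<7.0.0'` upper bound, which is what lets the workflow pick up PyArrow 7.0.0. A tiny illustrative checker for a single `<` constraint (a sketch, not pip's actual resolver) makes the effect of the old pin explicit:

```python
def satisfies_upper_bound(candidate: str, requirement: str) -> bool:
    """Toy check of one '<' constraint like "pyarrow<7.0.0" against a dotted version."""
    _name, bound = requirement.split("<")
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(candidate) < as_tuple(bound)


# With the old pin, PyArrow 7.0.0 could never be selected; 6.x still could.
assert not satisfies_upper_bound("7.0.0", "pyarrow<7.0.0")
assert satisfies_upper_bound("6.0.0", "pyarrow<7.0.0")
```

Removing the bound entirely (the `pyarrow` spelling in the new lines) means the newest release on PyPI is installed, keeping the per-job installs in step with the 20220207 docker image.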
[spark] branch master updated (2e703ae -> 08c851d)
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 2e703ae  [SPARK-38030][SQL] Canonicalization should not remove nullability of AttributeReference dataType
  add 08c851d  [SPARK-37943][SQL] Use error classes in the compilation errors of grouping

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json   |  3 ++
 .../spark/sql/errors/QueryCompilationErrors.scala  |  4 ++-
 .../sql/errors/QueryCompilationErrorsSuite.scala   | 35 ++
 3 files changed, 41 insertions(+), 1 deletion(-)