[spark] branch master updated (74a6b9d -> 95a5c17)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 74a6b9d  [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default value as 'sequence' in default index in pandas on Spark
     add 95a5c17  Revert "[SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730"

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
[spark] branch branch-3.2 updated: [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default value as 'sequence' in default index in pandas on Spark
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 187aa6a  [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default value as 'sequence' in default index in pandas on Spark
187aa6a is described below

commit 187aa6ab7f3766ce8ebe6dd22c19b20e9bf043c3
Author: Hyukjin Kwon
AuthorDate: Sat Jul 31 08:31:10 2021 +0900

    [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default value as 'sequence' in default index in pandas on Spark

    ### What changes were proposed in this pull request?

    This PR is a follow-up of https://github.com/apache/spark/pull/33570, which mistakenly changed the default value of the default index.

    ### Why are the changes needed?

    The default was changed temporarily to check that the tests pass, but was never changed back.

    ### Does this PR introduce _any_ user-facing change?

    No; pandas API on Spark is not released yet. This only restores the mistakenly changed default value. (The changed default also made tests flaky, because row ordering is affected by the extra shuffle.)

    ### How was this patch tested?

    Manually tested.

    Closes #33596 from HyukjinKwon/SPARK-36338-followup.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 74a6b9d23b0280e07561714018ee7b7f31d2a2f3)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/config.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/config.py b/python/pyspark/pandas/config.py
index 10acc8c..b03f2e1 100644
--- a/python/pyspark/pandas/config.py
+++ b/python/pyspark/pandas/config.py
@@ -175,7 +175,7 @@ _options = [
     Option(
         key="compute.default_index_type",
         doc=("This sets the default index type: sequence, distributed and distributed-sequence."),
-        default="distributed-sequence",
+        default="sequence",
         types=str,
         check_func=(
             lambda v: v in ("sequence", "distributed", "distributed-sequence"),
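For reference, the restored option is set through the pandas-on-Spark options API; a minimal sketch (the option name and accepted values are taken from the diff above, everything else is illustrative):

```python
import pyspark.pandas as ps

# Accepted values, per check_func in config.py:
# "sequence", "distributed", and "distributed-sequence".
ps.set_option("compute.default_index_type", "sequence")
print(ps.get_option("compute.default_index_type"))  # sequence

# DataFrames created without an explicit index use this default index type.
psdf = ps.DataFrame({"a": [1, 2, 3]})
print(psdf.index)  # a 0, 1, 2 sequence index
```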
[spark] branch master updated (0e65ed5 -> 74a6b9d)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 0e65ed5  [SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730
     add 74a6b9d  [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default value as 'sequence' in default index in pandas on Spark

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/config.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch branch-3.2 updated: [SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 586f898  [SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730
586f898 is described below

commit 586f89886fc8029d35aaa11021a4a88909c85804
Author: Dongjoon Hyun
AuthorDate: Sat Jul 31 07:20:17 2021 +0900

    [SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730

    ### What changes were proposed in this pull request?

    This PR aims to upgrade the PySpark GitHub Action jobs to use the latest docker image `20210730`, which additionally contains `sklearn` and `mlflow`.
    - https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/commit/5ca94453d1108dfe40bceb8872387a1b19b0c783

    ```
    $ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 python3.9 -m pip list | grep mlflow
    mlflow                    1.19.0

    $ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 python3.9 -m pip list | grep sklearn
    sklearn                   0.0
    ```

    ### Why are the changes needed?

    This saves the per-run installation time.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the GitHub Action PySpark jobs.

    Closes #33595 from dongjoon-hyun/SPARK-36345.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 0e65ed5fb9c62671789a651a993abbb9f546367c)
    Signed-off-by: Hyukjin Kwon
---
 .github/workflows/build_and_test.yml | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 3eb12f5..d247e6b 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -149,7 +149,7 @@ jobs:
     name: "Build modules: ${{ matrix.modules }}"
     runs-on: ubuntu-20.04
     container:
-      image: dongjoon/apache-spark-github-action-image:20210602
+      image: dongjoon/apache-spark-github-action-image:20210730
     strategy:
       fail-fast: false
       matrix:
@@ -227,8 +227,6 @@ jobs:
     # Run the tests.
     - name: Run tests
       run: |
-        # TODO(SPARK-36345): Install mlflow>=1.0 and sklearn in Python 3.9 of the base image
-        python3.9 -m pip install 'mlflow>=1.0' sklearn
         export PATH=$PATH:$HOME/miniconda/bin
         ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
     - name: Upload test results to report
[spark] branch master updated: [SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0e65ed5  [SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730
0e65ed5 is described below

commit 0e65ed5fb9c62671789a651a993abbb9f546367c
Author: Dongjoon Hyun
AuthorDate: Sat Jul 31 07:20:17 2021 +0900

    [SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730

    ### What changes were proposed in this pull request?

    This PR aims to upgrade the PySpark GitHub Action jobs to use the latest docker image `20210730`, which additionally contains `sklearn` and `mlflow`.
    - https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/commit/5ca94453d1108dfe40bceb8872387a1b19b0c783

    ```
    $ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 python3.9 -m pip list | grep mlflow
    mlflow                    1.19.0

    $ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 python3.9 -m pip list | grep sklearn
    sklearn                   0.0
    ```

    ### Why are the changes needed?

    This saves the per-run installation time.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the GitHub Action PySpark jobs.

    Closes #33595 from dongjoon-hyun/SPARK-36345.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Hyukjin Kwon
---
 .github/workflows/build_and_test.yml | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 17908ff..58487a4 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -186,7 +186,7 @@ jobs:
     name: "Build modules: ${{ matrix.modules }}"
     runs-on: ubuntu-20.04
     container:
-      image: dongjoon/apache-spark-github-action-image:20210602
+      image: dongjoon/apache-spark-github-action-image:20210730
     strategy:
       fail-fast: false
      matrix:
@@ -252,8 +252,6 @@ jobs:
     # Run the tests.
     - name: Run tests
       run: |
-        # TODO(SPARK-36345): Install mlflow>=1.0 and sklearn in Python 3.9 of the base image
-        python3.9 -m pip install 'mlflow>=1.0' sklearn
         export PATH=$PATH:$HOME/miniconda/bin
         ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
     - name: Upload test results to report
[spark] branch branch-3.2 updated: [SPARK-35881][SQL] Add support for columnar execution of final query stage in AdaptiveSparkPlanExec
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new f9f5656  [SPARK-35881][SQL] Add support for columnar execution of final query stage in AdaptiveSparkPlanExec
f9f5656 is described below

commit f9f5656491c1edbfbc9f8c13840aaa935c49037f
Author: Andy Grove
AuthorDate: Fri Jul 30 13:21:50 2021 -0500

    [SPARK-35881][SQL] Add support for columnar execution of final query stage in AdaptiveSparkPlanExec

    ### What changes were proposed in this pull request?

    Changes in this PR:

    - `AdaptiveSparkPlanExec` has new methods `finalPlanSupportsColumnar` and `doExecuteColumnar` to support adaptive queries where the final query stage produces columnar data.
    - `SessionState` now has a new set of injectable rules named `finalQueryStagePrepRules` that can be applied to the final query stage.
    - `AdaptiveSparkPlanExec` can now safely be wrapped by either `RowToColumnarExec` or `ColumnarToRowExec`. A Spark plugin can use the new rules to remove the root `ColumnarToRowExec` transition that is inserted by previous rules, and at execution time can call `finalPlanSupportsColumnar` to see if the final query stage is columnar. If the plan is columnar, then the plugin can safely call `doExecuteColumnar`. The adaptive plan can be wrapped in either `RowToColumnarExec` or `ColumnarToRowExec` to force a particular output format. There are fast paths in both of these operators to avoid any re [...]

    ### Why are the changes needed?

    Without this change it is necessary to use reflection to get the final physical plan, to determine whether it is columnar, and to execute it as a columnar plan. `AdaptiveSparkPlanExec` only provides public methods for row-based execution.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    I have manually tested this patch with the RAPIDS Accelerator for Apache Spark.

    Closes #33140 from andygrove/support-columnar-adaptive.

    Authored-by: Andy Grove
    Signed-off-by: Thomas Graves
    (cherry picked from commit 0f538402fb76e4d6182cc881219d53b5fdf73af1)
    Signed-off-by: Thomas Graves
---
 .../apache/spark/sql/SparkSessionExtensions.scala  |  22 +++-
 .../org/apache/spark/sql/execution/Columnar.scala  | 134 -
 .../execution/adaptive/AdaptiveSparkPlanExec.scala |  46 +--
 .../sql/internal/BaseSessionStateBuilder.scala     |   7 +-
 .../apache/spark/sql/internal/SessionState.scala   |   3 +-
 5 files changed, 141 insertions(+), 71 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
index b14dce6..18ebae5 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
@@ -47,6 +47,7 @@ import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}
 * (External) Catalog listeners.
 * Columnar Rules.
 * Adaptive Query Stage Preparation Rules.
+ * Adaptive Query Post Stage Preparation Rules.
 *
 *
 * The extensions can be used by calling `withExtensions` on the [[SparkSession.Builder]], for
@@ -110,9 +111,12 @@ class SparkSessionExtensions {
   type TableFunctionDescription = (FunctionIdentifier, ExpressionInfo, TableFunctionBuilder)
   type ColumnarRuleBuilder = SparkSession => ColumnarRule
   type QueryStagePrepRuleBuilder = SparkSession => Rule[SparkPlan]
+  type PostStageCreationRuleBuilder = SparkSession => Rule[SparkPlan]

   private[this] val columnarRuleBuilders = mutable.Buffer.empty[ColumnarRuleBuilder]
   private[this] val queryStagePrepRuleBuilders = mutable.Buffer.empty[QueryStagePrepRuleBuilder]
+  private[this] val postStageCreationRuleBuilders =
+    mutable.Buffer.empty[PostStageCreationRuleBuilder]

   /**
    * Build the override rules for columnar execution.
@@ -129,6 +133,14 @@ class SparkSessionExtensions {
   }

   /**
+   * Build the override rules for the final query stage preparation phase of adaptive query
+   * execution.
+   */
+  private[sql] def buildPostStageCreationRules(session: SparkSession): Seq[Rule[SparkPlan]] = {
+    postStageCreationRuleBuilders.map(_.apply(session)).toSeq
+  }
+
+  /**
    * Inject a rule that can override the columnar execution of an executor.
    */
   def injectColumnar(builder: ColumnarRuleBuilder): Unit = {
@@ -136,13 +148,21 @@ class SparkSessionExtensions {
   }

   /**
-   * Inject a rule that can override the the query stage preparation phase of adaptive query
+   * Inject a rule that can override the query stage preparation phase of adaptive query
    * execution.
    */
   def injectQueryStagePrepRule(builder: QueryStagePrepRuleBuilder): Unit = {
     queryStagePrepRuleBuilders += builder
   }
[spark] branch master updated: [SPARK-35881][SQL] Add support for columnar execution of final query stage in AdaptiveSparkPlanExec
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0f53840  [SPARK-35881][SQL] Add support for columnar execution of final query stage in AdaptiveSparkPlanExec
0f53840 is described below

commit 0f538402fb76e4d6182cc881219d53b5fdf73af1
Author: Andy Grove
AuthorDate: Fri Jul 30 13:21:50 2021 -0500

    [SPARK-35881][SQL] Add support for columnar execution of final query stage in AdaptiveSparkPlanExec

    ### What changes were proposed in this pull request?

    Changes in this PR:

    - `AdaptiveSparkPlanExec` has new methods `finalPlanSupportsColumnar` and `doExecuteColumnar` to support adaptive queries where the final query stage produces columnar data.
    - `SessionState` now has a new set of injectable rules named `finalQueryStagePrepRules` that can be applied to the final query stage.
    - `AdaptiveSparkPlanExec` can now safely be wrapped by either `RowToColumnarExec` or `ColumnarToRowExec`. A Spark plugin can use the new rules to remove the root `ColumnarToRowExec` transition that is inserted by previous rules, and at execution time can call `finalPlanSupportsColumnar` to see if the final query stage is columnar. If the plan is columnar, then the plugin can safely call `doExecuteColumnar`. The adaptive plan can be wrapped in either `RowToColumnarExec` or `ColumnarToRowExec` to force a particular output format. There are fast paths in both of these operators to avoid any re [...]

    ### Why are the changes needed?

    Without this change it is necessary to use reflection to get the final physical plan, to determine whether it is columnar, and to execute it as a columnar plan. `AdaptiveSparkPlanExec` only provides public methods for row-based execution.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    I have manually tested this patch with the RAPIDS Accelerator for Apache Spark.

    Closes #33140 from andygrove/support-columnar-adaptive.

    Authored-by: Andy Grove
    Signed-off-by: Thomas Graves
---
 .../apache/spark/sql/SparkSessionExtensions.scala  |  22 +++-
 .../org/apache/spark/sql/execution/Columnar.scala  | 134 -
 .../execution/adaptive/AdaptiveSparkPlanExec.scala |  46 +--
 .../sql/internal/BaseSessionStateBuilder.scala     |   7 +-
 .../apache/spark/sql/internal/SessionState.scala   |   3 +-
 5 files changed, 141 insertions(+), 71 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
index b14dce6..18ebae5 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
@@ -47,6 +47,7 @@ import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}
 * (External) Catalog listeners.
 * Columnar Rules.
 * Adaptive Query Stage Preparation Rules.
+ * Adaptive Query Post Stage Preparation Rules.
 *
 *
 * The extensions can be used by calling `withExtensions` on the [[SparkSession.Builder]], for
@@ -110,9 +111,12 @@ class SparkSessionExtensions {
   type TableFunctionDescription = (FunctionIdentifier, ExpressionInfo, TableFunctionBuilder)
   type ColumnarRuleBuilder = SparkSession => ColumnarRule
   type QueryStagePrepRuleBuilder = SparkSession => Rule[SparkPlan]
+  type PostStageCreationRuleBuilder = SparkSession => Rule[SparkPlan]

   private[this] val columnarRuleBuilders = mutable.Buffer.empty[ColumnarRuleBuilder]
   private[this] val queryStagePrepRuleBuilders = mutable.Buffer.empty[QueryStagePrepRuleBuilder]
+  private[this] val postStageCreationRuleBuilders =
+    mutable.Buffer.empty[PostStageCreationRuleBuilder]

   /**
    * Build the override rules for columnar execution.
@@ -129,6 +133,14 @@ class SparkSessionExtensions {
   }

   /**
+   * Build the override rules for the final query stage preparation phase of adaptive query
+   * execution.
+   */
+  private[sql] def buildPostStageCreationRules(session: SparkSession): Seq[Rule[SparkPlan]] = {
+    postStageCreationRuleBuilders.map(_.apply(session)).toSeq
+  }
+
+  /**
    * Inject a rule that can override the columnar execution of an executor.
    */
   def injectColumnar(builder: ColumnarRuleBuilder): Unit = {
@@ -136,13 +148,21 @@ class SparkSessionExtensions {
   }

   /**
-   * Inject a rule that can override the the query stage preparation phase of adaptive query
+   * Inject a rule that can override the query stage preparation phase of adaptive query
    * execution.
    */
   def injectQueryStagePrepRule(builder: QueryStagePrepRuleBuilder): Unit = {
     queryStagePrepRuleBuilders += builder
   }

+  /**
+   * Inject a rule that
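For orientation, a rough sketch of what the new entry points make possible. This is an illustration only: it assumes AQE is enabled, that the methods are public as the commit message describes, and it reaches the JVM plan over py4j; a real plugin (such as the RAPIDS Accelerator) would do this natively in Scala.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .getOrCreate())
df = spark.range(100).groupBy("id").count()

# Reach the JVM-side executed plan through py4j (illustration only).
jplan = df._jdf.queryExecution().executedPlan()
if jplan.getClass().getSimpleName() == "AdaptiveSparkPlanExec":
    # Previously a plugin needed reflection here; with this commit it can
    # ask the adaptive plan directly whether the final stage is columnar.
    if jplan.finalPlanSupportsColumnar():
        batches = jplan.doExecuteColumnar()  # JVM RDD[ColumnarBatch]
    else:
        rows = jplan.execute()  # JVM RDD[InternalRow]
```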
[spark] branch branch-3.2 updated: [SPARK-36350][PYTHON] Move some logic related to F.nanvl to DataTypeOps
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new a4dcda1  [SPARK-36350][PYTHON] Move some logic related to F.nanvl to DataTypeOps
a4dcda1 is described below

commit a4dcda179448cc7632f5022e26be3cafb3d82505
Author: Takuya UESHIN
AuthorDate: Fri Jul 30 11:19:49 2021 -0700

    [SPARK-36350][PYTHON] Move some logic related to F.nanvl to DataTypeOps

    ### What changes were proposed in this pull request?

    Move some logic related to `F.nanvl` to `DataTypeOps`.

    ### Why are the changes needed?

    There are several places that branch on `FloatType` or `DoubleType` in order to use `F.nanvl`; `DataTypeOps` should handle this instead.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Existing tests.

    Closes #33582 from ueshin/issues/SPARK-36350/nan_to_null.

    Authored-by: Takuya UESHIN
    Signed-off-by: Takuya UESHIN
    (cherry picked from commit 895e3f5e2aff46f5c4eed8ac9ddf4dbfb16ef5fd)
    Signed-off-by: Takuya UESHIN
---
 python/pyspark/pandas/data_type_ops/base.py    |  3 ++
 python/pyspark/pandas/data_type_ops/num_ops.py | 11 +
 python/pyspark/pandas/frame.py                 | 50 +++
 python/pyspark/pandas/generic.py               | 66 --
 python/pyspark/pandas/groupby.py               | 66 +-
 python/pyspark/pandas/series.py                | 22 +++--
 6 files changed, 112 insertions(+), 106 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/base.py b/python/pyspark/pandas/data_type_ops/base.py
index 7eb2a95..743b2c5 100644
--- a/python/pyspark/pandas/data_type_ops/base.py
+++ b/python/pyspark/pandas/data_type_ops/base.py
@@ -366,5 +366,8 @@ class DataTypeOps(object, metaclass=ABCMeta):
             ),
         )

+    def nan_to_null(self, index_ops: IndexOpsLike) -> IndexOpsLike:
+        return index_ops.copy()
+
     def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> IndexOpsLike:
         raise TypeError("astype can not be applied to %s." % self.pretty_name)

diff --git a/python/pyspark/pandas/data_type_ops/num_ops.py b/python/pyspark/pandas/data_type_ops/num_ops.py
index a7987bc..f84c1af 100644
--- a/python/pyspark/pandas/data_type_ops/num_ops.py
+++ b/python/pyspark/pandas/data_type_ops/num_ops.py
@@ -326,6 +326,14 @@ class FractionalOps(NumericOps):
             ),
         )

+    def nan_to_null(self, index_ops: IndexOpsLike) -> IndexOpsLike:
+        # Special handle floating point types because Spark's count treats nan as a valid value,
+        # whereas pandas count doesn't include nan.
+        return index_ops._with_new_scol(
+            F.nanvl(index_ops.spark.column, SF.lit(None)),
+            field=index_ops._internal.data_fields[0].copy(nullable=True),
+        )
+
     def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> IndexOpsLike:
         dtype, spark_type = pandas_on_spark_type(dtype)

@@ -385,6 +393,9 @@ class DecimalOps(FractionalOps):
             ),
         )

+    def nan_to_null(self, index_ops: IndexOpsLike) -> IndexOpsLike:
+        return index_ops.copy()
+
     def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> IndexOpsLike:
         # TODO(SPARK-36230): check index_ops.hasnans after fixing SPARK-36230
         dtype, spark_type = pandas_on_spark_type(dtype)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index de675f1f..4a737b6 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -642,7 +642,7 @@ class DataFrame(Frame, Generic[T]):

     def _reduce_for_stat_function(
         self,
-        sfun: Union[Callable[[Column], Column], Callable[[Column, DataType], Column]],
+        sfun: Callable[["Series"], Column],
         name: str,
         axis: Optional[Axis] = None,
         numeric_only: bool = True,
@@ -664,7 +664,6 @@
         is mainly for pandas compatibility. Only 'DataFrame.count' uses this parameter
         currently.
         """
-        from inspect import signature
         from pyspark.pandas.series import Series, first_series

         axis = validate_axis(axis)
@@ -673,29 +672,19 @@
             exprs = [SF.lit(None).cast(StringType()).alias(SPARK_DEFAULT_INDEX_NAME)]
             new_column_labels = []
-            num_args = len(signature(sfun).parameters)
             for label in self._internal.column_labels:
-                spark_column = self._internal.spark_column_for(label)
-                spark_type = self._internal.spark_type_for(label)
+                psser = self._psser_for(label)

-                is_numeric_or_boo
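A short, hedged illustration of the behavior difference that `nan_to_null` handles (plain PySpark, not part of the commit; the data is made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (float("nan"),), (None,)], "x double")

# Spark's count() treats NaN as a valid value, so this counts 2 (NaN included).
df.select(F.count("x")).show()

# nanvl(x, null) maps NaN to null first, so the count drops to 1 --
# matching pandas, whose count() excludes NaN.
df.select(F.count(F.nanvl("x", F.lit(None).cast("double")))).show()
```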
[spark] branch master updated (801017f -> 895e3f5)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 801017f  [SPARK-36358][K8S] Upgrade Kubernetes Client Version to 5.6.0
     add 895e3f5  [SPARK-36350][PYTHON] Move some logic related to F.nanvl to DataTypeOps

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/data_type_ops/base.py    |  3 ++
 python/pyspark/pandas/data_type_ops/num_ops.py | 11 +
 python/pyspark/pandas/frame.py                 | 50 +++
 python/pyspark/pandas/generic.py               | 66 --
 python/pyspark/pandas/groupby.py               | 66 +-
 python/pyspark/pandas/series.py                | 22 +++--
 6 files changed, 112 insertions(+), 106 deletions(-)
[spark] branch master updated (c6140d4 -> 801017f)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from c6140d4  [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side
     add 801017f  [SPARK-36358][K8S] Upgrade Kubernetes Client Version to 5.6.0

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 42 -
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 42 -
 pom.xml                                 |  2 +-
 3 files changed, 43 insertions(+), 43 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new fee87f1  [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side
fee87f1 is described below

commit fee87f13d1be2fb018ff200d4f6dbfdb30f03a99
Author: Hyukjin Kwon
AuthorDate: Fri Jul 30 22:29:23 2021 +0900

    [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side

    ### What changes were proposed in this pull request?

    This PR proposes to implement the `distributed-sequence` index on the Scala side.

    ### Why are the changes needed?

    - Avoid unnecessary (de)serialization
    - Keep the nullability in the input DataFrame when `distributed-sequence` is enabled. During the serialization, all fields are currently made nullable (see https://github.com/apache/spark/pull/32775#discussion_r645882104)

    ### Does this PR introduce _any_ user-facing change?

    No to end users, since pandas API on Spark is not released yet.

    ```python
    import pyspark.pandas as ps
    ps.set_option('compute.default_index_type', 'distributed-sequence')
    ps.range(1).spark.print_schema()
    ```

    Before:

    ```
    root
     |-- id: long (nullable = true)
    ```

    After:

    ```
    root
     |-- id: long (nullable = false)
    ```

    ### How was this patch tested?

    Manually tested, and existing tests should cover them.

    Closes #33570 from HyukjinKwon/SPARK-36338.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit c6140d4d0af7f34152c1b681110a077be36f4068)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/accessors.py                 |  24 +---
 python/pyspark/pandas/config.py                    |   2 +-
 python/pyspark/pandas/indexes/base.py              |  16 +--
 python/pyspark/pandas/indexes/multi.py             |   2 +-
 python/pyspark/pandas/indexing.py                  |  16 +--
 python/pyspark/pandas/internal.py                  | 150 +++--
 python/pyspark/pandas/series.py                    |  12 +-
 python/pyspark/pandas/tests/test_dataframe.py      |  37 ++---
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  25 
 .../org/apache/spark/sql/DataFrameSuite.scala      |   6 +
 10 files changed, 89 insertions(+), 201 deletions(-)

diff --git a/python/pyspark/pandas/accessors.py b/python/pyspark/pandas/accessors.py
index 6454938..d71ded1 100644
--- a/python/pyspark/pandas/accessors.py
+++ b/python/pyspark/pandas/accessors.py
@@ -169,7 +169,7 @@ class PandasOnSparkFrameMethods(object):
                 for scol, label in zip(internal.data_spark_columns, internal.column_labels)
             ]
         )
-        sdf, force_nullable = attach_func(sdf, name_like_string(column))
+        sdf = attach_func(sdf, name_like_string(column))

         return DataFrame(
             InternalFrame(
@@ -178,28 +178,18 @@ class PandasOnSparkFrameMethods(object):
                     scol_for(sdf, SPARK_INDEX_NAME_FORMAT(i)) for i in range(internal.index_level)
                 ],
                 index_names=internal.index_names,
-                index_fields=(
-                    [field.copy(nullable=True) for field in internal.index_fields]
-                    if force_nullable
-                    else internal.index_fields
-                ),
+                index_fields=internal.index_fields,
                 column_labels=internal.column_labels + [column],
                 data_spark_columns=(
                     [scol_for(sdf, name_like_string(label)) for label in internal.column_labels]
                     + [scol_for(sdf, name_like_string(column))]
                 ),
-                data_fields=(
-                    (
-                        [field.copy(nullable=True) for field in internal.data_fields]
-                        if force_nullable
-                        else internal.data_fields
+                data_fields=internal.data_fields
+                + [
+                    InternalField.from_struct_field(
+                        StructField(name_like_string(column), LongType(), nullable=False)
                     )
-                    + [
-                        InternalField.from_struct_field(
-                            StructField(name_like_string(column), LongType(), nullable=False)
-                        )
-                    ]
-                ),
+                ],
                 column_label_names=internal.column_label_names,
             ).resolved_copy
         )

diff --git a/python/pyspark/pandas/config.py b/python/pyspark/pandas/config.py
index b03f2e1..10acc8c 100644
--- a/python/pyspark/pandas/config.py
+++ b/python/pyspark/pandas/config.py
@@ -175,7 +175,7 @@ _options = [
     Option(
         key
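A hedged sketch of the user-visible effect (only `set_option` and the schema check come from the commit message above; the rest is illustrative):

```python
import pyspark.pandas as ps

ps.set_option("compute.default_index_type", "distributed-sequence")

# The distributed-sequence default index is now attached on the JVM side
# (via the Dataset.scala change in this commit), so no pandas UDF
# round-trip is involved and the input schema's nullability is preserved:
# id prints as nullable = false.
psdf = ps.range(3)
psdf.spark.print_schema()
```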
[spark] branch master updated (dd2ca0a -> c6140d4)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from dd2ca0a  [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in pandas on Spark
     add c6140d4  [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/accessors.py                 |  24 +---
 python/pyspark/pandas/config.py                    |   2 +-
 python/pyspark/pandas/indexes/base.py              |  16 +--
 python/pyspark/pandas/indexes/multi.py             |   2 +-
 python/pyspark/pandas/indexing.py                  |  16 +--
 python/pyspark/pandas/internal.py                  | 150 +++--
 python/pyspark/pandas/series.py                    |  12 +-
 python/pyspark/pandas/tests/test_dataframe.py      |  37 ++---
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  25 
 .../org/apache/spark/sql/DataFrameSuite.scala      |   6 +
 10 files changed, 89 insertions(+), 201 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in pandas on Spark
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 9cd3708  [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in pandas on Spark
9cd3708 is described below

commit 9cd370894baafb59d76d269595f6fdbaf94c76b8
Author: Hyukjin Kwon
AuthorDate: Fri Jul 30 22:28:19 2021 +0900

    [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in pandas on Spark

    ### What changes were proposed in this pull request?

    This PR is a partial revert of https://github.com/apache/spark/pull/33567 that restores the logic to skip the mlflow-related tests when mlflow is not installed.

    ### Why are the changes needed?

    It is consistent with other optional libraries, e.g., PyArrow. It also avoids potential dev-environment breakage (see also https://github.com/apache/spark/pull/33567#issuecomment-889841829).

    ### Does this PR introduce _any_ user-facing change?

    No, dev-only.

    ### How was this patch tested?

    This is a partial revert. CI should test it out too.

    Closes #33589 from HyukjinKwon/SPARK-36254.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit dd2ca0aee25f6da1a1cf36ea8c7e0095420b7c00)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/mlflow.py | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/mlflow.py b/python/pyspark/pandas/mlflow.py
index 4e48369..719db40 100644
--- a/python/pyspark/pandas/mlflow.py
+++ b/python/pyspark/pandas/mlflow.py
@@ -229,4 +229,10 @@ def _test() -> None:


 if __name__ == "__main__":
-    _test()
+    try:
+        import mlflow  # noqa: F401
+        import sklearn  # noqa: F401
+
+        _test()
+    except ImportError:
+        pass
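For context, `pyspark.pandas.mlflow` is the module whose doctests are being guarded; a hedged usage sketch (the model URI is a placeholder, and both mlflow and scikit-learn must be installed, which is exactly what the guard checks):

```python
import pyspark.pandas as ps
from pyspark.pandas.mlflow import load_model

# "runs:/<run-id>/model" is a placeholder; any MLflow model URI works here.
model = load_model("runs:/<run-id>/model")

df = ps.DataFrame({"x1": [2.0, 2.0], "x2": [4.0, 3.0]})
preds = model.predict(df)  # pandas-on-Spark Series of predictions
```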
[spark] branch master updated (387a251 -> dd2ca0a)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 387a251  [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown
     add dd2ca0a  [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in pandas on Spark

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/mlflow.py | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)
[GitHub] [spark-website] tgravescs commented on a change in pull request #350: [SPARK-36335] Add local-cluster docs to developer-tools.md
tgravescs commented on a change in pull request #350:
URL: https://github.com/apache/spark-website/pull/350#discussion_r679883614

File path: site/developer-tools.html

@@ -793,6 +793,18 @@ In Spark unit tests
 The platform-specific paths to the profiler agents are listed in the
 <a href="http://www.yourkit.com/docs/80/help/agent.jsp">YourKit documentation</a>.

+Local-cluster mode
+
+When launching applications with spark-submit, besides options in
+<a href="https://spark.apache.org/docs/latest/submitting-applications.html#master-urls">Master URLs</a>,
+set the local-cluster option to emulate a distributed cluster in a single JVM.

Review comment:
    Shouldn't we explicitly say this is for unit testing only?
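As a hedged illustration of the mode under discussion (the `local-cluster[N, cores, memoryPerWorkerMB]` master URL requires a full Spark build on the machine and is intended for Spark's own testing, which is the reviewer's point):

```python
from pyspark.sql import SparkSession

# local-cluster[2, 1, 1024]: 2 workers, 1 core and 1024 MB of memory each.
# Useful for reproducing distributed behavior (shuffles, multiple executors)
# in tests without a real cluster; not meant for production use.
spark = (SparkSession.builder
         .master("local-cluster[2, 1, 1024]")
         .appName("local-cluster-demo")
         .getOrCreate())
print(spark.range(8).rdd.getNumPartitions())
```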
[spark] branch branch-3.2 updated: [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new f6bb75b  [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown
f6bb75b is described below

commit f6bb75b0bcae8c0bccf361dfd3710ce5f17173d5
Author: Wenchen Fan
AuthorDate: Fri Jul 30 00:26:32 2021 -0700

    [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown

    ### What changes were proposed in this pull request?

    This is a followup of https://github.com/apache/spark/pull/33352, to simplify the JDBC aggregate pushdown:
    1. We should get the schema of the aggregate query by asking the JDBC server, instead of calculating it by ourselves. This simplifies the code a lot, and is also more robust: the data type of SUM may vary across databases, so it is fragile to assume it is always the same as Spark's.
    2. Because of 1, we can now remove the `dataType` property from the public `Sum` expression.

    This PR also contains some small improvements:
    1. Spark should deduplicate the aggregate expressions before pushing them down.
    2. Improve the `toString` of public aggregate expressions to make them more SQL-like.

    ### Why are the changes needed?

    Code and API simplification.

    ### Does this PR introduce _any_ user-facing change?

    This API is not released yet.

    ### How was this patch tested?

    Existing tests.

    Closes #33579 from cloud-fan/dsv2.

    Authored-by: Wenchen Fan
    Signed-off-by: Liang-Chi Hsieh
    (cherry picked from commit 387a251a682a596ba4156b7d12e6025762ebac85)
    Signed-off-by: Liang-Chi Hsieh
---
 .../spark/sql/connector/expressions/Count.java     |  8 +-
 .../spark/sql/connector/expressions/CountStar.java |  2 +-
 .../spark/sql/connector/expressions/Max.java       |  2 +-
 .../spark/sql/connector/expressions/Min.java       |  2 +-
 .../spark/sql/connector/expressions/Sum.java       | 12 +--
 .../execution/datasources/DataSourceStrategy.scala |  3 +-
 .../sql/execution/datasources/jdbc/JDBCRDD.scala   | 45 +--
 .../execution/datasources/jdbc/JDBCRelation.scala  |  7 +-
 .../execution/datasources/v2/PushDownUtils.scala   |  2 +-
 .../datasources/v2/V2ScanRelationPushDown.scala    | 24 --
 .../execution/datasources/v2/jdbc/JDBCScan.scala   | 12 ++-
 .../datasources/v2/jdbc/JDBCScanBuilder.scala      | 88 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala    | 30 
 13 files changed, 121 insertions(+), 116 deletions(-)

diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Count.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Count.java
index 0e28a93..fecde71 100644
--- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Count.java
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Count.java
@@ -38,7 +38,13 @@ public final class Count implements AggregateFunc {
   public boolean isDistinct() { return isDistinct; }

   @Override
-  public String toString() { return "Count(" + column.describe() + "," + isDistinct + ")"; }
+  public String toString() {
+    if (isDistinct) {
+      return "COUNT(DISTINCT " + column.describe() + ")";
+    } else {
+      return "COUNT(" + column.describe() + ")";
+    }
+  }

   @Override
   public String describe() { return this.toString(); }

diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/CountStar.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/CountStar.java
index 21a3564..8e799cd 100644
--- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/CountStar.java
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/CountStar.java
@@ -31,7 +31,7 @@ public final class CountStar implements AggregateFunc {
   }

   @Override
-  public String toString() { return "CountStar()"; }
+  public String toString() { return "COUNT(*)"; }

   @Override
   public String describe() { return this.toString(); }

diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Max.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Max.java
index d2ff6b2..3ce45ca 100644
--- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Max.java
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Max.java
@@ -33,7 +33,7 @@ public final class Max implements AggregateFunc {
   public FieldReference column() { return column; }

   @Override
-  public String toString() { return "Max(" + column.describe() + ")"; }
+  public String toString() { return "MAX(" + column.describe() + ")"; }

   @Override
   public String describe() { return this.toString(); }

diff --git a/sql/catalyst/src/
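To see the pushdown from the user side, a hedged sketch: the H2 catalog setup below mirrors the pattern used by `JDBCV2Suite`, and the URL, catalog, and table names are illustrative. With it in place, the aggregate can be compiled to `SELECT MAX(SALARY), MIN(SALARY) FROM ...` on the database side, and, per this commit, the schema of that pushed-down query is obtained by asking the JDBC server rather than being derived in Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Aggregate pushdown applies on the DSv2 JDBC path, i.e. through a catalog.
spark.conf.set(
    "spark.sql.catalog.h2",
    "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")  # illustrative
spark.conf.set("spark.sql.catalog.h2.driver", "org.h2.Driver")

# MAX/MIN can be pushed to the database instead of computed in Spark.
spark.sql("SELECT MAX(SALARY), MIN(SALARY) FROM h2.test.employee").show()
```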
[spark] branch master updated (abce61f -> 387a251)
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from abce61f  [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
     add 387a251  [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/connector/expressions/Count.java     |  8 +-
 .../spark/sql/connector/expressions/CountStar.java |  2 +-
 .../spark/sql/connector/expressions/Max.java       |  2 +-
 .../spark/sql/connector/expressions/Min.java       |  2 +-
 .../spark/sql/connector/expressions/Sum.java       | 12 +--
 .../execution/datasources/DataSourceStrategy.scala |  3 +-
 .../sql/execution/datasources/jdbc/JDBCRDD.scala   | 45 +--
 .../execution/datasources/jdbc/JDBCRelation.scala  |  7 +-
 .../execution/datasources/v2/PushDownUtils.scala   |  2 +-
 .../datasources/v2/V2ScanRelationPushDown.scala    | 24 --
 .../execution/datasources/v2/jdbc/JDBCScan.scala   | 12 ++-
 .../datasources/v2/jdbc/JDBCScanBuilder.scala      | 88 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala    | 30 
 13 files changed, 121 insertions(+), 116 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new a9c5b1a5  [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
a9c5b1a5 is described below

commit a9c5b1a5c85a584ad866badcb35067713139b0bc
Author: itholic
AuthorDate: Fri Jul 30 00:04:48 2021 -0700

    [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI

    ### What changes were proposed in this pull request?

    This PR proposes adding the Python packages `mlflow` and `sklearn` to enable the MLflow tests in pandas API on Spark.

    ### Why are the changes needed?

    To enable the MLflow tests in pandas API on Spark.

    ### Does this PR introduce _any_ user-facing change?

    No, it is test-only.

    ### How was this patch tested?

    Manually tested locally with `python/run-tests --testnames pyspark.pandas.mlflow`.

    Closes #33567 from itholic/SPARK-36254.

    Lead-authored-by: itholic
    Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com>
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit abce61f3fda73e865a80e9c38bf9ca471a6a5db8)
    Signed-off-by: Dongjoon Hyun
---
 .github/workflows/build_and_test.yml | 2 ++
 dev/requirements.txt                 | 3 ++-
 python/pyspark/pandas/mlflow.py      | 8 +---
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index cfc20ac..3eb12f5 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -227,6 +227,8 @@ jobs:
     # Run the tests.
     - name: Run tests
       run: |
+        # TODO(SPARK-36345): Install mlflow>=1.0 and sklearn in Python 3.9 of the base image
+        python3.9 -m pip install 'mlflow>=1.0' sklearn
         export PATH=$PATH:$HOME/miniconda/bin
         ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
     - name: Upload test results to report

diff --git a/dev/requirements.txt b/dev/requirements.txt
index f5d662b..34f4b88 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -7,7 +7,8 @@ pyarrow
 pandas
 scipy
 plotly
-mlflow
+mlflow>=1.0
+sklearn
 matplotlib<3.3.0

 # PySpark test dependencies

diff --git a/python/pyspark/pandas/mlflow.py b/python/pyspark/pandas/mlflow.py
index 719db40..4e48369 100644
--- a/python/pyspark/pandas/mlflow.py
+++ b/python/pyspark/pandas/mlflow.py
@@ -229,10 +229,4 @@ def _test() -> None:


 if __name__ == "__main__":
-    try:
-        import mlflow  # noqa: F401
-        import sklearn  # noqa: F401
-
-        _test()
-    except ImportError:
-        pass
+    _test()
[spark] branch master updated: [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new abce61f  [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
abce61f is described below

commit abce61f3fda73e865a80e9c38bf9ca471a6a5db8
Author: itholic
AuthorDate: Fri Jul 30 00:04:48 2021 -0700

    [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI

    ### What changes were proposed in this pull request?

    This PR proposes adding the Python packages `mlflow` and `sklearn` to enable the MLflow tests in pandas API on Spark.

    ### Why are the changes needed?

    To enable the MLflow tests in pandas API on Spark.

    ### Does this PR introduce _any_ user-facing change?

    No, it is test-only.

    ### How was this patch tested?

    Manually tested locally with `python/run-tests --testnames pyspark.pandas.mlflow`.

    Closes #33567 from itholic/SPARK-36254.

    Lead-authored-by: itholic
    Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com>
    Signed-off-by: Dongjoon Hyun
---
 .github/workflows/build_and_test.yml | 2 ++
 dev/requirements.txt                 | 3 ++-
 python/pyspark/pandas/mlflow.py      | 8 +---
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index f3a6363..17908ff 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -252,6 +252,8 @@ jobs:
     # Run the tests.
     - name: Run tests
       run: |
+        # TODO(SPARK-36345): Install mlflow>=1.0 and sklearn in Python 3.9 of the base image
+        python3.9 -m pip install 'mlflow>=1.0' sklearn
         export PATH=$PATH:$HOME/miniconda/bin
         ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
     - name: Upload test results to report

diff --git a/dev/requirements.txt b/dev/requirements.txt
index f5d662b..34f4b88 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -7,7 +7,8 @@ pyarrow
 pandas
 scipy
 plotly
-mlflow
+mlflow>=1.0
+sklearn
 matplotlib<3.3.0

 # PySpark test dependencies

diff --git a/python/pyspark/pandas/mlflow.py b/python/pyspark/pandas/mlflow.py
index 719db40..4e48369 100644
--- a/python/pyspark/pandas/mlflow.py
+++ b/python/pyspark/pandas/mlflow.py
@@ -229,10 +229,4 @@ def _test() -> None:


 if __name__ == "__main__":
-    try:
-        import mlflow  # noqa: F401
-        import sklearn  # noqa: F401
-
-        _test()
-    except ImportError:
-        pass
+    _test()