[spark] branch branch-3.1 updated: [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 71fec19  [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake
71fec19 is described below

commit 71fec1903ccc0ad485d8479a909ea06b3d7a36bc
Author: Holden Karau
AuthorDate: Thu Jul 22 15:17:48 2021 +0900

    [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake

    ### What changes were proposed in this pull request?
    GHA probably doesn't have the same resources as Jenkins, so move down from 5 to 3 executors and give them a bit more time to come up.

    ### Why are the changes needed?
    The test is timing out in GHA.

    ### Does this PR introduce _any_ user-facing change?
    No, test-only change.

    ### How was this patch tested?
    Ran through GHA; verified no OOM during WorkerDecommissionExtended.

    Closes #33467 from holdenk/SPARK-36246-WorkerDecommissionExtendedSuite-flakes-in-GHA.

    Lead-authored-by: Holden Karau
    Co-authored-by: Holden Karau
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 89a83196ac37617a8d19209ec1d7fea6b52d0f25)
    Signed-off-by: Hyukjin Kwon
---
 .../spark/scheduler/WorkerDecommissionExtendedSuite.scala | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
index 129eb8b..66d3cf2 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
@@ -31,17 +31,17 @@ import org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend
 class WorkerDecommissionExtendedSuite extends SparkFunSuite with LocalSparkContext {
   private val conf = new org.apache.spark.SparkConf()
     .setAppName(getClass.getName)
-    .set(SPARK_MASTER, "local-cluster[5,1,512]")
-    .set(EXECUTOR_MEMORY, "512m")
+    .set(SPARK_MASTER, "local-cluster[3,1,384]")
+    .set(EXECUTOR_MEMORY, "384m")
     .set(DYN_ALLOCATION_ENABLED, true)
     .set(DYN_ALLOCATION_SHUFFLE_TRACKING_ENABLED, true)
-    .set(DYN_ALLOCATION_INITIAL_EXECUTORS, 5)
+    .set(DYN_ALLOCATION_INITIAL_EXECUTORS, 3)
     .set(DECOMMISSION_ENABLED, true)

   test("Worker decommission and executor idle timeout") {
     sc = new SparkContext(conf.set(DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT.key, "10s"))
     withSpark(sc) { sc =>
-      TestUtils.waitUntilExecutorsUp(sc, 5, 6)
+      TestUtils.waitUntilExecutorsUp(sc, 3, 8)
       val rdd1 = sc.parallelize(1 to 10, 2)
       val rdd2 = rdd1.map(x => (1, x))
       val rdd3 = rdd2.reduceByKey(_ + _)
@@ -53,10 +53,10 @@ class WorkerDecommissionExtendedSuite extends SparkFunSuite with LocalSparkConte
     }
   }

-  test("Decommission 4 executors from 5 executors in total") {
+  test("Decommission 2 executors from 3 executors in total") {
     sc = new SparkContext(conf)
     withSpark(sc) { sc =>
-      TestUtils.waitUntilExecutorsUp(sc, 5, 6)
+      TestUtils.waitUntilExecutorsUp(sc, 3, 8)
       val rdd1 = sc.parallelize(1 to 10, 200)
       val rdd2 = rdd1.map(x => (x % 100, x))
       val rdd3 = rdd2.reduceByKey(_ + _)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
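The fix above gives executors more time to register before the suite proceeds. As an illustration only (not Spark's actual `TestUtils.waitUntilExecutorsUp` implementation), the underlying pattern — poll a condition until a deadline, then fail — can be sketched in stdlib Python; the helper name and timings here are assumptions:

```python
import time

def wait_until(predicate, timeout_s, poll_interval_s=0.1):
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")

# Simulated executor registration: the count reaches the target after a few polls.
state = {"executors": 0}

def executors_up(target):
    state["executors"] += 1  # stand-in for executors registering over time
    return state["executors"] >= target

assert wait_until(lambda: executors_up(3), timeout_s=5)
```

Lowering the executor count and raising the timeout both reduce flakiness on a resource-constrained runner: fewer conditions have to become true, and each has longer to do so.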
[spark] branch branch-3.2 updated: [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new e9dd296  [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake
e9dd296 is described below

commit e9dd2969c2f599252fb84fac4c239d0055b0ff4e
Author: Holden Karau
AuthorDate: Thu Jul 22 15:17:48 2021 +0900

    [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake

    ### What changes were proposed in this pull request?
    GHA probably doesn't have the same resources as Jenkins, so move down from 5 to 3 executors and give them a bit more time to come up.

    ### Why are the changes needed?
    The test is timing out in GHA.

    ### Does this PR introduce _any_ user-facing change?
    No, test-only change.

    ### How was this patch tested?
    Ran through GHA; verified no OOM during WorkerDecommissionExtended.

    Closes #33467 from holdenk/SPARK-36246-WorkerDecommissionExtendedSuite-flakes-in-GHA.

    Lead-authored-by: Holden Karau
    Co-authored-by: Holden Karau
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 89a83196ac37617a8d19209ec1d7fea6b52d0f25)
    Signed-off-by: Hyukjin Kwon
---
 .../spark/scheduler/WorkerDecommissionExtendedSuite.scala | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
index 129eb8b..66d3cf2 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
@@ -31,17 +31,17 @@ import org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend
 class WorkerDecommissionExtendedSuite extends SparkFunSuite with LocalSparkContext {
   private val conf = new org.apache.spark.SparkConf()
     .setAppName(getClass.getName)
-    .set(SPARK_MASTER, "local-cluster[5,1,512]")
-    .set(EXECUTOR_MEMORY, "512m")
+    .set(SPARK_MASTER, "local-cluster[3,1,384]")
+    .set(EXECUTOR_MEMORY, "384m")
     .set(DYN_ALLOCATION_ENABLED, true)
     .set(DYN_ALLOCATION_SHUFFLE_TRACKING_ENABLED, true)
-    .set(DYN_ALLOCATION_INITIAL_EXECUTORS, 5)
+    .set(DYN_ALLOCATION_INITIAL_EXECUTORS, 3)
     .set(DECOMMISSION_ENABLED, true)

   test("Worker decommission and executor idle timeout") {
     sc = new SparkContext(conf.set(DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT.key, "10s"))
     withSpark(sc) { sc =>
-      TestUtils.waitUntilExecutorsUp(sc, 5, 6)
+      TestUtils.waitUntilExecutorsUp(sc, 3, 8)
       val rdd1 = sc.parallelize(1 to 10, 2)
       val rdd2 = rdd1.map(x => (1, x))
       val rdd3 = rdd2.reduceByKey(_ + _)
@@ -53,10 +53,10 @@ class WorkerDecommissionExtendedSuite extends SparkFunSuite with LocalSparkConte
     }
   }

-  test("Decommission 4 executors from 5 executors in total") {
+  test("Decommission 2 executors from 3 executors in total") {
     sc = new SparkContext(conf)
     withSpark(sc) { sc =>
-      TestUtils.waitUntilExecutorsUp(sc, 5, 6)
+      TestUtils.waitUntilExecutorsUp(sc, 3, 8)
       val rdd1 = sc.parallelize(1 to 10, 200)
       val rdd2 = rdd1.map(x => (x % 100, x))
       val rdd3 = rdd2.reduceByKey(_ + _)
[spark] branch master updated (dcc0aaa -> 89a8319)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from dcc0aaa  [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex
 add 89a8319  [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake

No new revisions were added by this update.

Summary of changes:
 .../spark/scheduler/WorkerDecommissionExtendedSuite.scala | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new f83a9ec  [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex
f83a9ec is described below

commit f83a9ec2fd71b6e808b6a70c9fd5d627cfd6ecfe
Author: Takuya UESHIN
AuthorDate: Wed Jul 21 22:34:04 2021 -0700

    [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex

    ### What changes were proposed in this pull request?
    Add `add_categories` to `CategoricalAccessor` and `CategoricalIndex`.

    ### Why are the changes needed?
    We should implement `add_categories` in `CategoricalAccessor` and `CategoricalIndex`.

    ### Does this PR introduce _any_ user-facing change?
    Yes, users will be able to use `add_categories`.

    ### How was this patch tested?
    Added some tests.

    Closes #33470 from ueshin/issues/SPARK-36214/add_categories.

    Authored-by: Takuya UESHIN
    Signed-off-by: Takuya UESHIN
    (cherry picked from commit dcc0aaa3efb2d441b2dfadb0c64dbc28ee197de5)
    Signed-off-by: Takuya UESHIN
---
 .../source/reference/pyspark.pandas/indexing.rst   |  1 +
 .../source/reference/pyspark.pandas/series.rst     |  1 +
 python/pyspark/pandas/categorical.py               | 84 ++--
 python/pyspark/pandas/indexes/category.py          | 46 ++++
 python/pyspark/pandas/missing/indexes.py           |  1 -
 .../pyspark/pandas/tests/indexes/test_category.py  | 12 ++++
 python/pyspark/pandas/tests/test_categorical.py    | 18 +
 7 files changed, 158 insertions(+), 5 deletions(-)

diff --git a/python/docs/source/reference/pyspark.pandas/indexing.rst b/python/docs/source/reference/pyspark.pandas/indexing.rst
index 4f84d91..b0b4cdd 100644
--- a/python/docs/source/reference/pyspark.pandas/indexing.rst
+++ b/python/docs/source/reference/pyspark.pandas/indexing.rst
@@ -175,6 +175,7 @@ Categorical components
    CategoricalIndex.codes
    CategoricalIndex.categories
    CategoricalIndex.ordered
+   CategoricalIndex.add_categories
    CategoricalIndex.as_ordered
    CategoricalIndex.as_unordered

diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst
index b718d79..6243a22 100644
--- a/python/docs/source/reference/pyspark.pandas/series.rst
+++ b/python/docs/source/reference/pyspark.pandas/series.rst
@@ -401,6 +401,7 @@ the ``Series.cat`` accessor.
    Series.cat.categories
    Series.cat.ordered
    Series.cat.codes
+   Series.cat.add_categories
    Series.cat.as_ordered
    Series.cat.as_unordered

diff --git a/python/pyspark/pandas/categorical.py b/python/pyspark/pandas/categorical.py
index aeba20d..a83c3c7 100644
--- a/python/pyspark/pandas/categorical.py
+++ b/python/pyspark/pandas/categorical.py
@@ -14,10 +14,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-from typing import List, Optional, Union, TYPE_CHECKING, cast
+from typing import Any, List, Optional, Union, TYPE_CHECKING, cast

 import pandas as pd
-from pandas.api.types import CategoricalDtype
+from pandas.api.types import CategoricalDtype, is_list_like

 from pyspark.pandas.internal import InternalField
 from pyspark.sql.types import StructField
@@ -165,8 +165,84 @@ class CategoricalAccessor(object):
             ),
         ).rename()

-    def add_categories(self, new_categories: pd.Index, inplace: bool = False) -> "ps.Series":
-        raise NotImplementedError()
+    def add_categories(
+        self, new_categories: Union[pd.Index, Any, List], inplace: bool = False
+    ) -> Optional["ps.Series"]:
+        """
+        Add new categories.
+
+        `new_categories` will be included at the last/highest place in the
+        categories and will be unused directly after this call.
+
+        Parameters
+        ----------
+        new_categories : category or list-like of category
+            The new categories to be included.
+        inplace : bool, default False
+            Whether or not to add the categories inplace or return a copy of
+            this categorical with added categories.
+
+        Returns
+        -------
+        Series or None
+            Categorical with new categories added or None if ``inplace=True``.
+
+        Raises
+        ------
+        ValueError
+            If the new categories include old categories or do not validate as
+            categories
+
+        Examples
+        --------
+        >>> s = ps.Series(list("abbccc"), dtype="category")
+        >>> s  # doctest: +SKIP
+        0    a
+        1    b
+        2    b
+        3    c
+        4    c
+        5    c
+        dtype: category
+        Categories (3, object): ['a', 'b', 'c']
+
+        >>> s.cat.add_ca
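The behaviour being ported mirrors pandas: a categorical is stored as integer codes plus an ordered list of category labels, and `add_categories` appends new, as-yet-unused labels without touching the codes. A stdlib-only sketch of that invariant (a simplified model for illustration, not the pyspark.pandas implementation):

```python
def add_categories(codes, categories, new_categories):
    """Return (codes, categories) with new_categories appended.

    Mirrors the pandas contract: the new categories must not already exist,
    and the existing codes are left untouched (new categories are unused).
    """
    if not isinstance(new_categories, list):
        new_categories = [new_categories]
    overlap = set(new_categories) & set(categories)
    if overlap:
        raise ValueError(f"new categories must not include old categories: {overlap}")
    return codes, categories + new_categories

# "abbccc" encoded against categories ['a', 'b', 'c']
codes, cats = [0, 1, 1, 2, 2, 2], ["a", "b", "c"]
codes, cats = add_categories(codes, cats, "x")
assert cats == ["a", "b", "c", "x"]
assert codes == [0, 1, 1, 2, 2, 2]  # codes unchanged; 'x' is unused
```

Because only metadata changes, the real implementation can rewrite the column's `CategoricalDtype` without touching the underlying Spark data.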
[spark] branch master updated (f3e2957 -> dcc0aaa)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from f3e2957  [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package
 add dcc0aaa  [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex

No new revisions were added by this update.

Summary of changes:
 .../source/reference/pyspark.pandas/indexing.rst   |  1 +
 .../source/reference/pyspark.pandas/series.rst     |  1 +
 python/pyspark/pandas/categorical.py               | 84 ++--
 python/pyspark/pandas/indexes/category.py          | 46 ++++
 python/pyspark/pandas/missing/indexes.py           |  1 -
 .../pyspark/pandas/tests/indexes/test_category.py  | 12 ++++
 python/pyspark/pandas/tests/test_categorical.py    | 18 +
 7 files changed, 158 insertions(+), 5 deletions(-)
[spark] branch master updated (de8e4be -> f3e2957)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from de8e4be  [SPARK-36063][SQL] Optimize OneRowRelation subqueries
 add f3e2957  [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/__init__.py | 6 ++
 1 file changed, 6 insertions(+)
[spark] branch branch-3.2 updated: [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new c42866e  [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package
c42866e is described below

commit c42866e627f2950934803b0c5f6ec34a5dbc9d70
Author: Hyukjin Kwon
AuthorDate: Thu Jul 22 14:21:43 2021 +0900

    [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package

    ### What changes were proposed in this pull request?
    This PR adds the version that added pandas API on Spark in PySpark documentation.

    ### Why are the changes needed?
    To document the version added.

    ### Does this PR introduce _any_ user-facing change?
    No to end user. Spark 3.2 is not released yet.

    ### How was this patch tested?
    Linter and documentation build.

    Closes #33473 from HyukjinKwon/SPARK-36253.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit f3e29574d9deb30df4d8d1fbd41a702bf2b15b5f)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/__init__.py | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/python/pyspark/pandas/__init__.py b/python/pyspark/pandas/__init__.py
index cf078ed..ea8a9ea 100644
--- a/python/pyspark/pandas/__init__.py
+++ b/python/pyspark/pandas/__init__.py
@@ -14,6 +14,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+
+"""
+.. versionadded:: 3.2.0
+    pandas API on Spark
+"""
+
 import os
 import sys
 from distutils.version import LooseVersion
[spark] branch branch-3.2 updated (d01e532 -> 31bb9e0)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.

from d01e532  [SPARK-36251][INFRA][BUILD][3.2] Cover GitHub Actions runs without SHA in testing script
 add 31bb9e0  [SPARK-36063][SQL] Optimize OneRowRelation subqueries

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/dsl/package.scala    |   7 +
 .../spark/sql/catalyst/expressions/subquery.scala  |   6 +-
 .../catalyst/optimizer/DecorrelateInnerQuery.scala |  27 +++-
 .../spark/sql/catalyst/optimizer/Optimizer.scala   |   1 +
 .../spark/sql/catalyst/optimizer/subquery.scala    |  45 ++
 .../spark/sql/catalyst/plans/QueryPlan.scala       |  17 +++
 .../org/apache/spark/sql/internal/SQLConf.scala    |   8 +
 .../optimizer/DecorrelateInnerQuerySuite.scala     |  20 +--
 .../OptimizeOneRowRelationSubquerySuite.scala      | 165 +
 .../scala/org/apache/spark/sql/SubquerySuite.scala |   3 +-
 10 files changed, 283 insertions(+), 16 deletions(-)
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowRelationSubquerySuite.scala
[spark] branch master updated (dcb7db5 -> de8e4be)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from dcb7db5  [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation
 add de8e4be  [SPARK-36063][SQL] Optimize OneRowRelation subqueries

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/dsl/package.scala    |   7 +
 .../spark/sql/catalyst/expressions/subquery.scala  |   6 +-
 .../catalyst/optimizer/DecorrelateInnerQuery.scala |  27 +++-
 .../spark/sql/catalyst/optimizer/Optimizer.scala   |   1 +
 .../spark/sql/catalyst/optimizer/subquery.scala    |  45 ++
 .../spark/sql/catalyst/plans/QueryPlan.scala       |  17 +++
 .../org/apache/spark/sql/internal/SQLConf.scala    |   8 +
 .../optimizer/DecorrelateInnerQuerySuite.scala     |  20 +--
 .../OptimizeOneRowRelationSubquerySuite.scala      | 165 +
 .../scala/org/apache/spark/sql/SubquerySuite.scala |   3 +-
 10 files changed, 283 insertions(+), 16 deletions(-)
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowRelationSubquerySuite.scala
[spark] branch branch-3.2 updated: [SPARK-36251][INFRA][BUILD][3.2] Cover GitHub Actions runs without SHA in testing script
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new d01e532  [SPARK-36251][INFRA][BUILD][3.2] Cover GitHub Actions runs without SHA in testing script
d01e532 is described below

commit d01e53208b75f14be80097b7cd8667bf0122acfc
Author: Hyukjin Kwon
AuthorDate: Thu Jul 22 11:47:36 2021 +0900

    [SPARK-36251][INFRA][BUILD][3.2] Cover GitHub Actions runs without SHA in testing script

    ### What changes were proposed in this pull request?
    This PR partially backports the fix in the script at https://github.com/apache/spark/pull/33410 to make the branch-3.2 build pass at https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=event%3Aschedule

    ### Why are the changes needed?
    To make the Scala 2.13 periodical job pass.

    ### Does this PR introduce _any_ user-facing change?
    No, dev-only.

    ### How was this patch tested?
    It is a logically non-conflicting backport.

    Closes #33472 from HyukjinKwon/SPARK-36251.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
---
 dev/run-tests.py | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/dev/run-tests.py b/dev/run-tests.py
index 2f05077..def0948 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -693,15 +693,19 @@ def main():
     included_tags = []
     excluded_tags = []
     if should_only_test_modules:
+        # We're likely in the forked repository
+        is_apache_spark_ref = os.environ.get("APACHE_SPARK_REF", "") != ""
+        # We're likely in the main repo build.
+        is_github_prev_sha = os.environ.get("GITHUB_PREV_SHA", "") != ""
+        # Otherwise, we're in either periodic job in Github Actions or somewhere else.
+
         # If we're running the tests in GitHub Actions, attempt to detect and test
         # only the affected modules.
-        if test_env == "github_actions":
-            if "APACHE_SPARK_REF" in os.environ and os.environ["APACHE_SPARK_REF"] != "":
-                # Fork repository
+        if test_env == "github_actions" and (is_apache_spark_ref or is_github_prev_sha):
+            if is_apache_spark_ref:
                 changed_files = identify_changed_files_from_git_commits(
                     "HEAD", target_ref=os.environ["APACHE_SPARK_REF"])
-            else:
-                # Build for each commit.
+            elif is_github_prev_sha:
                 changed_files = identify_changed_files_from_git_commits(
                     os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
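The patch turns an if/else that assumed one of two SHAs was always present into a three-way classification: fork build, main-repo build, or a scheduled run with neither variable set. A self-contained sketch of that decision (function name and return values are illustrative, not the actual `run-tests.py` API):

```python
import os

def detect_changed_files_source(test_env, environ=None):
    """Classify how the test script should pick a git diff base.

    Returns "fork" when APACHE_SPARK_REF is set, "main" when GITHUB_PREV_SHA
    is set, and None for runs (e.g. scheduled jobs) that carry neither.
    """
    if environ is None:
        environ = os.environ
    is_apache_spark_ref = environ.get("APACHE_SPARK_REF", "") != ""   # forked repository
    is_github_prev_sha = environ.get("GITHUB_PREV_SHA", "") != ""     # main repo build
    if test_env == "github_actions" and (is_apache_spark_ref or is_github_prev_sha):
        return "fork" if is_apache_spark_ref else "main"
    return None  # periodic job without a SHA, or not GitHub Actions at all

assert detect_changed_files_source("github_actions", {"APACHE_SPARK_REF": "master"}) == "fork"
assert detect_changed_files_source("github_actions", {"GITHUB_PREV_SHA": "abc123"}) == "main"
assert detect_changed_files_source("github_actions", {}) is None  # scheduled run: no SHA
```

The `None` branch is what makes the scheduled Scala 2.13 job pass: instead of failing on a missing environment variable, the script falls through to testing all modules.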
[spark] branch branch-3.2 updated: [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new fef7bf9  [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation
fef7bf9 is described below

commit fef7bf9fcc22e0e26a97a705b11f14ae36907d0c
Author: Kousuke Saruta
AuthorDate: Wed Jul 21 19:37:05 2021 -0700

    [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation

    ### What changes were proposed in this pull request?
    This PR upgrades `zstd-jni` from `1.5.0-2` to `1.5.0-3`. `1.5.0-3` was released a few days ago.
    This release resolves an issue about buffer size calculation, which can affect usage in Spark.
    https://github.com/luben/zstd-jni/releases/tag/v1.5.0-3

    ### Why are the changes needed?
    It might be a corner case that the skipping length is greater than `2^31 - 1`, but it's possible to affect Spark.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    CI.

    Closes #33464 from sarutak/upgrade-zstd-jni-1.5.0-3.

    Authored-by: Kousuke Saruta
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit dcb7db5370cffc1c309671195a325bef7829ec10)
    Signed-off-by: Dongjoon Hyun
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 2 +-
 pom.xml                                 | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 818899a..e0a4af0 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -243,4 +243,4 @@ xz/1.8//xz-1.8.jar
 zjsonpatch/0.3.0//zjsonpatch-0.3.0.jar
 zookeeper-jute/3.6.2//zookeeper-jute-3.6.2.jar
 zookeeper/3.6.2//zookeeper-3.6.2.jar
-zstd-jni/1.5.0-2//zstd-jni-1.5.0-2.jar
+zstd-jni/1.5.0-3//zstd-jni-1.5.0-3.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index bd80eb9..49c362c 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -211,4 +211,4 @@ xz/1.8//xz-1.8.jar
 zjsonpatch/0.3.0//zjsonpatch-0.3.0.jar
 zookeeper-jute/3.6.2//zookeeper-jute-3.6.2.jar
 zookeeper/3.6.2//zookeeper-3.6.2.jar
-zstd-jni/1.5.0-2//zstd-jni-1.5.0-2.jar
+zstd-jni/1.5.0-3//zstd-jni-1.5.0-3.jar
diff --git a/pom.xml b/pom.xml
index 3054401..5dc4cd4 100644
--- a/pom.xml
+++ b/pom.xml
@@ -709,7 +709,7 @@
       <dependency>
         <groupId>com.github.luben</groupId>
         <artifactId>zstd-jni</artifactId>
-        <version>1.5.0-2</version>
+        <version>1.5.0-3</version>
       </dependency>
       <dependency>
         <groupId>com.clearspring.analytics</groupId>
[spark] branch master updated (09bebc8 -> dcb7db5)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 09bebc8  [SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv`
 add dcb7db5  [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 2 +-
 pom.xml                                 | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
[spark] branch master updated (ad528a0 -> 09bebc8)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from ad528a0  [SPARK-32797][SPARK-32391][SPARK-33242][SPARK-32666][ANSIBLE] updating a bunch of python packages
 add 09bebc8  [SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv`

No new revisions were added by this update.

Summary of changes:
 docs/sql-migration-guide.md                        |  4 
 .../org/apache/spark/sql/DataFrameReader.scala     |  4 ++--
 .../sql/execution/datasources/csv/CSVSuite.scala   | 22 ++-
 .../sql/execution/datasources/json/JsonSuite.scala | 25 ++
 4 files changed, 52 insertions(+), 3 deletions(-)
[spark] branch master updated (d506815 -> ad528a0)
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from d506815  [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex
 add ad528a0  [SPARK-32797][SPARK-32391][SPARK-33242][SPARK-32666][ANSIBLE] updating a bunch of python packages

No new revisions were added by this update.

Summary of changes:
 .../files/python_environments/spark-py36-spec.txt | 64 +-
 1 file changed, 61 insertions(+), 3 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 24095bf  [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex
24095bf is described below

commit 24095bfb07f33e5bc8b9d39f2104fc5abb539d31
Author: Takuya UESHIN
AuthorDate: Wed Jul 21 11:31:30 2021 -0700

    [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex

    ### What changes were proposed in this pull request?
    Add categories setter to `CategoricalAccessor` and `CategoricalIndex`.

    ### Why are the changes needed?
    We should implement the categories setter in `CategoricalAccessor` and `CategoricalIndex`.

    ### Does this PR introduce _any_ user-facing change?
    Yes, users will be able to use the categories setter.

    ### How was this patch tested?
    Added some tests.

    Closes #33448 from ueshin/issues/SPARK-36188/categories_setter.

    Authored-by: Takuya UESHIN
    Signed-off-by: Takuya UESHIN
    (cherry picked from commit d506815a92510ff4cbd2c14cf17d41202f3ed609)
    Signed-off-by: Takuya UESHIN
---
 python/pyspark/pandas/categorical.py               | 18 ++++---
 python/pyspark/pandas/indexes/category.py          | 16 ++---
 .../pyspark/pandas/tests/indexes/test_category.py  | 27 ++++++
 python/pyspark/pandas/tests/test_categorical.py    | 14 +++
 4 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/pandas/categorical.py b/python/pyspark/pandas/categorical.py
index b8cc88c..aeba20d 100644
--- a/python/pyspark/pandas/categorical.py
+++ b/python/pyspark/pandas/categorical.py
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-from typing import Optional, TYPE_CHECKING, cast
+from typing import List, Optional, Union, TYPE_CHECKING, cast

 import pandas as pd
 from pandas.api.types import CategoricalDtype
@@ -89,8 +89,20 @@ class CategoricalAccessor(object):
         return self._dtype.categories

     @categories.setter
-    def categories(self, categories: pd.Index) -> None:
-        raise NotImplementedError()
+    def categories(self, categories: Union[pd.Index, List]) -> None:
+        dtype = CategoricalDtype(categories, ordered=self.ordered)
+
+        if len(self.categories) != len(dtype.categories):
+            raise ValueError(
+                "new categories need to have the same number of items as the old categories!"
+            )
+
+        internal = self._data._psdf._internal.with_new_spark_column(
+            self._data._column_label,
+            self._data.spark.column,
+            field=self._data._internal.data_fields[0].copy(dtype=dtype),
+        )
+        self._data._psdf._update_internal_frame(internal)

     @property
     def ordered(self) -> bool:
diff --git a/python/pyspark/pandas/indexes/category.py b/python/pyspark/pandas/indexes/category.py
index a7ad2a0..1b65886 100644
--- a/python/pyspark/pandas/indexes/category.py
+++ b/python/pyspark/pandas/indexes/category.py
@@ -15,7 +15,7 @@
 # limitations under the License.
 #
 from functools import partial
-from typing import Any, Optional, cast, no_type_check
+from typing import Any, List, Optional, Union, cast, no_type_check

 import pandas as pd
 from pandas.api.types import is_hashable, CategoricalDtype
@@ -174,8 +174,18 @@ class CategoricalIndex(Index):
         return self.dtype.categories

     @categories.setter
-    def categories(self, categories: pd.Index) -> None:
-        raise NotImplementedError()
+    def categories(self, categories: Union[pd.Index, List]) -> None:
+        dtype = CategoricalDtype(categories, ordered=self.ordered)
+
+        if len(self.categories) != len(dtype.categories):
+            raise ValueError(
+                "new categories need to have the same number of items as the old categories!"
+            )
+
+        internal = self._psdf._internal.copy(
+            index_fields=[self._internal.index_fields[0].copy(dtype=dtype)]
+        )
+        self._psdf._update_internal_frame(internal)

     @property
     def ordered(self) -> bool:
diff --git a/python/pyspark/pandas/tests/indexes/test_category.py b/python/pyspark/pandas/tests/indexes/test_category.py
index 02752ec..d04f896 100644
--- a/python/pyspark/pandas/tests/indexes/test_category.py
+++ b/python/pyspark/pandas/tests/indexes/test_category.py
@@ -67,6 +67,33 @@ class CategoricalIndexTest(PandasOnSparkTestCase, TestUtils):
         self.assert_eq(psidx.codes, pd.Index(pidx.codes))
         self.assert_eq(psidx.ordered, pidx.ordered)

+    def test_categories_setter(self):
+        pdf = pd.DataFrame(
+            {
+                "a": pd.Categorical([1, 2, 3, 1, 2, 3]),
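The key contract of the setter shown above is the length check: assigning categories relabels the existing codes rather than re-encoding the data, so the new label list must have exactly as many items as the old one. A stdlib-only sketch of that semantics (a simplified model for illustration, not the pyspark.pandas internals):

```python
def set_categories(codes, old_categories, new_categories):
    """Relabel a categorical: codes stay put, labels are swapped one-for-one."""
    # Same-length check mirrors the ValueError raised by the new setter.
    if len(old_categories) != len(new_categories):
        raise ValueError(
            "new categories need to have the same number of items as the old categories!")
    return codes, list(new_categories)  # codes are reinterpreted against the new labels

codes, cats = [0, 1, 2, 0, 1, 2], ["1", "2", "3"]
codes, cats = set_categories(codes, cats, ["x", "y", "z"])
assert cats == ["x", "y", "z"]
assert [cats[c] for c in codes] == ["x", "y", "z", "x", "y", "z"]
```

Because only the dtype metadata changes, the real implementation can swap in a new `CategoricalDtype` via the internal frame without touching the Spark column data.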
[spark] branch master updated (4cd6cfc -> d506815)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 4cd6cfc  [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
 add d506815  [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/categorical.py               | 18 ++++---
 python/pyspark/pandas/indexes/category.py          | 16 ++---
 .../pyspark/pandas/tests/indexes/test_category.py  | 27 ++++++
 python/pyspark/pandas/tests/test_categorical.py    | 14 +++
 4 files changed, 69 insertions(+), 6 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36208][SQL][3.2] SparkScriptTransformation should support ANSI interval types
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 468165a [SPARK-36208][SQL][3.2] SparkScriptTransformation should support ANSI interval types 468165a is described below commit 468165ae52aa788e3fa59f385225b90c616bfa0f Author: Kousuke Saruta AuthorDate: Wed Jul 21 20:54:18 2021 +0300 [SPARK-36208][SQL][3.2] SparkScriptTransformation should support ANSI interval types ### What changes were proposed in this pull request? This PR changes `BaseScriptTransformationExec` for `SparkScriptTransformationExec` to support ANSI interval types. ### Why are the changes needed? `SparkScriptTransformationExec` support `CalendarIntervalType` so it's better to support ANSI interval types as well. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Authored-by: Kousuke Saruta Signed-off-by: Max Gekk (cherry picked from commit f56c7b71ff27e6f5379f3699c2dcb5f79a0ae791) Signed-off-by: Max Gekk Closes #33463 from MaxGekk/sarutak_script-transformation-interval-3.2. 
Authored-by: Kousuke Saruta Signed-off-by: Max Gekk --- .../execution/BaseScriptTransformationExec.scala| 8 .../execution/BaseScriptTransformationSuite.scala | 21 + 2 files changed, 29 insertions(+) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala index 7835981..e249cd6 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala @@ -223,6 +223,14 @@ trait BaseScriptTransformationExec extends UnaryExecNode { case CalendarIntervalType => wrapperConvertException( data => IntervalUtils.stringToInterval(UTF8String.fromString(data)), converter) + case YearMonthIntervalType(start, end) => wrapperConvertException( +data => IntervalUtils.monthsToPeriod( + IntervalUtils.castStringToYMInterval(UTF8String.fromString(data), start, end)), +converter) + case DayTimeIntervalType(start, end) => wrapperConvertException( +data => IntervalUtils.microsToDuration( + IntervalUtils.castStringToDTInterval(UTF8String.fromString(data), start, end)), +converter) case _: ArrayType | _: MapType | _: StructType => val complexTypeFactory = JsonToStructs(attr.dataType, ioschema.outputSerdeProps.toMap, Literal(null), Some(conf.sessionLocalTimeZone)) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala index c845dd81..9d8fcda 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala @@ -633,6 +633,27 @@ abstract class BaseScriptTransformationSuite extends SparkPlanTest with SQLTestU } } } + + test("SPARK-36208: TRANSFORM should support 
ANSI interval (no serde)") { +assume(TestUtils.testCommandAvailable("python")) +withTempView("v") { + val df = Seq( +(Period.of(1, 2, 0), Duration.ofDays(1).plusHours(2).plusMinutes(3).plusSeconds(4)) + ).toDF("ym", "dt") + + checkAnswer( +df, +(child: SparkPlan) => createScriptTransformationExec( + script = "cat", + output = Seq( +AttributeReference("ym", YearMonthIntervalType())(), +AttributeReference("dt", DayTimeIntervalType())()), + child = child, + ioschema = defaultIOSchema +), +df.select($"ym", $"dt").collect()) +} + } } case class ExceptionInjectingOperator(child: SparkPlan) extends UnaryExecNode {
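The new converter cases hand the raw strings emitted by the user script (here just `cat`) to `castStringToYMInterval`/`castStringToDTInterval`, turning them back into interval values. As a rough illustration of that string-to-interval step, here is a Python sketch that parses the two ANSI interval forms into total months and total microseconds; the accepted formats below are simplified assumptions for illustration, not Spark's full parser:

```python
import re

def parse_year_month_interval(s: str) -> int:
    # Parse a year-month interval string like '1-2' into total months.
    # Illustrative only: Spark's castStringToYMInterval accepts more forms
    # and validates the requested start/end field range.
    m = re.fullmatch(r"([+-]?)(\d+)-(\d+)", s.strip())
    if not m:
        raise ValueError(f"Cannot parse year-month interval: {s!r}")
    sign = -1 if m.group(1) == "-" else 1
    years, months = int(m.group(2)), int(m.group(3))
    return sign * (years * 12 + months)

def parse_day_time_interval(s: str) -> int:
    # Parse a day-time interval like '1 02:03:04' into total microseconds.
    m = re.fullmatch(r"([+-]?)(\d+) (\d+):(\d+):(\d+(?:\.\d+)?)", s.strip())
    if not m:
        raise ValueError(f"Cannot parse day-time interval: {s!r}")
    sign = -1 if m.group(1) == "-" else 1
    days, hours, minutes = int(m.group(2)), int(m.group(3)), int(m.group(4))
    seconds = float(m.group(5))
    total_us = ((days * 24 + hours) * 60 + minutes) * 60 * 1_000_000
    total_us += round(seconds * 1_000_000)
    return sign * total_us
```

The test data above (`Period.of(1, 2, 0)` and a 1d 2h 3m 4s `Duration`) corresponds to `parse_year_month_interval("1-2")` and `parse_day_time_interval("1 02:03:04")` in this sketch.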
[spark] branch branch-3.2 updated: [SPARK-36227][SQL][3.2] Remove TimestampNTZ type support in Spark 3.2
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 99eb3ff [SPARK-36227][SQL][3.2] Remove TimestampNTZ type support in Spark 3.2 99eb3ff is described below commit 99eb3ff226dd62691211db0ed9184195e702e20c Author: Gengliang Wang AuthorDate: Wed Jul 21 09:55:09 2021 -0700 [SPARK-36227][SQL][3.2] Remove TimestampNTZ type support in Spark 3.2 ### What changes were proposed in this pull request? Remove TimestampNTZ type support in the production code of Spark 3.2. To achieve the goal, this PR adds the check "Utils.isTesting" in the following code branches: - keyword "timestamp_ntz" and "timestamp_ltz" in parser - New expressions from https://issues.apache.org/jira/browse/SPARK-35662 - Using java.time.LocalDateTime as the external type for TimestampNTZType - `SQLConf.timestampType` which determines the default timestamp type of Spark SQL. This is to minimize the code difference from the master branch, so that future users won't think TimestampNTZ is already available in Spark 3.2. The downside is that users can still find TimestampNTZType under package `org.apache.spark.sql.types`. There should be nothing left other than this. ### Why are the changes needed? As of now, there are some blockers for delivering the TimestampNTZ project in Spark 3.2: - In the Hive Thrift server, both TimestampType and TimestampNTZType are mapped to the same timestamp type, which can cause confusion for users. - For the Parquet data source, the new written TimestampNTZType Parquet columns will be read as TimestampType in old Spark releases. Also, we need to decide the merge schema for files mixed with TimestampType and TimestampNTZ type. - The type coercion rules for TimestampNTZType are incomplete.
For example, what should the data type of the IN clause `IN(Timestamp'2020-01-01 00:00:00', TimestampNtz'2020-01-01 00:00:00')` be? - It is tricky to support TimestampNTZType in JSON/CSV data readers. We need to avoid regressions as much as we can. There are 10 days left for the expected 3.2 RC date. So, I propose to **release the TimestampNTZ type in Spark 3.3 instead of Spark 3.2**, so that we have enough time to make considered designs for these issues. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing Unit tests + manual tests from spark-shell to validate the changes are gone. New functions ``` spark.sql("select to_timestamp_ntz'2021-01-01 00:00:00'").show() spark.sql("select to_timestamp_ltz'2021-01-01 00:00:00'").show() spark.sql("select make_timestamp_ntz(1,1,1,1,1,1)").show() spark.sql("select make_timestamp_ltz(1,1,1,1,1,1)").show() spark.sql("select localtimestamp()").show() ``` The SQL configuration `spark.sql.timestampType` should not work in 3.2 ``` spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ") spark.sql("select make_timestamp(1,1,1,1,1,1)").schema spark.sql("select to_timestamp('2021-01-01 00:00:00')").schema spark.sql("select timestamp'2021-01-01 00:00:00'").schema Seq((1, java.sql.Timestamp.valueOf("2021-01-01 00:00:00"))).toDF("i", "ts").write.partitionBy("ts").parquet("/tmp/test") spark.read.parquet("/tmp/test").schema ``` LocalDateTime is not supported as a built-in external type: ``` Seq(LocalDateTime.now()).toDF() org.apache.spark.sql.catalyst.expressions.Literal(java.time.LocalDateTime.now()) org.apache.spark.sql.catalyst.expressions.Literal(0L, TimestampNTZType) ``` Closes #33444 from gengliangwang/banNTZ.
Authored-by: Gengliang Wang Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/sql/Encoders.scala | 8 --- .../sql/catalyst/CatalystTypeConverters.scala | 4 +++- .../spark/sql/catalyst/JavaTypeInference.scala | 11 ++--- .../spark/sql/catalyst/ScalaReflection.scala | 12 +++--- .../sql/catalyst/analysis/FunctionRegistry.scala | 22 - .../apache/spark/sql/catalyst/dsl/package.scala| 4 .../spark/sql/catalyst/encoders/RowEncoder.scala | 10 +--- .../spark/sql/catalyst/expressions/literals.scala | 14 +++ .../spark/sql/catalyst/parser/AstBuilder.scala | 11 + .../org/apache/spark/sql/internal/SQLConf.scala| 15 +++- .../scala/org/apache/spark/sql/SQLImplicits.scala | 3 --- .../org/apache/spark/sql/JavaDatasetSuite.java | 8 --- .../apache/spark/sql/DataFrameAggregateSuite.scala | 13 +- .../scala/org/apache/spark/sql/DatasetSuite.scala | 5 .../test/scala/org/apache/spark/sql/UDFSuite.scala | 28 -- 15 files changed, 70 inserti
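Rather than deleting the TimestampNTZ code paths, the change guards each entry point with a `Utils.isTesting` check, which keeps the diff against master small while hiding the feature from users. A minimal Python sketch of that gating pattern follows; the function names and the exact environment check are illustrative of the idea, not Spark's implementation:

```python
import os

def is_testing() -> bool:
    # Spark's Utils.isTesting consults the testing flag (e.g. the
    # SPARK_TESTING environment variable); this sketch mirrors the idea.
    return os.environ.get("SPARK_TESTING") is not None

def resolve_timestamp_type(name: str) -> str:
    # Resolve a timestamp type keyword, hiding TIMESTAMP_NTZ outside tests
    # so the unreleased type cannot be reached from user code.
    supported = {"TIMESTAMP", "TIMESTAMP_LTZ"}
    if is_testing():
        supported.add("TIMESTAMP_NTZ")
    upper = name.upper()
    if upper not in supported:
        raise ValueError(f"Unsupported timestamp type: {name}")
    return upper
```

With the flag unset, `resolve_timestamp_type("timestamp_ntz")` raises, matching the intent that `spark.sql.timestampType` "should not work in 3.2"; with it set, the unit tests can still exercise the code.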
[spark] branch branch-3.0 updated (78ec701 -> 0cfea30)
This is an automated email from the ASF dual-hosted git repository. yao pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from 78ec701 [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables add 0cfea30 [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec No new revisions were added by this update. Summary of changes: .../spark/sql/execution/command/tables.scala | 7 - .../test/resources/sql-tests/inputs/describe.sql | 2 ++ .../resources/sql-tests/results/describe.sql.out | 33 +- 3 files changed, 40 insertions(+), 2 deletions(-)
[spark] branch branch-3.1 updated: [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new f818a9f [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec f818a9f is described below commit f818a9f43bd6e36fffdf788f2dfc13cd01873d02 Author: Kent Yao AuthorDate: Thu Jul 22 00:52:31 2021 +0800 [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec ### What changes were proposed in this pull request? This fixes a case sensitivity issue for desc table commands with partition specified. ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? yes, but it's a bugfix ### How was this patch tested? new tests before ``` +-- !query +DESC EXTENDED t PARTITION (C='Us', D=1) +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +Partition spec is invalid. The spec (C, D) must match the partition spec (c, d) defined in table '`default`.`t`' + ``` after https://github.com/apache/spark/pull/33424/files#diff-554189c49950974a948f99fa9b7436f615052511660c6a0ae3062fa8ca0a327cR328 Closes #33424 from yaooqinn/SPARK-36213. 
Authored-by: Kent Yao Signed-off-by: Kent Yao (cherry picked from commit 4cd6cfc773da726a90d41bfc590ea9188c17d5ae) Signed-off-by: Kent Yao --- .../spark/sql/execution/command/tables.scala | 7 - .../test/resources/sql-tests/inputs/describe.sql | 2 ++ .../resources/sql-tests/results/describe.sql.out | 33 +- 3 files changed, 40 insertions(+), 2 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala index f88200e..b557fd2 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala @@ -693,7 +693,12 @@ case class DescribeTableCommand( s"DESC PARTITION is not allowed on a view: ${table.identifier}") } DDLUtils.verifyPartitionProviderIsHive(spark, metadata, "DESC PARTITION") -val partition = catalog.getPartition(table, partitionSpec) +val normalizedPartSpec = PartitioningUtils.normalizePartitionSpec( + partitionSpec, + metadata.partitionSchema, + table.quotedString, + spark.sessionState.conf.resolver) +val partition = catalog.getPartition(table, normalizedPartSpec) if (isExtended) describeFormattedDetailedPartitionInfo(table, metadata, partition, result) } diff --git a/sql/core/src/test/resources/sql-tests/inputs/describe.sql b/sql/core/src/test/resources/sql-tests/inputs/describe.sql index a0ee932..deff5bb 100644 --- a/sql/core/src/test/resources/sql-tests/inputs/describe.sql +++ b/sql/core/src/test/resources/sql-tests/inputs/describe.sql @@ -43,6 +43,8 @@ DESC EXTENDED t PARTITION (c='Us', d=1); DESC FORMATTED t PARTITION (c='Us', d=1); +DESC EXTENDED t PARTITION (C='Us', D=1); + -- NoSuchPartitionException: Partition not found in table DESC t PARTITION (c='Us', d=2); diff --git a/sql/core/src/test/resources/sql-tests/results/describe.sql.out b/sql/core/src/test/resources/sql-tests/results/describe.sql.out index b0d650f..c14d312 100644 
--- a/sql/core/src/test/resources/sql-tests/results/describe.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/describe.sql.out @@ -1,5 +1,5 @@ -- Automatically generated by SQLQueryTestSuite --- Number of queries: 41 +-- Number of queries: 42 -- !query @@ -325,6 +325,37 @@ Storage Properties [a=1, b=2] -- !query +DESC EXTENDED t PARTITION (C='Us', D=1) +-- !query schema +struct +-- !query output +a string +b int +c string +d string +# Partition Information +# col_name data_type comment +c string +d string + +# Detailed Partition Information +Database default +Table t +Partition Values [c=Us, d=1] +Location [not included in c
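The fix routes the user-supplied spec through `PartitioningUtils.normalizePartitionSpec`, which matches each key against the table's partition schema with the session resolver (case-insensitive by default), so `PARTITION (C='Us', D=1)` resolves to the defined columns `(c, d)` before the catalog lookup. A simplified Python equivalent of that normalization step, assuming a case-insensitive resolver (details here are illustrative, not Spark's exact code):

```python
def normalize_partition_spec(spec, partition_columns, table_name):
    # Map user spec keys (e.g. 'C', 'D') onto the schema's column names
    # ('c', 'd') via case-insensitive matching; fail on unknown keys.
    normalized = {}
    for key, value in spec.items():
        matches = [c for c in partition_columns if c.lower() == key.lower()]
        if not matches:
            raise ValueError(
                f"{key} is not a valid partition column in table {table_name}")
        normalized[matches[0]] = value
    return normalized
```

After normalization the spec `(C, D)` matches the defined spec `(c, d)`, which is why the `DESC EXTENDED t PARTITION (C='Us', D=1)` query in the new golden file now succeeds instead of raising the AnalysisException shown above.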
[spark] branch branch-3.2 updated (1ce678b -> 7d36373)
This is an automated email from the ASF dual-hosted git repository. yao pushed a change to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git. from 1ce678b [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables add 7d36373 [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec No new revisions were added by this update. Summary of changes: .../spark/sql/execution/command/tables.scala | 7 - .../test/resources/sql-tests/inputs/describe.sql | 2 ++ .../resources/sql-tests/results/describe.sql.out | 33 +- 3 files changed, 40 insertions(+), 2 deletions(-)
[spark] branch master updated (685c3fd -> 4cd6cfc)
This is an automated email from the ASF dual-hosted git repository. yao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 685c3fd [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables add 4cd6cfc [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec No new revisions were added by this update. Summary of changes: .../spark/sql/execution/command/tables.scala | 7 - .../test/resources/sql-tests/inputs/describe.sql | 2 ++ .../resources/sql-tests/results/describe.sql.out | 33 +- 3 files changed, 40 insertions(+), 2 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 1ce678b [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables 1ce678b is described below commit 1ce678b2aa4d3701b3e743c3657685942c9abfcd Author: Shardul Mahadik AuthorDate: Wed Jul 21 22:40:39 2021 +0800 [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables ### What changes were proposed in this pull request? For non-datasource Hive tables, e.g. tables written outside of Spark (through Hive or Trino), we have certain optimizations in Spark where we use Spark ORC and Parquet datasources to read these tables ([Ref](https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L128)) rather than using the Hive serde. If such a table contains a `path` property, Spark will try to list this path property in addition to the table location when creating an `InMemoryFileIndex`. ([Ref](https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L575)) This can lead to wrong data if the `path` property points to a directory location, or an error if `path` is not a location. A concrete example is provided in [S [...] Since these tables were not written through Spark, Spark should not interpret this `path` property, as it can be set by an external system with a different meaning. ### Why are the changes needed? For better compatibility with Hive tables generated by other platforms (non-Spark) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test Closes #33328 from shardulm94/spark-28266.
Authored-by: Shardul Mahadik Signed-off-by: Wenchen Fan (cherry picked from commit 685c3fd05bf8e9d85ea9b33d4e28807d436cd5ca) Signed-off-by: Wenchen Fan --- .../spark/sql/hive/HiveMetastoreCatalog.scala | 4 +- .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 47 ++ 2 files changed, 50 insertions(+), 1 deletion(-) diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala index e02589e..05aa648 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala @@ -244,7 +244,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log paths = rootPath.toString :: Nil, userSpecifiedSchema = Option(updatedTable.dataSchema), bucketSpec = None, -options = options, +// Do not interpret the 'path' option at all when tables are read using the Hive +// source, since the URIs will already have been read from the table's LOCATION. 
+options = options.filter { case (k, _) => !k.equalsIgnoreCase("path") }, className = fileType).resolveRelation(), table = updatedTable) diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala index 1a6f684..af3d455 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala @@ -363,4 +363,51 @@ class DataSourceWithHiveMetastoreCatalogSuite } }) } + + Seq( +"parquet" -> ( + "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", + HiveUtils.CONVERT_METASTORE_PARQUET.key), +"orc" -> ( + "org.apache.hadoop.hive.ql.io.orc.OrcSerde", + HiveUtils.CONVERT_METASTORE_ORC.key) + ).foreach { case (format, (serde, formatConvertConf)) => +test("SPARK-28266: convertToLogicalRelation should not interpret `path` property when " + + s"reading Hive tables using $format file format") { + withTempPath(dir => { +val baseDir = dir.getAbsolutePath +withSQLConf(formatConvertConf -> "true") { + + withTable("t1") { +hiveClient.runSqlHive( + s""" + |CREATE TABLE t1 (id bigint) + |ROW FORMAT SERDE '$serde' + |WITH SERDEPROPERTIES ('path'='someNonLocationValue') + |STORED AS $format LOCATION '$baseDir' + |""".stripMargin) + +assertResult(0) { +
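The fix is the single `options.filter` line above: any `path` key, compared case-insensitively, is dropped from the Hive serde properties before they reach `DataSource`, so only the table's LOCATION gets listed. The same filter expressed as a Python sketch:

```python
def strip_path_option(options: dict) -> dict:
    # Remove the 'path' entry (any casing) from the serde properties,
    # mirroring: options.filter { case (k, _) => !k.equalsIgnoreCase("path") }
    return {k: v for k, v in options.items() if k.lower() != "path"}
```

Applied to the test table's properties, `strip_path_option({'path': 'someNonLocationValue', ...})` drops the bogus entry, so `InMemoryFileIndex` never tries to list `someNonLocationValue` as a second root path.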
[spark] branch branch-3.0 updated (25caec4 -> 78ec701)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from 25caec4 [SPARK-35027][CORE] Close the inputStream in FileAppender when writin… add 78ec701 [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables No new revisions were added by this update. Summary of changes: .../spark/sql/hive/HiveMetastoreCatalog.scala | 4 +- .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 47 ++ 2 files changed, 50 insertions(+), 1 deletion(-)
[spark] branch branch-3.1 updated: [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new b3ca63c [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables b3ca63c is described below commit b3ca63c3ed5e8c9728738ab6b1bc143c5d0d6219 Author: Shardul Mahadik AuthorDate: Wed Jul 21 22:40:39 2021 +0800 [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables ### What changes were proposed in this pull request? For non-datasource Hive tables, e.g. tables written outside of Spark (through Hive or Trino), we have certain optimizations in Spark where we use Spark ORC and Parquet datasources to read these tables ([Ref](https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L128)) rather than using the Hive serde. If such a table contains a `path` property, Spark will try to list this path property in addition to the table location when creating an `InMemoryFileIndex`. ([Ref](https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L575)) This can lead to wrong data if the `path` property points to a directory location, or an error if `path` is not a location. A concrete example is provided in [S [...] Since these tables were not written through Spark, Spark should not interpret this `path` property, as it can be set by an external system with a different meaning. ### Why are the changes needed? For better compatibility with Hive tables generated by other platforms (non-Spark) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test Closes #33328 from shardulm94/spark-28266.
Authored-by: Shardul Mahadik Signed-off-by: Wenchen Fan (cherry picked from commit 685c3fd05bf8e9d85ea9b33d4e28807d436cd5ca) Signed-off-by: Wenchen Fan --- .../spark/sql/hive/HiveMetastoreCatalog.scala | 4 +- .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 47 ++ 2 files changed, 50 insertions(+), 1 deletion(-) diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala index a89243c..c67bc7d 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala @@ -244,7 +244,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log paths = rootPath.toString :: Nil, userSpecifiedSchema = Option(updatedTable.dataSchema), bucketSpec = None, -options = options, +// Do not interpret the 'path' option at all when tables are read using the Hive +// source, since the URIs will already have been read from the table's LOCATION. 
+options = options.filter { case (k, _) => !k.equalsIgnoreCase("path") }, className = fileType).resolveRelation(), table = updatedTable) diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala index 1a6f684..af3d455 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala @@ -363,4 +363,51 @@ class DataSourceWithHiveMetastoreCatalogSuite } }) } + + Seq( +"parquet" -> ( + "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", + HiveUtils.CONVERT_METASTORE_PARQUET.key), +"orc" -> ( + "org.apache.hadoop.hive.ql.io.orc.OrcSerde", + HiveUtils.CONVERT_METASTORE_ORC.key) + ).foreach { case (format, (serde, formatConvertConf)) => +test("SPARK-28266: convertToLogicalRelation should not interpret `path` property when " + + s"reading Hive tables using $format file format") { + withTempPath(dir => { +val baseDir = dir.getAbsolutePath +withSQLConf(formatConvertConf -> "true") { + + withTable("t1") { +hiveClient.runSqlHive( + s""" + |CREATE TABLE t1 (id bigint) + |ROW FORMAT SERDE '$serde' + |WITH SERDEPROPERTIES ('path'='someNonLocationValue') + |STORED AS $format LOCATION '$baseDir' + |""".stripMargin) + +assertResult(0) { +
[spark] branch master updated (9c8a3d3 -> 685c3fd)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 9c8a3d3 [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed add 685c3fd [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables No new revisions were added by this update. Summary of changes: .../spark/sql/hive/HiveMetastoreCatalog.scala | 4 +- .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 47 ++ 2 files changed, 50 insertions(+), 1 deletion(-)
[spark] branch branch-3.2 updated: [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new f4291e3 [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed f4291e3 is described below commit f4291e373ee6e80456a42711072a75659bf1e2b5 Author: Wenchen Fan AuthorDate: Wed Jul 21 22:17:56 2021 +0800 [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed ### What changes were proposed in this pull request? Sometimes, AQE skew join optimization can fail with NPE. This is because AQE tries to get the shuffle block sizes, but some map outputs are missing due to the executor lost or something. This PR fixes this bug by skipping skew join handling if some map outputs are missing in the `MapOutputTracker`. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? a new UT Closes #33445 from cloud-fan/bug. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan (cherry picked from commit 9c8a3d3975fab1e21d9482ed327919f9904e25df) Signed-off-by: Wenchen Fan --- .../execution/adaptive/ShufflePartitionsUtil.scala | 9 -- .../sql/execution/ShufflePartitionsUtilSuite.scala | 33 -- 2 files changed, 38 insertions(+), 4 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala index 837764b..4ef7d33 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala @@ -362,11 +362,15 @@ object ShufflePartitionsUtil extends Logging { } /** - * Get the map size of the specific reduce shuffle Id. 
+ * Get the map size of the specific shuffle and reduce ID. Note that, some map outputs can be + * missing due to issues like executor lost. The size will be -1 for missing map outputs and the + * caller side should take care of it. */ private def getMapSizesForReduceId(shuffleId: Int, partitionId: Int): Array[Long] = { val mapOutputTracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster] - mapOutputTracker.shuffleStatuses(shuffleId).mapStatuses.map{_.getSizeForBlock(partitionId)} +mapOutputTracker.shuffleStatuses(shuffleId).withMapStatuses(_.map { stat => + if (stat == null) -1 else stat.getSizeForBlock(partitionId) +}) } /** @@ -378,6 +382,7 @@ object ShufflePartitionsUtil extends Logging { reducerId: Int, targetSize: Long): Option[Seq[PartialReducerPartitionSpec]] = { val mapPartitionSizes = getMapSizesForReduceId(shuffleId, reducerId) +if (mapPartitionSizes.exists(_ < 0)) return None val mapStartIndices = splitSizeListByTargetSize(mapPartitionSizes, targetSize) if (mapStartIndices.length > 1) { Some(mapStartIndices.indices.map { i => diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala index a38caa7..9f70c8a 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala @@ -17,10 +17,12 @@ package org.apache.spark.sql.execution -import org.apache.spark.{MapOutputStatistics, SparkFunSuite} +import org.apache.spark.{LocalSparkContext, MapOutputStatistics, MapOutputTrackerMaster, SparkConf, SparkContext, SparkEnv, SparkFunSuite} +import org.apache.spark.scheduler.MapStatus import org.apache.spark.sql.execution.adaptive.ShufflePartitionsUtil +import org.apache.spark.storage.BlockManagerId -class ShufflePartitionsUtilSuite extends SparkFunSuite { +class ShufflePartitionsUtilSuite 
extends SparkFunSuite with LocalSparkContext { private def checkEstimation( bytesByPartitionIdArray: Array[Array[Long]], @@ -765,4 +767,31 @@ class ShufflePartitionsUtilSuite extends SparkFunSuite { targetSize, 1, 0) assert(coalesced == Seq(expected1, expected2)) } + + test("SPARK-36228: Skip splitting a skewed partition when some map outputs are removed") { +sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) +val mapOutputTracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster] +mapOutputTracker.registerShuffle(shuffleId = 10, numMaps = 2, numReduces = 1) +mapOutputTracker.registerMapOutput(shuffleId = 10, mapIndex = 0, MapStatus( + BlockMa
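With the fix, a missing map status surfaces as a -1 block size from `getMapSizesForReduceId`, and skew handling returns `None` (skipping the split) instead of hitting a NullPointerException. The Python sketch below mirrors that guard; the greedy grouping in `split_by_target_size` is a deliberate simplification of Spark's `splitSizeListByTargetSize` heuristic, so treat the exact split points as illustrative:

```python
def split_by_target_size(sizes, target):
    # Greedily group consecutive map-output sizes so each group stays
    # roughly within `target` bytes; return each group's start index.
    starts, acc = [0], 0
    for i, size in enumerate(sizes):
        if i > 0 and acc > 0 and acc + size > target:
            starts.append(i)
            acc = 0
        acc += size
    return starts

def create_skew_partition_specs(map_sizes, target):
    # Return the split start indices for a skewed reduce partition, or
    # None when splitting is not possible/worthwhile. The first check is
    # the SPARK-36228 guard: a -1 size means the map output was removed
    # (e.g. executor lost), so we skip skew handling entirely.
    if any(s < 0 for s in map_sizes):
        return None
    starts = split_by_target_size(map_sizes, target)
    return starts if len(starts) > 1 else None
```

This matches the new unit test's setup: register two map outputs, remove one, and assert that no split specs are produced for the skewed partition.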
[spark] branch master updated (f56c7b7 -> 9c8a3d3)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from f56c7b7 [SPARK-36208][SQL] SparkScriptTransformation should support ANSI interval types add 9c8a3d3 [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed No new revisions were added by this update. Summary of changes: .../execution/adaptive/ShufflePartitionsUtil.scala | 9 -- .../sql/execution/ShufflePartitionsUtilSuite.scala | 33 -- 2 files changed, 38 insertions(+), 4 deletions(-)
[spark] branch master updated (94aece4 -> f56c7b7)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 94aece4 [SPARK-36020][SQL][FOLLOWUP] RemoveRedundantProjects should retain the LOGICAL_PLAN_TAG tag add f56c7b7 [SPARK-36208][SQL] SparkScriptTransformation should support ANSI interval types No new revisions were added by this update. Summary of changes: .../execution/BaseScriptTransformationExec.scala| 8 .../execution/BaseScriptTransformationSuite.scala | 21 + 2 files changed, 29 insertions(+)
[spark] branch branch-3.2 updated (64aee34 -> 06520b2)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git. from 64aee34 [SPARK-36153][SQL][DOCS] Update transform doc to match the current code add 06520b2 [SPARK-35658][DOCS] Document Parquet encryption feature in Spark SQL No new revisions were added by this update. Summary of changes: docs/sql-data-sources-parquet.md | 65 1 file changed, 65 insertions(+)
[spark] branch branch-3.2 updated (b5c0f6c -> 64aee34)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git. from b5c0f6c [SPARK-36020][SQL][FOLLOWUP] RemoveRedundantProjects should retain the LOGICAL_PLAN_TAG tag add 64aee34 [SPARK-36153][SQL][DOCS] Update transform doc to match the current code No new revisions were added by this update. Summary of changes: docs/sql-ref-syntax-qry-select-transform.md | 101 1 file changed, 88 insertions(+), 13 deletions(-)