[spark] branch branch-3.1 updated: [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 71fec19  [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake
71fec19 is described below

commit 71fec1903ccc0ad485d8479a909ea06b3d7a36bc
Author: Holden Karau
AuthorDate: Thu Jul 22 15:17:48 2021 +0900

    [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake

    ### What changes were proposed in this pull request?
    GHA probably doesn't have the same resources as Jenkins, so move down from 5 to 3 executors and give them a bit more time to come up.

    ### Why are the changes needed?
    The test is timing out in GHA.

    ### Does this PR introduce _any_ user-facing change?
    No, test-only change.

    ### How was this patch tested?
    Ran through GHA; verified no OOM during WorkerDecommissionExtended.

    Closes #33467 from holdenk/SPARK-36246-WorkerDecommissionExtendedSuite-flakes-in-GHA.

    Lead-authored-by: Holden Karau
    Co-authored-by: Holden Karau
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 89a83196ac37617a8d19209ec1d7fea6b52d0f25)
    Signed-off-by: Hyukjin Kwon
---
 .../spark/scheduler/WorkerDecommissionExtendedSuite.scala | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
index 129eb8b..66d3cf2 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
@@ -31,17 +31,17 @@ import org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend
 class WorkerDecommissionExtendedSuite extends SparkFunSuite with LocalSparkContext {
   private val conf = new org.apache.spark.SparkConf()
     .setAppName(getClass.getName)
-    .set(SPARK_MASTER, "local-cluster[5,1,512]")
-    .set(EXECUTOR_MEMORY, "512m")
+    .set(SPARK_MASTER, "local-cluster[3,1,384]")
+    .set(EXECUTOR_MEMORY, "384m")
     .set(DYN_ALLOCATION_ENABLED, true)
     .set(DYN_ALLOCATION_SHUFFLE_TRACKING_ENABLED, true)
-    .set(DYN_ALLOCATION_INITIAL_EXECUTORS, 5)
+    .set(DYN_ALLOCATION_INITIAL_EXECUTORS, 3)
     .set(DECOMMISSION_ENABLED, true)

   test("Worker decommission and executor idle timeout") {
     sc = new SparkContext(conf.set(DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT.key, "10s"))
     withSpark(sc) { sc =>
-      TestUtils.waitUntilExecutorsUp(sc, 5, 6)
+      TestUtils.waitUntilExecutorsUp(sc, 3, 8)
       val rdd1 = sc.parallelize(1 to 10, 2)
       val rdd2 = rdd1.map(x => (1, x))
       val rdd3 = rdd2.reduceByKey(_ + _)
@@ -53,10 +53,10 @@ class WorkerDecommissionExtendedSuite extends SparkFunSuite with LocalSparkConte
     }
   }

-  test("Decommission 4 executors from 5 executors in total") {
+  test("Decommission 2 executors from 3 executors in total") {
     sc = new SparkContext(conf)
     withSpark(sc) { sc =>
-      TestUtils.waitUntilExecutorsUp(sc, 5, 6)
+      TestUtils.waitUntilExecutorsUp(sc, 3, 8)
       val rdd1 = sc.parallelize(1 to 10, 200)
       val rdd2 = rdd1.map(x => (x % 100, x))
       val rdd3 = rdd2.reduceByKey(_ + _)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
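The fix above gives executors more time to register before the suite proceeds. As an illustration only (not Spark's actual `TestUtils.waitUntilExecutorsUp` implementation), the underlying pattern — poll a condition until a deadline, then fail — can be sketched in stdlib Python; the helper name and timings here are assumptions:

```python
import time

def wait_until(predicate, timeout_s, poll_interval_s=0.1):
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")

# Simulated executor registration: the count reaches the target after a few polls.
state = {"executors": 0}

def executors_up(target):
    state["executors"] += 1  # stand-in for executors registering over time
    return state["executors"] >= target

assert wait_until(lambda: executors_up(3), timeout_s=5)
```

Lowering the executor count and raising the timeout both reduce flakiness on a resource-constrained runner: fewer conditions have to become true, and each has longer to do so.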
[spark] branch branch-3.2 updated: [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new e9dd296  [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake
e9dd296 is described below

commit e9dd2969c2f599252fb84fac4c239d0055b0ff4e
Author: Holden Karau
AuthorDate: Thu Jul 22 15:17:48 2021 +0900

    [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake

    ### What changes were proposed in this pull request?
    GHA probably doesn't have the same resources as Jenkins, so move down from 5 to 3 executors and give them a bit more time to come up.

    ### Why are the changes needed?
    The test is timing out in GHA.

    ### Does this PR introduce _any_ user-facing change?
    No, test-only change.

    ### How was this patch tested?
    Ran through GHA; verified no OOM during WorkerDecommissionExtended.

    Closes #33467 from holdenk/SPARK-36246-WorkerDecommissionExtendedSuite-flakes-in-GHA.

    Lead-authored-by: Holden Karau
    Co-authored-by: Holden Karau
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 89a83196ac37617a8d19209ec1d7fea6b52d0f25)
    Signed-off-by: Hyukjin Kwon
---
 .../spark/scheduler/WorkerDecommissionExtendedSuite.scala | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
index 129eb8b..66d3cf2 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/WorkerDecommissionExtendedSuite.scala
@@ -31,17 +31,17 @@ import org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend
 class WorkerDecommissionExtendedSuite extends SparkFunSuite with LocalSparkContext {
   private val conf = new org.apache.spark.SparkConf()
     .setAppName(getClass.getName)
-    .set(SPARK_MASTER, "local-cluster[5,1,512]")
-    .set(EXECUTOR_MEMORY, "512m")
+    .set(SPARK_MASTER, "local-cluster[3,1,384]")
+    .set(EXECUTOR_MEMORY, "384m")
     .set(DYN_ALLOCATION_ENABLED, true)
     .set(DYN_ALLOCATION_SHUFFLE_TRACKING_ENABLED, true)
-    .set(DYN_ALLOCATION_INITIAL_EXECUTORS, 5)
+    .set(DYN_ALLOCATION_INITIAL_EXECUTORS, 3)
     .set(DECOMMISSION_ENABLED, true)

   test("Worker decommission and executor idle timeout") {
     sc = new SparkContext(conf.set(DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT.key, "10s"))
     withSpark(sc) { sc =>
-      TestUtils.waitUntilExecutorsUp(sc, 5, 6)
+      TestUtils.waitUntilExecutorsUp(sc, 3, 8)
       val rdd1 = sc.parallelize(1 to 10, 2)
       val rdd2 = rdd1.map(x => (1, x))
       val rdd3 = rdd2.reduceByKey(_ + _)
@@ -53,10 +53,10 @@ class WorkerDecommissionExtendedSuite extends SparkFunSuite with LocalSparkConte
     }
   }

-  test("Decommission 4 executors from 5 executors in total") {
+  test("Decommission 2 executors from 3 executors in total") {
     sc = new SparkContext(conf)
     withSpark(sc) { sc =>
-      TestUtils.waitUntilExecutorsUp(sc, 5, 6)
+      TestUtils.waitUntilExecutorsUp(sc, 3, 8)
       val rdd1 = sc.parallelize(1 to 10, 200)
       val rdd2 = rdd1.map(x => (x % 100, x))
       val rdd3 = rdd2.reduceByKey(_ + _)
[spark] branch master updated (dcc0aaa -> 89a8319)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from dcc0aaa  [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex
 add 89a8319  [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake

No new revisions were added by this update.

Summary of changes:
 .../spark/scheduler/WorkerDecommissionExtendedSuite.scala | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new f83a9ec  [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex
f83a9ec is described below

commit f83a9ec2fd71b6e808b6a70c9fd5d627cfd6ecfe
Author: Takuya UESHIN
AuthorDate: Wed Jul 21 22:34:04 2021 -0700

    [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex

    ### What changes were proposed in this pull request?
    Add `add_categories` to `CategoricalAccessor` and `CategoricalIndex`.

    ### Why are the changes needed?
    We should implement `add_categories` in `CategoricalAccessor` and `CategoricalIndex`.

    ### Does this PR introduce _any_ user-facing change?
    Yes, users will be able to use `add_categories`.

    ### How was this patch tested?
    Added some tests.

    Closes #33470 from ueshin/issues/SPARK-36214/add_categories.

    Authored-by: Takuya UESHIN
    Signed-off-by: Takuya UESHIN
    (cherry picked from commit dcc0aaa3efb2d441b2dfadb0c64dbc28ee197de5)
    Signed-off-by: Takuya UESHIN
---
 .../source/reference/pyspark.pandas/indexing.rst   |  1 +
 .../source/reference/pyspark.pandas/series.rst     |  1 +
 python/pyspark/pandas/categorical.py               | 84 ++--
 python/pyspark/pandas/indexes/category.py          | 46 ++++
 python/pyspark/pandas/missing/indexes.py           |  1 -
 .../pyspark/pandas/tests/indexes/test_category.py  | 12 ++++
 python/pyspark/pandas/tests/test_categorical.py    | 18 +
 7 files changed, 158 insertions(+), 5 deletions(-)

diff --git a/python/docs/source/reference/pyspark.pandas/indexing.rst b/python/docs/source/reference/pyspark.pandas/indexing.rst
index 4f84d91..b0b4cdd 100644
--- a/python/docs/source/reference/pyspark.pandas/indexing.rst
+++ b/python/docs/source/reference/pyspark.pandas/indexing.rst
@@ -175,6 +175,7 @@ Categorical components
    CategoricalIndex.codes
    CategoricalIndex.categories
    CategoricalIndex.ordered
+   CategoricalIndex.add_categories
    CategoricalIndex.as_ordered
    CategoricalIndex.as_unordered

diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst
index b718d79..6243a22 100644
--- a/python/docs/source/reference/pyspark.pandas/series.rst
+++ b/python/docs/source/reference/pyspark.pandas/series.rst
@@ -401,6 +401,7 @@ the ``Series.cat`` accessor.
    Series.cat.categories
    Series.cat.ordered
    Series.cat.codes
+   Series.cat.add_categories
    Series.cat.as_ordered
    Series.cat.as_unordered

diff --git a/python/pyspark/pandas/categorical.py b/python/pyspark/pandas/categorical.py
index aeba20d..a83c3c7 100644
--- a/python/pyspark/pandas/categorical.py
+++ b/python/pyspark/pandas/categorical.py
@@ -14,10 +14,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-from typing import List, Optional, Union, TYPE_CHECKING, cast
+from typing import Any, List, Optional, Union, TYPE_CHECKING, cast

 import pandas as pd
-from pandas.api.types import CategoricalDtype
+from pandas.api.types import CategoricalDtype, is_list_like

 from pyspark.pandas.internal import InternalField
 from pyspark.sql.types import StructField
@@ -165,8 +165,84 @@ class CategoricalAccessor(object):
             ),
         ).rename()

-    def add_categories(self, new_categories: pd.Index, inplace: bool = False) -> "ps.Series":
-        raise NotImplementedError()
+    def add_categories(
+        self, new_categories: Union[pd.Index, Any, List], inplace: bool = False
+    ) -> Optional["ps.Series"]:
+        """
+        Add new categories.
+
+        `new_categories` will be included at the last/highest place in the
+        categories and will be unused directly after this call.
+
+        Parameters
+        ----------
+        new_categories : category or list-like of category
+            The new categories to be included.
+        inplace : bool, default False
+            Whether or not to add the categories inplace or return a copy of
+            this categorical with added categories.
+
+        Returns
+        -------
+        Series or None
+            Categorical with new categories added or None if ``inplace=True``.
+
+        Raises
+        ------
+        ValueError
+            If the new categories include old categories or do not validate as
+            categories
+
+        Examples
+        --------
+        >>> s = ps.Series(list("abbccc"), dtype="category")
+        >>> s  # doctest: +SKIP
+        0    a
+        1    b
+        2    b
+        3    c
+        4    c
+        5    c
+        dtype: category
+        Categories (3, object): ['a', 'b', 'c']
+
+        >>> s.cat.add_ca
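The behaviour being ported mirrors pandas: a categorical is stored as integer codes plus an ordered list of category labels, and `add_categories` appends new, as-yet-unused labels without touching the codes. A stdlib-only sketch of that invariant (a simplified model for illustration, not the pyspark.pandas implementation):

```python
def add_categories(codes, categories, new_categories):
    """Return (codes, categories) with new_categories appended.

    Mirrors the pandas contract: the new categories must not already exist,
    and the existing codes are left untouched (new categories are unused).
    """
    if not isinstance(new_categories, list):
        new_categories = [new_categories]
    overlap = set(new_categories) & set(categories)
    if overlap:
        raise ValueError(f"new categories must not include old categories: {overlap}")
    return codes, categories + new_categories

# "abbccc" encoded against categories ['a', 'b', 'c']
codes, cats = [0, 1, 1, 2, 2, 2], ["a", "b", "c"]
codes, cats = add_categories(codes, cats, "x")
assert cats == ["a", "b", "c", "x"]
assert codes == [0, 1, 1, 2, 2, 2]  # codes unchanged; 'x' is unused
```

Because only metadata changes, the real implementation can rewrite the column's `CategoricalDtype` without touching the underlying Spark data.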
[spark] branch master updated (f3e2957 -> dcc0aaa)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from f3e2957  [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package
 add dcc0aaa  [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex

No new revisions were added by this update.

Summary of changes:
 .../source/reference/pyspark.pandas/indexing.rst   |  1 +
 .../source/reference/pyspark.pandas/series.rst     |  1 +
 python/pyspark/pandas/categorical.py               | 84 ++--
 python/pyspark/pandas/indexes/category.py          | 46 ++++
 python/pyspark/pandas/missing/indexes.py           |  1 -
 .../pyspark/pandas/tests/indexes/test_category.py  | 12 ++++
 python/pyspark/pandas/tests/test_categorical.py    | 18 +
 7 files changed, 158 insertions(+), 5 deletions(-)
[spark] branch master updated (de8e4be -> f3e2957)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from de8e4be  [SPARK-36063][SQL] Optimize OneRowRelation subqueries
 add f3e2957  [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/__init__.py | 6 ++
 1 file changed, 6 insertions(+)
[spark] branch branch-3.2 updated: [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new c42866e  [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package
c42866e is described below

commit c42866e627f2950934803b0c5f6ec34a5dbc9d70
Author: Hyukjin Kwon
AuthorDate: Thu Jul 22 14:21:43 2021 +0900

    [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package

    ### What changes were proposed in this pull request?
    This PR adds the version that added pandas API on Spark in PySpark documentation.

    ### Why are the changes needed?
    To document the version added.

    ### Does this PR introduce _any_ user-facing change?
    No to end user. Spark 3.2 is not released yet.

    ### How was this patch tested?
    Linter and documentation build.

    Closes #33473 from HyukjinKwon/SPARK-36253.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit f3e29574d9deb30df4d8d1fbd41a702bf2b15b5f)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/__init__.py | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/python/pyspark/pandas/__init__.py b/python/pyspark/pandas/__init__.py
index cf078ed..ea8a9ea 100644
--- a/python/pyspark/pandas/__init__.py
+++ b/python/pyspark/pandas/__init__.py
@@ -14,6 +14,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+
+"""
+.. versionadded:: 3.2.0
+    pandas API on Spark
+"""
+
 import os
 import sys
 from distutils.version import LooseVersion
[spark] branch branch-3.2 updated (d01e532 -> 31bb9e0)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.

from d01e532  [SPARK-36251][INFRA][BUILD][3.2] Cover GitHub Actions runs without SHA in testing script
 add 31bb9e0  [SPARK-36063][SQL] Optimize OneRowRelation subqueries

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/dsl/package.scala    |   7 +
 .../spark/sql/catalyst/expressions/subquery.scala  |   6 +-
 .../catalyst/optimizer/DecorrelateInnerQuery.scala |  27 +++-
 .../spark/sql/catalyst/optimizer/Optimizer.scala   |   1 +
 .../spark/sql/catalyst/optimizer/subquery.scala    |  45 ++
 .../spark/sql/catalyst/plans/QueryPlan.scala       |  17 +++
 .../org/apache/spark/sql/internal/SQLConf.scala    |   8 +
 .../optimizer/DecorrelateInnerQuerySuite.scala     |  20 +--
 .../OptimizeOneRowRelationSubquerySuite.scala      | 165 +
 .../scala/org/apache/spark/sql/SubquerySuite.scala |   3 +-
 10 files changed, 283 insertions(+), 16 deletions(-)
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowRelationSubquerySuite.scala
[spark] branch master updated (dcb7db5 -> de8e4be)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from dcb7db5  [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation
 add de8e4be  [SPARK-36063][SQL] Optimize OneRowRelation subqueries

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/dsl/package.scala    |   7 +
 .../spark/sql/catalyst/expressions/subquery.scala  |   6 +-
 .../catalyst/optimizer/DecorrelateInnerQuery.scala |  27 +++-
 .../spark/sql/catalyst/optimizer/Optimizer.scala   |   1 +
 .../spark/sql/catalyst/optimizer/subquery.scala    |  45 ++
 .../spark/sql/catalyst/plans/QueryPlan.scala       |  17 +++
 .../org/apache/spark/sql/internal/SQLConf.scala    |   8 +
 .../optimizer/DecorrelateInnerQuerySuite.scala     |  20 +--
 .../OptimizeOneRowRelationSubquerySuite.scala      | 165 +
 .../scala/org/apache/spark/sql/SubquerySuite.scala |   3 +-
 10 files changed, 283 insertions(+), 16 deletions(-)
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowRelationSubquerySuite.scala
[spark] branch branch-3.2 updated: [SPARK-36251][INFRA][BUILD][3.2] Cover GitHub Actions runs without SHA in testing script
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new d01e532  [SPARK-36251][INFRA][BUILD][3.2] Cover GitHub Actions runs without SHA in testing script
d01e532 is described below

commit d01e53208b75f14be80097b7cd8667bf0122acfc
Author: Hyukjin Kwon
AuthorDate: Thu Jul 22 11:47:36 2021 +0900

    [SPARK-36251][INFRA][BUILD][3.2] Cover GitHub Actions runs without SHA in testing script

    ### What changes were proposed in this pull request?
    This PR partially backports the fix in the script at https://github.com/apache/spark/pull/33410 to make the branch-3.2 build pass at https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=event%3Aschedule

    ### Why are the changes needed?
    To make the Scala 2.13 periodical job pass.

    ### Does this PR introduce _any_ user-facing change?
    No, dev-only.

    ### How was this patch tested?
    It is a logically non-conflicting backport.

    Closes #33472 from HyukjinKwon/SPARK-36251.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
---
 dev/run-tests.py | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/dev/run-tests.py b/dev/run-tests.py
index 2f05077..def0948 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -693,15 +693,19 @@ def main():
     included_tags = []
     excluded_tags = []
     if should_only_test_modules:
+        # We're likely in the forked repository
+        is_apache_spark_ref = os.environ.get("APACHE_SPARK_REF", "") != ""
+        # We're likely in the main repo build.
+        is_github_prev_sha = os.environ.get("GITHUB_PREV_SHA", "") != ""
+        # Otherwise, we're in either periodic job in Github Actions or somewhere else.
+
         # If we're running the tests in GitHub Actions, attempt to detect and test
         # only the affected modules.
-        if test_env == "github_actions":
-            if "APACHE_SPARK_REF" in os.environ and os.environ["APACHE_SPARK_REF"] != "":
-                # Fork repository
+        if test_env == "github_actions" and (is_apache_spark_ref or is_github_prev_sha):
+            if is_apache_spark_ref:
                 changed_files = identify_changed_files_from_git_commits(
                     "HEAD", target_ref=os.environ["APACHE_SPARK_REF"])
-            else:
-                # Build for each commit.
+            elif is_github_prev_sha:
                 changed_files = identify_changed_files_from_git_commits(
                     os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
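The patch turns an if/else that assumed one of two SHAs was always present into a three-way classification: fork build, main-repo build, or a scheduled run with neither variable set. A self-contained sketch of that decision (function name and return values are illustrative, not the actual `run-tests.py` API):

```python
import os

def detect_changed_files_source(test_env, environ=None):
    """Classify how the test script should pick a git diff base.

    Returns "fork" when APACHE_SPARK_REF is set, "main" when GITHUB_PREV_SHA
    is set, and None for runs (e.g. scheduled jobs) that carry neither.
    """
    if environ is None:
        environ = os.environ
    is_apache_spark_ref = environ.get("APACHE_SPARK_REF", "") != ""   # forked repository
    is_github_prev_sha = environ.get("GITHUB_PREV_SHA", "") != ""     # main repo build
    if test_env == "github_actions" and (is_apache_spark_ref or is_github_prev_sha):
        return "fork" if is_apache_spark_ref else "main"
    return None  # periodic job without a SHA, or not GitHub Actions at all

assert detect_changed_files_source("github_actions", {"APACHE_SPARK_REF": "master"}) == "fork"
assert detect_changed_files_source("github_actions", {"GITHUB_PREV_SHA": "abc123"}) == "main"
assert detect_changed_files_source("github_actions", {}) is None  # scheduled run: no SHA
```

The `None` branch is what makes the scheduled Scala 2.13 job pass: instead of failing on a missing environment variable, the script falls through to testing all modules.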
[spark] branch branch-3.2 updated: [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new fef7bf9  [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation
fef7bf9 is described below

commit fef7bf9fcc22e0e26a97a705b11f14ae36907d0c
Author: Kousuke Saruta
AuthorDate: Wed Jul 21 19:37:05 2021 -0700

    [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation

    ### What changes were proposed in this pull request?
    This PR upgrades `zstd-jni` from `1.5.0-2` to `1.5.0-3`. `1.5.0-3` was released a few days ago.
    This release resolves an issue about buffer size calculation, which can affect usage in Spark.
    https://github.com/luben/zstd-jni/releases/tag/v1.5.0-3

    ### Why are the changes needed?
    It might be a corner case that the skipping length is greater than `2^31 - 1`, but it's possible to affect Spark.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    CI.

    Closes #33464 from sarutak/upgrade-zstd-jni-1.5.0-3.

    Authored-by: Kousuke Saruta
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit dcb7db5370cffc1c309671195a325bef7829ec10)
    Signed-off-by: Dongjoon Hyun
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 2 +-
 pom.xml                                 | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 818899a..e0a4af0 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -243,4 +243,4 @@ xz/1.8//xz-1.8.jar
 zjsonpatch/0.3.0//zjsonpatch-0.3.0.jar
 zookeeper-jute/3.6.2//zookeeper-jute-3.6.2.jar
 zookeeper/3.6.2//zookeeper-3.6.2.jar
-zstd-jni/1.5.0-2//zstd-jni-1.5.0-2.jar
+zstd-jni/1.5.0-3//zstd-jni-1.5.0-3.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index bd80eb9..49c362c 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -211,4 +211,4 @@ xz/1.8//xz-1.8.jar
 zjsonpatch/0.3.0//zjsonpatch-0.3.0.jar
 zookeeper-jute/3.6.2//zookeeper-jute-3.6.2.jar
 zookeeper/3.6.2//zookeeper-3.6.2.jar
-zstd-jni/1.5.0-2//zstd-jni-1.5.0-2.jar
+zstd-jni/1.5.0-3//zstd-jni-1.5.0-3.jar
diff --git a/pom.xml b/pom.xml
index 3054401..5dc4cd4 100644
--- a/pom.xml
+++ b/pom.xml
@@ -709,7 +709,7 @@
       <dependency>
         <groupId>com.github.luben</groupId>
         <artifactId>zstd-jni</artifactId>
-        <version>1.5.0-2</version>
+        <version>1.5.0-3</version>
       </dependency>
       <dependency>
         <groupId>com.clearspring.analytics</groupId>
[spark] branch master updated (09bebc8 -> dcb7db5)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 09bebc8  [SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv`
 add dcb7db5  [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 2 +-
 pom.xml                                 | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
[spark] branch master updated (ad528a0 -> 09bebc8)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from ad528a0  [SPARK-32797][SPARK-32391][SPARK-33242][SPARK-32666][ANSIBLE] updating a bunch of python packages
 add 09bebc8  [SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv`

No new revisions were added by this update.

Summary of changes:
 docs/sql-migration-guide.md                        |  4 
 .../org/apache/spark/sql/DataFrameReader.scala     |  4 ++--
 .../sql/execution/datasources/csv/CSVSuite.scala   | 22 ++-
 .../sql/execution/datasources/json/JsonSuite.scala | 25 ++
 4 files changed, 52 insertions(+), 3 deletions(-)
[spark] branch master updated (d506815 -> ad528a0)
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from d506815  [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex
 add ad528a0  [SPARK-32797][SPARK-32391][SPARK-33242][SPARK-32666][ANSIBLE] updating a bunch of python packages

No new revisions were added by this update.

Summary of changes:
 .../files/python_environments/spark-py36-spec.txt | 64 +-
 1 file changed, 61 insertions(+), 3 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 24095bf  [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex
24095bf is described below

commit 24095bfb07f33e5bc8b9d39f2104fc5abb539d31
Author: Takuya UESHIN
AuthorDate: Wed Jul 21 11:31:30 2021 -0700

    [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex

    ### What changes were proposed in this pull request?
    Add categories setter to `CategoricalAccessor` and `CategoricalIndex`.

    ### Why are the changes needed?
    We should implement the categories setter in `CategoricalAccessor` and `CategoricalIndex`.

    ### Does this PR introduce _any_ user-facing change?
    Yes, users will be able to use the categories setter.

    ### How was this patch tested?
    Added some tests.

    Closes #33448 from ueshin/issues/SPARK-36188/categories_setter.

    Authored-by: Takuya UESHIN
    Signed-off-by: Takuya UESHIN
    (cherry picked from commit d506815a92510ff4cbd2c14cf17d41202f3ed609)
    Signed-off-by: Takuya UESHIN
---
 python/pyspark/pandas/categorical.py               | 18 ++++---
 python/pyspark/pandas/indexes/category.py          | 16 ++---
 .../pyspark/pandas/tests/indexes/test_category.py  | 27 ++++++
 python/pyspark/pandas/tests/test_categorical.py    | 14 +++
 4 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/pandas/categorical.py b/python/pyspark/pandas/categorical.py
index b8cc88c..aeba20d 100644
--- a/python/pyspark/pandas/categorical.py
+++ b/python/pyspark/pandas/categorical.py
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-from typing import Optional, TYPE_CHECKING, cast
+from typing import List, Optional, Union, TYPE_CHECKING, cast

 import pandas as pd
 from pandas.api.types import CategoricalDtype
@@ -89,8 +89,20 @@ class CategoricalAccessor(object):
         return self._dtype.categories

     @categories.setter
-    def categories(self, categories: pd.Index) -> None:
-        raise NotImplementedError()
+    def categories(self, categories: Union[pd.Index, List]) -> None:
+        dtype = CategoricalDtype(categories, ordered=self.ordered)
+
+        if len(self.categories) != len(dtype.categories):
+            raise ValueError(
+                "new categories need to have the same number of items as the old categories!"
+            )
+
+        internal = self._data._psdf._internal.with_new_spark_column(
+            self._data._column_label,
+            self._data.spark.column,
+            field=self._data._internal.data_fields[0].copy(dtype=dtype),
+        )
+        self._data._psdf._update_internal_frame(internal)

     @property
     def ordered(self) -> bool:
diff --git a/python/pyspark/pandas/indexes/category.py b/python/pyspark/pandas/indexes/category.py
index a7ad2a0..1b65886 100644
--- a/python/pyspark/pandas/indexes/category.py
+++ b/python/pyspark/pandas/indexes/category.py
@@ -15,7 +15,7 @@
 # limitations under the License.
 #
 from functools import partial
-from typing import Any, Optional, cast, no_type_check
+from typing import Any, List, Optional, Union, cast, no_type_check

 import pandas as pd
 from pandas.api.types import is_hashable, CategoricalDtype
@@ -174,8 +174,18 @@ class CategoricalIndex(Index):
         return self.dtype.categories

     @categories.setter
-    def categories(self, categories: pd.Index) -> None:
-        raise NotImplementedError()
+    def categories(self, categories: Union[pd.Index, List]) -> None:
+        dtype = CategoricalDtype(categories, ordered=self.ordered)
+
+        if len(self.categories) != len(dtype.categories):
+            raise ValueError(
+                "new categories need to have the same number of items as the old categories!"
+            )
+
+        internal = self._psdf._internal.copy(
+            index_fields=[self._internal.index_fields[0].copy(dtype=dtype)]
+        )
+        self._psdf._update_internal_frame(internal)

     @property
     def ordered(self) -> bool:
diff --git a/python/pyspark/pandas/tests/indexes/test_category.py b/python/pyspark/pandas/tests/indexes/test_category.py
index 02752ec..d04f896 100644
--- a/python/pyspark/pandas/tests/indexes/test_category.py
+++ b/python/pyspark/pandas/tests/indexes/test_category.py
@@ -67,6 +67,33 @@ class CategoricalIndexTest(PandasOnSparkTestCase, TestUtils):
         self.assert_eq(psidx.codes, pd.Index(pidx.codes))
         self.assert_eq(psidx.ordered, pidx.ordered)

+    def test_categories_setter(self):
+        pdf = pd.DataFrame(
+            {
+                "a": pd.Categorical([1, 2, 3, 1, 2, 3]),
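The key contract of the setter shown above is the length check: assigning categories relabels the existing codes rather than re-encoding the data, so the new label list must have exactly as many items as the old one. A stdlib-only sketch of that semantics (a simplified model for illustration, not the pyspark.pandas internals):

```python
def set_categories(codes, old_categories, new_categories):
    """Relabel a categorical: codes stay put, labels are swapped one-for-one."""
    # Same-length check mirrors the ValueError raised by the new setter.
    if len(old_categories) != len(new_categories):
        raise ValueError(
            "new categories need to have the same number of items as the old categories!")
    return codes, list(new_categories)  # codes are reinterpreted against the new labels

codes, cats = [0, 1, 2, 0, 1, 2], ["1", "2", "3"]
codes, cats = set_categories(codes, cats, ["x", "y", "z"])
assert cats == ["x", "y", "z"]
assert [cats[c] for c in codes] == ["x", "y", "z", "x", "y", "z"]
```

Because only the dtype metadata changes, the real implementation can swap in a new `CategoricalDtype` via the internal frame without touching the Spark column data.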
[spark] branch master updated (4cd6cfc -> d506815)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 4cd6cfc  [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
 add d506815  [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/categorical.py               | 18 ++++---
 python/pyspark/pandas/indexes/category.py          | 16 ++---
 .../pyspark/pandas/tests/indexes/test_category.py  | 27 ++++++
 python/pyspark/pandas/tests/test_categorical.py    | 14 +++
 4 files changed, 69 insertions(+), 6 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36208][SQL][3.2] SparkScriptTransformation should support ANSI interval types
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 468165a [SPARK-36208][SQL][3.2] SparkScriptTransformation should support ANSI interval types 468165a is described below commit 468165ae52aa788e3fa59f385225b90c616bfa0f Author: Kousuke Saruta AuthorDate: Wed Jul 21 20:54:18 2021 +0300 [SPARK-36208][SQL][3.2] SparkScriptTransformation should support ANSI interval types ### What changes were proposed in this pull request? This PR changes `BaseScriptTransformationExec` for `SparkScriptTransformationExec` to support ANSI interval types. ### Why are the changes needed? `SparkScriptTransformationExec` support `CalendarIntervalType` so it's better to support ANSI interval types as well. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Authored-by: Kousuke Saruta Signed-off-by: Max Gekk (cherry picked from commit f56c7b71ff27e6f5379f3699c2dcb5f79a0ae791) Signed-off-by: Max Gekk Closes #33463 from MaxGekk/sarutak_script-transformation-interval-3.2. 
Authored-by: Kousuke Saruta Signed-off-by: Max Gekk --- .../execution/BaseScriptTransformationExec.scala| 8 .../execution/BaseScriptTransformationSuite.scala | 21 + 2 files changed, 29 insertions(+) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala index 7835981..e249cd6 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala @@ -223,6 +223,14 @@ trait BaseScriptTransformationExec extends UnaryExecNode { case CalendarIntervalType => wrapperConvertException( data => IntervalUtils.stringToInterval(UTF8String.fromString(data)), converter) + case YearMonthIntervalType(start, end) => wrapperConvertException( +data => IntervalUtils.monthsToPeriod( + IntervalUtils.castStringToYMInterval(UTF8String.fromString(data), start, end)), +converter) + case DayTimeIntervalType(start, end) => wrapperConvertException( +data => IntervalUtils.microsToDuration( + IntervalUtils.castStringToDTInterval(UTF8String.fromString(data), start, end)), +converter) case _: ArrayType | _: MapType | _: StructType => val complexTypeFactory = JsonToStructs(attr.dataType, ioschema.outputSerdeProps.toMap, Literal(null), Some(conf.sessionLocalTimeZone)) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala index c845dd81..9d8fcda 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala @@ -633,6 +633,27 @@ abstract class BaseScriptTransformationSuite extends SparkPlanTest with SQLTestU } } } + + test("SPARK-36208: TRANSFORM should support 
ANSI interval (no serde)") { +assume(TestUtils.testCommandAvailable("python")) +withTempView("v") { + val df = Seq( +(Period.of(1, 2, 0), Duration.ofDays(1).plusHours(2).plusMinutes(3).plusSeconds(4)) + ).toDF("ym", "dt") + + checkAnswer( +df, +(child: SparkPlan) => createScriptTransformationExec( + script = "cat", + output = Seq( +AttributeReference("ym", YearMonthIntervalType())(), +AttributeReference("dt", DayTimeIntervalType())()), + child = child, + ioschema = defaultIOSchema +), +df.select($"ym", $"dt").collect()) +} + } } case class ExceptionInjectingOperator(child: SparkPlan) extends UnaryExecNode {
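The new converter cases hand the raw strings emitted by the user script (here just `cat`) to `castStringToYMInterval`/`castStringToDTInterval`, turning them back into interval values. As a rough illustration of that string-to-interval step, here is a Python sketch that parses the two ANSI interval forms into total months and total microseconds; the accepted formats below are simplified assumptions for illustration, not Spark's full parser:

```python
import re

def parse_year_month_interval(s: str) -> int:
    # Parse a year-month interval string like '1-2' into total months.
    # Illustrative only: Spark's castStringToYMInterval accepts more forms
    # and validates the requested start/end field range.
    m = re.fullmatch(r"([+-]?)(\d+)-(\d+)", s.strip())
    if not m:
        raise ValueError(f"Cannot parse year-month interval: {s!r}")
    sign = -1 if m.group(1) == "-" else 1
    years, months = int(m.group(2)), int(m.group(3))
    return sign * (years * 12 + months)

def parse_day_time_interval(s: str) -> int:
    # Parse a day-time interval like '1 02:03:04' into total microseconds.
    m = re.fullmatch(r"([+-]?)(\d+) (\d+):(\d+):(\d+(?:\.\d+)?)", s.strip())
    if not m:
        raise ValueError(f"Cannot parse day-time interval: {s!r}")
    sign = -1 if m.group(1) == "-" else 1
    days, hours, minutes = int(m.group(2)), int(m.group(3)), int(m.group(4))
    seconds = float(m.group(5))
    total_us = ((days * 24 + hours) * 60 + minutes) * 60 * 1_000_000
    total_us += round(seconds * 1_000_000)
    return sign * total_us
```

The test data above (`Period.of(1, 2, 0)` and a 1d 2h 3m 4s `Duration`) corresponds to `parse_year_month_interval("1-2")` and `parse_day_time_interval("1 02:03:04")` in this sketch.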
[spark] branch branch-3.2 updated: [SPARK-36227][SQL][3.2] Remove TimestampNTZ type support in Spark 3.2
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 99eb3ff [SPARK-36227][SQL][3.2] Remove TimestampNTZ type support in Spark 3.2 99eb3ff is described below commit 99eb3ff226dd62691211db0ed9184195e702e20c Author: Gengliang Wang AuthorDate: Wed Jul 21 09:55:09 2021 -0700 [SPARK-36227][SQL][3.2] Remove TimestampNTZ type support in Spark 3.2 ### What changes were proposed in this pull request? Remove TimestampNTZ type support in the production code of Spark 3.2. To achieve the goal, this PR adds the check "Utils.isTesting" in the following code branches: - keyword "timestamp_ntz" and "timestamp_ltz" in parser - New expressions from https://issues.apache.org/jira/browse/SPARK-35662 - Using java.time.LocalDateTime as the external type for TimestampNTZType - `SQLConf.timestampType` which determines the default timestamp type of Spark SQL. This is to minimize the code difference from the master branch, so that future users won't think TimestampNTZ is already available in Spark 3.2. The downside is that users can still find TimestampNTZType under package `org.apache.spark.sql.types`. There should be nothing left other than this. ### Why are the changes needed? As of now, there are some blockers for delivering the TimestampNTZ project in Spark 3.2: - In the Hive Thrift server, both TimestampType and TimestampNTZType are mapped to the same timestamp type, which can cause confusion for users. - For the Parquet data source, the new written TimestampNTZType Parquet columns will be read as TimestampType in old Spark releases. Also, we need to decide the merge schema for files mixed with TimestampType and TimestampNTZ type. - The type coercion rules for TimestampNTZType are incomplete.
For example, what should the data type of the IN clause `IN(Timestamp'2020-01-01 00:00:00', TimestampNtz'2020-01-01 00:00:00')` be? - It is tricky to support TimestampNTZType in JSON/CSV data readers. We need to avoid regressions as much as we can. There are 10 days left for the expected 3.2 RC date. So, I propose to **release the TimestampNTZ type in Spark 3.3 instead of Spark 3.2**, so that we have enough time to make considered designs for these issues. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing Unit tests + manual tests from spark-shell to validate the changes are gone. New functions ``` spark.sql("select to_timestamp_ntz'2021-01-01 00:00:00'").show() spark.sql("select to_timestamp_ltz'2021-01-01 00:00:00'").show() spark.sql("select make_timestamp_ntz(1,1,1,1,1,1)").show() spark.sql("select make_timestamp_ltz(1,1,1,1,1,1)").show() spark.sql("select localtimestamp()").show() ``` The SQL configuration `spark.sql.timestampType` should not work in 3.2 ``` spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ") spark.sql("select make_timestamp(1,1,1,1,1,1)").schema spark.sql("select to_timestamp('2021-01-01 00:00:00')").schema spark.sql("select timestamp'2021-01-01 00:00:00'").schema Seq((1, java.sql.Timestamp.valueOf("2021-01-01 00:00:00"))).toDF("i", "ts").write.partitionBy("ts").parquet("/tmp/test") spark.read.parquet("/tmp/test").schema ``` LocalDateTime is not supported as a built-in external type: ``` Seq(LocalDateTime.now()).toDF() org.apache.spark.sql.catalyst.expressions.Literal(java.time.LocalDateTime.now()) org.apache.spark.sql.catalyst.expressions.Literal(0L, TimestampNTZType) ``` Closes #33444 from gengliangwang/banNTZ.
Authored-by: Gengliang Wang Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/sql/Encoders.scala | 8 --- .../sql/catalyst/CatalystTypeConverters.scala | 4 +++- .../spark/sql/catalyst/JavaTypeInference.scala | 11 ++--- .../spark/sql/catalyst/ScalaReflection.scala | 12 +++--- .../sql/catalyst/analysis/FunctionRegistry.scala | 22 - .../apache/spark/sql/catalyst/dsl/package.scala| 4 .../spark/sql/catalyst/encoders/RowEncoder.scala | 10 +--- .../spark/sql/catalyst/expressions/literals.scala | 14 +++ .../spark/sql/catalyst/parser/AstBuilder.scala | 11 + .../org/apache/spark/sql/internal/SQLConf.scala| 15 +++- .../scala/org/apache/spark/sql/SQLImplicits.scala | 3 --- .../org/apache/spark/sql/JavaDatasetSuite.java | 8 --- .../apache/spark/sql/DataFrameAggregateSuite.scala | 13 +- .../scala/org/apache/spark/sql/DatasetSuite.scala | 5 .../test/scala/org/apache/spark/sql/UDFSuite.scala | 28 -- 15 files changed, 70 inserti
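Rather than deleting the TimestampNTZ code paths, the change guards each entry point with a `Utils.isTesting` check, which keeps the diff against master small while hiding the feature from users. A minimal Python sketch of that gating pattern follows; the function names and the exact environment check are illustrative of the idea, not Spark's implementation:

```python
import os

def is_testing() -> bool:
    # Spark's Utils.isTesting consults the testing flag (e.g. the
    # SPARK_TESTING environment variable); this sketch mirrors the idea.
    return os.environ.get("SPARK_TESTING") is not None

def resolve_timestamp_type(name: str) -> str:
    # Resolve a timestamp type keyword, hiding TIMESTAMP_NTZ outside tests
    # so the unreleased type cannot be reached from user code.
    supported = {"TIMESTAMP", "TIMESTAMP_LTZ"}
    if is_testing():
        supported.add("TIMESTAMP_NTZ")
    upper = name.upper()
    if upper not in supported:
        raise ValueError(f"Unsupported timestamp type: {name}")
    return upper
```

With the flag unset, `resolve_timestamp_type("timestamp_ntz")` raises, matching the intent that `spark.sql.timestampType` "should not work in 3.2"; with it set, the unit tests can still exercise the code.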
[spark] branch branch-3.0 updated (78ec701 -> 0cfea30)
This is an automated email from the ASF dual-hosted git repository. yao pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from 78ec701 [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables add 0cfea30 [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec No new revisions were added by this update. Summary of changes: .../spark/sql/execution/command/tables.scala | 7 - .../test/resources/sql-tests/inputs/describe.sql | 2 ++ .../resources/sql-tests/results/describe.sql.out | 33 +- 3 files changed, 40 insertions(+), 2 deletions(-)
[spark] branch branch-3.1 updated: [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new f818a9f [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec f818a9f is described below commit f818a9f43bd6e36fffdf788f2dfc13cd01873d02 Author: Kent Yao AuthorDate: Thu Jul 22 00:52:31 2021 +0800 [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec ### What changes were proposed in this pull request? This fixes a case sensitivity issue for desc table commands with partition specified. ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? yes, but it's a bugfix ### How was this patch tested? new tests before ``` +-- !query +DESC EXTENDED t PARTITION (C='Us', D=1) +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +Partition spec is invalid. The spec (C, D) must match the partition spec (c, d) defined in table '`default`.`t`' + ``` after https://github.com/apache/spark/pull/33424/files#diff-554189c49950974a948f99fa9b7436f615052511660c6a0ae3062fa8ca0a327cR328 Closes #33424 from yaooqinn/SPARK-36213. 
Authored-by: Kent Yao Signed-off-by: Kent Yao (cherry picked from commit 4cd6cfc773da726a90d41bfc590ea9188c17d5ae) Signed-off-by: Kent Yao --- .../spark/sql/execution/command/tables.scala | 7 - .../test/resources/sql-tests/inputs/describe.sql | 2 ++ .../resources/sql-tests/results/describe.sql.out | 33 +- 3 files changed, 40 insertions(+), 2 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala index f88200e..b557fd2 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala @@ -693,7 +693,12 @@ case class DescribeTableCommand( s"DESC PARTITION is not allowed on a view: ${table.identifier}") } DDLUtils.verifyPartitionProviderIsHive(spark, metadata, "DESC PARTITION") -val partition = catalog.getPartition(table, partitionSpec) +val normalizedPartSpec = PartitioningUtils.normalizePartitionSpec( + partitionSpec, + metadata.partitionSchema, + table.quotedString, + spark.sessionState.conf.resolver) +val partition = catalog.getPartition(table, normalizedPartSpec) if (isExtended) describeFormattedDetailedPartitionInfo(table, metadata, partition, result) } diff --git a/sql/core/src/test/resources/sql-tests/inputs/describe.sql b/sql/core/src/test/resources/sql-tests/inputs/describe.sql index a0ee932..deff5bb 100644 --- a/sql/core/src/test/resources/sql-tests/inputs/describe.sql +++ b/sql/core/src/test/resources/sql-tests/inputs/describe.sql @@ -43,6 +43,8 @@ DESC EXTENDED t PARTITION (c='Us', d=1); DESC FORMATTED t PARTITION (c='Us', d=1); +DESC EXTENDED t PARTITION (C='Us', D=1); + -- NoSuchPartitionException: Partition not found in table DESC t PARTITION (c='Us', d=2); diff --git a/sql/core/src/test/resources/sql-tests/results/describe.sql.out b/sql/core/src/test/resources/sql-tests/results/describe.sql.out index b0d650f..c14d312 100644 
--- a/sql/core/src/test/resources/sql-tests/results/describe.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/describe.sql.out @@ -1,5 +1,5 @@ -- Automatically generated by SQLQueryTestSuite --- Number of queries: 41 +-- Number of queries: 42 -- !query @@ -325,6 +325,37 @@ Storage Properties [a=1, b=2] -- !query +DESC EXTENDED t PARTITION (C='Us', D=1) +-- !query schema +struct +-- !query output +a string +b int +c string +d string +# Partition Information +# col_name data_type comment +c string +d string + +# Detailed Partition Information +Database default +Table t +Partition Values [c=Us, d=1] +Location [not included in c
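The fix routes the user-supplied spec through `PartitioningUtils.normalizePartitionSpec`, which matches each key against the table's partition schema with the session resolver (case-insensitive by default), so `PARTITION (C='Us', D=1)` resolves to the defined columns `(c, d)` before the catalog lookup. A simplified Python equivalent of that normalization step, assuming a case-insensitive resolver (details here are illustrative, not Spark's exact code):

```python
def normalize_partition_spec(spec, partition_columns, table_name):
    # Map user spec keys (e.g. 'C', 'D') onto the schema's column names
    # ('c', 'd') via case-insensitive matching; fail on unknown keys.
    normalized = {}
    for key, value in spec.items():
        matches = [c for c in partition_columns if c.lower() == key.lower()]
        if not matches:
            raise ValueError(
                f"{key} is not a valid partition column in table {table_name}")
        normalized[matches[0]] = value
    return normalized
```

After normalization the spec `(C, D)` matches the defined spec `(c, d)`, which is why the `DESC EXTENDED t PARTITION (C='Us', D=1)` query in the new golden file now succeeds instead of raising the AnalysisException shown above.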
[spark] branch branch-3.2 updated (1ce678b -> 7d36373)
This is an automated email from the ASF dual-hosted git repository. yao pushed a change to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git. from 1ce678b [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables add 7d36373 [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec No new revisions were added by this update. Summary of changes: .../spark/sql/execution/command/tables.scala | 7 - .../test/resources/sql-tests/inputs/describe.sql | 2 ++ .../resources/sql-tests/results/describe.sql.out | 33 +- 3 files changed, 40 insertions(+), 2 deletions(-)
[spark] branch master updated (685c3fd -> 4cd6cfc)
This is an automated email from the ASF dual-hosted git repository. yao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 685c3fd [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables add 4cd6cfc [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec No new revisions were added by this update. Summary of changes: .../spark/sql/execution/command/tables.scala | 7 - .../test/resources/sql-tests/inputs/describe.sql | 2 ++ .../resources/sql-tests/results/describe.sql.out | 33 +- 3 files changed, 40 insertions(+), 2 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 1ce678b [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables 1ce678b is described below commit 1ce678b2aa4d3701b3e743c3657685942c9abfcd Author: Shardul Mahadik AuthorDate: Wed Jul 21 22:40:39 2021 +0800 [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables ### What changes were proposed in this pull request? For non-datasource Hive tables, e.g. tables written outside of Spark (through Hive or Trino), we have certain optimizations in Spark where we use Spark ORC and Parquet datasources to read these tables ([Ref](https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L128)) rather than using the Hive serde. If such a table contains a `path` property, Spark will try to list this path property in addition to the table location when creating an `InMemoryFileIndex`. ([Ref](https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L575)) This can lead to wrong data if the `path` property points to a directory location, or an error if `path` is not a location. A concrete example is provided in [S [...] Since these tables were not written through Spark, Spark should not interpret this `path` property, as it can be set by an external system with a different meaning. ### Why are the changes needed? For better compatibility with Hive tables generated by other platforms (non-Spark) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test Closes #33328 from shardulm94/spark-28266.
Authored-by: Shardul Mahadik Signed-off-by: Wenchen Fan (cherry picked from commit 685c3fd05bf8e9d85ea9b33d4e28807d436cd5ca) Signed-off-by: Wenchen Fan --- .../spark/sql/hive/HiveMetastoreCatalog.scala | 4 +- .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 47 ++ 2 files changed, 50 insertions(+), 1 deletion(-) diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala index e02589e..05aa648 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala @@ -244,7 +244,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log paths = rootPath.toString :: Nil, userSpecifiedSchema = Option(updatedTable.dataSchema), bucketSpec = None, -options = options, +// Do not interpret the 'path' option at all when tables are read using the Hive +// source, since the URIs will already have been read from the table's LOCATION. 
+options = options.filter { case (k, _) => !k.equalsIgnoreCase("path") }, className = fileType).resolveRelation(), table = updatedTable) diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala index 1a6f684..af3d455 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala @@ -363,4 +363,51 @@ class DataSourceWithHiveMetastoreCatalogSuite } }) } + + Seq( +"parquet" -> ( + "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", + HiveUtils.CONVERT_METASTORE_PARQUET.key), +"orc" -> ( + "org.apache.hadoop.hive.ql.io.orc.OrcSerde", + HiveUtils.CONVERT_METASTORE_ORC.key) + ).foreach { case (format, (serde, formatConvertConf)) => +test("SPARK-28266: convertToLogicalRelation should not interpret `path` property when " + + s"reading Hive tables using $format file format") { + withTempPath(dir => { +val baseDir = dir.getAbsolutePath +withSQLConf(formatConvertConf -> "true") { + + withTable("t1") { +hiveClient.runSqlHive( + s""" + |CREATE TABLE t1 (id bigint) + |ROW FORMAT SERDE '$serde' + |WITH SERDEPROPERTIES ('path'='someNonLocationValue') + |STORED AS $format LOCATION '$baseDir' + |""".stripMargin) + +assertResult(0) { +
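The fix is the single `options.filter` line above: any `path` key, compared case-insensitively, is dropped from the Hive serde properties before they reach `DataSource`, so only the table's LOCATION gets listed. The same filter expressed as a Python sketch:

```python
def strip_path_option(options: dict) -> dict:
    # Remove the 'path' entry (any casing) from the serde properties,
    # mirroring: options.filter { case (k, _) => !k.equalsIgnoreCase("path") }
    return {k: v for k, v in options.items() if k.lower() != "path"}
```

Applied to the test table's properties, `strip_path_option({'path': 'someNonLocationValue', ...})` drops the bogus entry, so `InMemoryFileIndex` never tries to list `someNonLocationValue` as a second root path.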
[spark] branch branch-3.0 updated (25caec4 -> 78ec701)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from 25caec4 [SPARK-35027][CORE] Close the inputStream in FileAppender when writin… add 78ec701 [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables No new revisions were added by this update. Summary of changes: .../spark/sql/hive/HiveMetastoreCatalog.scala | 4 +- .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 47 ++ 2 files changed, 50 insertions(+), 1 deletion(-)
[spark] branch branch-3.1 updated: [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new b3ca63c [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables b3ca63c is described below commit b3ca63c3ed5e8c9728738ab6b1bc143c5d0d6219 Author: Shardul Mahadik AuthorDate: Wed Jul 21 22:40:39 2021 +0800 [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables ### What changes were proposed in this pull request? For non-datasource Hive tables, e.g. tables written outside of Spark (through Hive or Trino), we have certain optimizations in Spark where we use Spark ORC and Parquet datasources to read these tables ([Ref](https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L128)) rather than using the Hive serde. If such a table contains a `path` property, Spark will try to list this path property in addition to the table location when creating an `InMemoryFileIndex`. ([Ref](https://github.com/apache/spark/blob/fbf53dee37129a493a4e5d5a007625b35f44fbda/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L575)) This can lead to wrong data if the `path` property points to a directory location, or an error if `path` is not a location. A concrete example is provided in [S [...] Since these tables were not written through Spark, Spark should not interpret this `path` property, as it can be set by an external system with a different meaning. ### Why are the changes needed? For better compatibility with Hive tables generated by other platforms (non-Spark) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test Closes #33328 from shardulm94/spark-28266.
Authored-by: Shardul Mahadik Signed-off-by: Wenchen Fan (cherry picked from commit 685c3fd05bf8e9d85ea9b33d4e28807d436cd5ca) Signed-off-by: Wenchen Fan --- .../spark/sql/hive/HiveMetastoreCatalog.scala | 4 +- .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 47 ++ 2 files changed, 50 insertions(+), 1 deletion(-) diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala index a89243c..c67bc7d 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala @@ -244,7 +244,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log paths = rootPath.toString :: Nil, userSpecifiedSchema = Option(updatedTable.dataSchema), bucketSpec = None, -options = options, +// Do not interpret the 'path' option at all when tables are read using the Hive +// source, since the URIs will already have been read from the table's LOCATION. 
+options = options.filter { case (k, _) => !k.equalsIgnoreCase("path") }, className = fileType).resolveRelation(), table = updatedTable) diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala index 1a6f684..af3d455 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala @@ -363,4 +363,51 @@ class DataSourceWithHiveMetastoreCatalogSuite } }) } + + Seq( +"parquet" -> ( + "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", + HiveUtils.CONVERT_METASTORE_PARQUET.key), +"orc" -> ( + "org.apache.hadoop.hive.ql.io.orc.OrcSerde", + HiveUtils.CONVERT_METASTORE_ORC.key) + ).foreach { case (format, (serde, formatConvertConf)) => +test("SPARK-28266: convertToLogicalRelation should not interpret `path` property when " + + s"reading Hive tables using $format file format") { + withTempPath(dir => { +val baseDir = dir.getAbsolutePath +withSQLConf(formatConvertConf -> "true") { + + withTable("t1") { +hiveClient.runSqlHive( + s""" + |CREATE TABLE t1 (id bigint) + |ROW FORMAT SERDE '$serde' + |WITH SERDEPROPERTIES ('path'='someNonLocationValue') + |STORED AS $format LOCATION '$baseDir' + |""".stripMargin) + +assertResult(0) { +
[spark] branch master updated (9c8a3d3 -> 685c3fd)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 9c8a3d3 [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed add 685c3fd [SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables No new revisions were added by this update. Summary of changes: .../spark/sql/hive/HiveMetastoreCatalog.scala | 4 +- .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 47 ++ 2 files changed, 50 insertions(+), 1 deletion(-)
[spark] branch branch-3.2 updated: [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new f4291e3 [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed f4291e3 is described below commit f4291e373ee6e80456a42711072a75659bf1e2b5 Author: Wenchen Fan AuthorDate: Wed Jul 21 22:17:56 2021 +0800 [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed ### What changes were proposed in this pull request? Sometimes, AQE skew join optimization can fail with NPE. This is because AQE tries to get the shuffle block sizes, but some map outputs are missing due to the executor lost or something. This PR fixes this bug by skipping skew join handling if some map outputs are missing in the `MapOutputTracker`. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? a new UT Closes #33445 from cloud-fan/bug. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan (cherry picked from commit 9c8a3d3975fab1e21d9482ed327919f9904e25df) Signed-off-by: Wenchen Fan --- .../execution/adaptive/ShufflePartitionsUtil.scala | 9 -- .../sql/execution/ShufflePartitionsUtilSuite.scala | 33 -- 2 files changed, 38 insertions(+), 4 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala index 837764b..4ef7d33 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala @@ -362,11 +362,15 @@ object ShufflePartitionsUtil extends Logging { } /** - * Get the map size of the specific reduce shuffle Id. 
+ * Get the map size of the specific shuffle and reduce ID. Note that, some map outputs can be + * missing due to issues like executor lost. The size will be -1 for missing map outputs and the + * caller side should take care of it. */ private def getMapSizesForReduceId(shuffleId: Int, partitionId: Int): Array[Long] = { val mapOutputTracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster] - mapOutputTracker.shuffleStatuses(shuffleId).mapStatuses.map{_.getSizeForBlock(partitionId)} +mapOutputTracker.shuffleStatuses(shuffleId).withMapStatuses(_.map { stat => + if (stat == null) -1 else stat.getSizeForBlock(partitionId) +}) } /** @@ -378,6 +382,7 @@ object ShufflePartitionsUtil extends Logging { reducerId: Int, targetSize: Long): Option[Seq[PartialReducerPartitionSpec]] = { val mapPartitionSizes = getMapSizesForReduceId(shuffleId, reducerId) +if (mapPartitionSizes.exists(_ < 0)) return None val mapStartIndices = splitSizeListByTargetSize(mapPartitionSizes, targetSize) if (mapStartIndices.length > 1) { Some(mapStartIndices.indices.map { i => diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala index a38caa7..9f70c8a 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsUtilSuite.scala @@ -17,10 +17,12 @@ package org.apache.spark.sql.execution -import org.apache.spark.{MapOutputStatistics, SparkFunSuite} +import org.apache.spark.{LocalSparkContext, MapOutputStatistics, MapOutputTrackerMaster, SparkConf, SparkContext, SparkEnv, SparkFunSuite} +import org.apache.spark.scheduler.MapStatus import org.apache.spark.sql.execution.adaptive.ShufflePartitionsUtil +import org.apache.spark.storage.BlockManagerId -class ShufflePartitionsUtilSuite extends SparkFunSuite { +class ShufflePartitionsUtilSuite 
extends SparkFunSuite with LocalSparkContext { private def checkEstimation( bytesByPartitionIdArray: Array[Array[Long]], @@ -765,4 +767,31 @@ class ShufflePartitionsUtilSuite extends SparkFunSuite { targetSize, 1, 0) assert(coalesced == Seq(expected1, expected2)) } + + test("SPARK-36228: Skip splitting a skewed partition when some map outputs are removed") { +sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) +val mapOutputTracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster] +mapOutputTracker.registerShuffle(shuffleId = 10, numMaps = 2, numReduces = 1) +mapOutputTracker.registerMapOutput(shuffleId = 10, mapIndex = 0, MapStatus( + BlockMa
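With the fix, a missing map status surfaces as a -1 block size from `getMapSizesForReduceId`, and skew handling returns `None` (skipping the split) instead of hitting a NullPointerException. The Python sketch below mirrors that guard; the greedy grouping in `split_by_target_size` is a deliberate simplification of Spark's `splitSizeListByTargetSize` heuristic, so treat the exact split points as illustrative:

```python
def split_by_target_size(sizes, target):
    # Greedily group consecutive map-output sizes so each group stays
    # roughly within `target` bytes; return each group's start index.
    starts, acc = [0], 0
    for i, size in enumerate(sizes):
        if i > 0 and acc > 0 and acc + size > target:
            starts.append(i)
            acc = 0
        acc += size
    return starts

def create_skew_partition_specs(map_sizes, target):
    # Return the split start indices for a skewed reduce partition, or
    # None when splitting is not possible/worthwhile. The first check is
    # the SPARK-36228 guard: a -1 size means the map output was removed
    # (e.g. executor lost), so we skip skew handling entirely.
    if any(s < 0 for s in map_sizes):
        return None
    starts = split_by_target_size(map_sizes, target)
    return starts if len(starts) > 1 else None
```

This matches the new unit test's setup: register two map outputs, remove one, and assert that no split specs are produced for the skewed partition.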
[spark] branch master updated (f56c7b7 -> 9c8a3d3)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from f56c7b7 [SPARK-36208][SQL] SparkScriptTransformation should support ANSI interval types add 9c8a3d3 [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed No new revisions were added by this update. Summary of changes: .../execution/adaptive/ShufflePartitionsUtil.scala | 9 -- .../sql/execution/ShufflePartitionsUtilSuite.scala | 33 -- 2 files changed, 38 insertions(+), 4 deletions(-)
[spark] branch master updated (94aece4 -> f56c7b7)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 94aece4 [SPARK-36020][SQL][FOLLOWUP] RemoveRedundantProjects should retain the LOGICAL_PLAN_TAG tag add f56c7b7 [SPARK-36208][SQL] SparkScriptTransformation should support ANSI interval types No new revisions were added by this update. Summary of changes: .../execution/BaseScriptTransformationExec.scala| 8 .../execution/BaseScriptTransformationSuite.scala | 21 + 2 files changed, 29 insertions(+)
[spark] branch branch-3.2 updated (64aee34 -> 06520b2)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git. from 64aee34 [SPARK-36153][SQL][DOCS] Update transform doc to match the current code add 06520b2 [SPARK-35658][DOCS] Document Parquet encryption feature in Spark SQL No new revisions were added by this update. Summary of changes: docs/sql-data-sources-parquet.md | 65 1 file changed, 65 insertions(+)
[spark] branch branch-3.2 updated (b5c0f6c -> 64aee34)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git. from b5c0f6c [SPARK-36020][SQL][FOLLOWUP] RemoveRedundantProjects should retain the LOGICAL_PLAN_TAG tag add 64aee34 [SPARK-36153][SQL][DOCS] Update transform doc to match the current code No new revisions were added by this update. Summary of changes: docs/sql-ref-syntax-qry-select-transform.md | 101 1 file changed, 88 insertions(+), 13 deletions(-)