date:20211214

[spark] branch master updated: [SPARK-37575][SQL][FOLLOWUP] Update migration guide for null values saving in CSV data source

2021-12-14 Thread gurwls223

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8630652  [SPARK-37575][SQL][FOLLOWUP] Update migration guide for null 
values saving in CSV data source
8630652 is described below

commit 8630652f92dc02d5c27f64f81a7803d083817d4e
Author: itholic 
AuthorDate: Wed Dec 15 15:58:24 2021 +0900

[SPARK-37575][SQL][FOLLOWUP] Update migration guide for null values saving 
in CSV data source

### What changes were proposed in this pull request?

This is a follow-up to https://github.com/apache/spark/pull/34853 that 
also documents the behavior change in the migration guide.

See also https://github.com/apache/spark/pull/34853#issuecomment-994266814

### Why are the changes needed?

We should mention the behavior change in the migration guide, even though 
it is a bug fix.

### Does this PR introduce _any_ user-facing change?

The explanation is added to the migration guide as below:

![Screen Shot 2021-12-15 at 2 54 10 
PM](https://user-images.githubusercontent.com/44108233/146131228-fbe66ea9-c226-4fb8-88aa-3c944cf047ac.png)

### How was this patch tested?

Manually built docs

Closes #34905 from itholic/SPARK-37575-followup.

Authored-by: itholic 
Signed-off-by: Hyukjin Kwon 
---
 docs/sql-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index c15f55d..51f3bd3 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -52,6 +52,8 @@ license: |
 
   - Since Spark 3.3, the `strfmt` in `format_string(strfmt, obj, ...)` and 
`printf(strfmt, obj, ...)` will no longer support to use "0$" to specify the 
first argument, the first argument should always reference by "1$" when use 
argument index to indicating the position of the argument in the argument list.
 
+  - Since Spark 3.3, nulls are written as empty strings in CSV data source by 
default. In Spark 3.2 or earlier, nulls were written as empty strings as quoted 
empty strings, `""`. To restore the previous behavior, set `nullValue` to `""`.
+
 ## Upgrading from Spark SQL 3.1 to 3.2
 
   - Since Spark 3.2, ADD FILE/JAR/ARCHIVE commands require each path to be 
enclosed by `"` or `'` if the path contains whitespaces.

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

[spark] branch branch-3.2 updated: Revert "[SPARK-37575][SQL] null values should be saved as nothing rather than quoted empty Strings "" by default settings"

2021-12-14 Thread gurwls223

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 50af717  Revert "[SPARK-37575][SQL] null values should be saved as 
nothing rather than quoted empty Strings "" by default settings"
50af717 is described below

commit 50af71747be96b34f36b969e023237c7c5c0f31c
Author: Hyukjin Kwon 
AuthorDate: Wed Dec 15 12:54:28 2021 +0900

Revert "[SPARK-37575][SQL] null values should be saved as nothing rather 
than quoted empty Strings "" by default settings"

This reverts commit 62e4202b65d76b05f9f9a15819a631524c6e7985.
---
 .../apache/spark/sql/catalyst/csv/UnivocityGenerator.scala  |  2 ++
 .../spark/sql/execution/datasources/csv/CSVSuite.scala  | 13 +
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
index 8504877..2abf7bf 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
@@ -84,6 +84,8 @@ class UnivocityGenerator(
 while (i < row.numFields) {
   if (!row.isNullAt(i)) {
 values(i) = valueConverters(i).apply(row, i)
+  } else {
+values(i) = options.nullValue
   }
   i += 1
 }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index a472221..7efdf7c 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -804,17 +804,6 @@ abstract class CSVSuite
 }
   }
 
-  test("SPARK-37575: null values should be saved as nothing rather than " +
-"quoted empty Strings \"\" with default settings") {
-withTempPath { path =>
-  Seq(("Tesla", null: String, ""))
-.toDF("make", "comment", "blank")
-.write
-.csv(path.getCanonicalPath)
-  checkAnswer(spark.read.text(path.getCanonicalPath), Row("Tesla,,\"\""))
-}
-  }
-
   test("save csv with compression codec option") {
 withTempDir { dir =>
   val csvDir = new File(dir, "csv").getCanonicalPath
@@ -1585,7 +1574,7 @@ abstract class CSVSuite
 (1, "John Doe"),
 (2, "-"),
 (3, "-"),
-(4, null)
+(4, "-")
   ).toDF("id", "name")
 
   checkAnswer(computed, expected)


[spark] branch master updated (c75186c -> e21e0ec)

2021-12-14 Thread gurwls223

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c75186c  [SPARK-37563][PYTHON] Implement days, seconds, microseconds 
properties of TimedeltaIndex
 add e21e0ec  [SPARK-37646][SQL] Avoid touching Scala reflection APIs in 
the lit function

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/sql/functions.scala   | 15 ++-
 .../org/apache/spark/sql/ColumnExpressionSuite.scala  | 12 
 2 files changed, 26 insertions(+), 1 deletion(-)


[spark] branch master updated: [SPARK-37563][PYTHON] Implement days, seconds, microseconds properties of TimedeltaIndex

2021-12-14 Thread gurwls223

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c75186c  [SPARK-37563][PYTHON] Implement days, seconds, microseconds 
properties of TimedeltaIndex
c75186c is described below

commit c75186cd111b91d13e32159169334d562bdeb767
Author: Xinrong Meng 
AuthorDate: Wed Dec 15 11:47:42 2021 +0900

[SPARK-37563][PYTHON] Implement days, seconds, microseconds properties of 
TimedeltaIndex

### What changes were proposed in this pull request?
Implement days, seconds, microseconds properties of TimedeltaIndex

### Why are the changes needed?
To be consistent with pandas.

### Does this PR introduce _any_ user-facing change?
Yes.
```py

>>> from datetime import timedelta
>>> import pyspark.pandas as ps

# Positive timedelta
>>> psidx = ps.TimedeltaIndex(
...     [
...         timedelta(days=1),
...         timedelta(seconds=1),
...         timedelta(microseconds=1),
...         timedelta(milliseconds=1),
...         timedelta(minutes=1),
...         timedelta(hours=1),
...         timedelta(weeks=1),
...     ],
...     name="x",
... )

>>> psidx.days
Int64Index([1, 0, 0, 0, 0, 0, 7], dtype='int64', name='x')
>>> psidx.seconds
Int64Index([0, 1, 0, 0, 60, 3600, 0], dtype='int64', name='x')
>>> psidx.microseconds
Int64Index([0, 0, 1, 1000, 0, 0, 0], dtype='int64', name='x')

# Negative timedelta
>>> psidx = ps.TimedeltaIndex(
...     [
...         timedelta(days=-1),
...         timedelta(seconds=-1),
...         timedelta(microseconds=-1),
...         timedelta(milliseconds=-1),
...         timedelta(minutes=-1),
...         timedelta(hours=-1),
...         timedelta(weeks=-1),
...     ],
...     name="x",
... )
>>> psidx.days
Int64Index([-1, -1, -1, -1, -1, -1, -7], dtype='int64', name='x')
>>> psidx.seconds
Int64Index([0, 86399, 86399, 86399, 86340, 82800, 0], dtype='int64', name='x')
>>> psidx.microseconds
Int64Index([0, 0, 999999, 999000, 0, 0, 0], dtype='int64', name='x')
```
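
The component values for negative timedeltas shown above can be checked with the standard library alone. This is a minimal sketch that assumes only Python's `datetime.timedelta`; the helper name `components` is hypothetical and is not part of pandas API on Spark.

```python
from datetime import timedelta
from typing import List, Tuple


def components(deltas: List[timedelta]) -> Tuple[List[int], List[int], List[int]]:
    """Return the days, seconds, and microseconds of each timedelta as three lists."""
    return (
        [d.days for d in deltas],
        [d.seconds for d in deltas],
        [d.microseconds for d in deltas],
    )


negative = [
    timedelta(days=-1),
    timedelta(seconds=-1),
    timedelta(microseconds=-1),
    timedelta(milliseconds=-1),
    timedelta(minutes=-1),
    timedelta(hours=-1),
    timedelta(weeks=-1),
]

days, seconds, microseconds = components(negative)
print(days)
print(seconds)
print(microseconds)
```

A negative timedelta is normalized to a negative day count plus non-negative seconds and microseconds, which is why `timedelta(seconds=-1)` reports `days == -1` and `seconds == 86399`.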
### How was this patch tested?
Unit tests.

Closes #34825 from xinrong-databricks/timedeltaProperties.

Authored-by: Xinrong Meng 
Signed-off-by: Hyukjin Kwon 
---
 dev/sparktestsupport/modules.py|  2 +
 .../source/reference/pyspark.pandas/indexing.rst   | 23 --
 python/pyspark/pandas/indexes/timedelta.py | 76 
 python/pyspark/pandas/missing/indexes.py   |  3 -
 python/pyspark/pandas/spark/functions.py   | 11 +++
 .../pyspark/pandas/tests/indexes/test_timedelta.py | 84 ++
 6 files changed, 189 insertions(+), 10 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 5dd3ab6..297d2ea 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -614,6 +614,7 @@ pyspark_pandas = Module(
 "pyspark.pandas.indexes.base",
 "pyspark.pandas.indexes.category",
 "pyspark.pandas.indexes.datetimes",
+"pyspark.pandas.indexes.timedelta",
 "pyspark.pandas.indexes.multi",
 "pyspark.pandas.indexes.numeric",
 "pyspark.pandas.spark.accessors",
@@ -632,6 +633,7 @@ pyspark_pandas = Module(
 "pyspark.pandas.tests.data_type_ops.test_string_ops",
 "pyspark.pandas.tests.data_type_ops.test_udt_ops",
 "pyspark.pandas.tests.indexes.test_category",
+"pyspark.pandas.tests.indexes.test_timedelta",
 "pyspark.pandas.tests.plot.test_frame_plot",
 "pyspark.pandas.tests.plot.test_frame_plot_matplotlib",
 "pyspark.pandas.tests.plot.test_frame_plot_plotly",
diff --git a/python/docs/source/reference/pyspark.pandas/indexing.rst 
b/python/docs/source/reference/pyspark.pandas/indexing.rst
index 0c94012..15539fa 100644
--- a/python/docs/source/reference/pyspark.pandas/indexing.rst
+++ b/python/docs/source/reference/pyspark.pandas/indexing.rst
@@ -336,13 +336,6 @@ DatatimeIndex
 
DatetimeIndex
 
-TimedeltaIndex
---
-.. autosummary::
-   :toctree: api/
-
-   TimedeltaIndex
-
 Time/date components
 
 .. autosummary::
@@ -393,3 +386,19 @@ Time-specific operations
DatetimeIndex.ceil
DatetimeIndex.month_name
DatetimeIndex.day_name
+
+TimedeltaIndex
+--
+.. autosummary::
+   :toctree: api/
+
+   TimedeltaIndex
+
+Components
+~~
+.. autosummary::
+   :toctree: api/
+
+   TimedeltaIndex.days
+   TimedeltaIndex.seconds
+   TimedeltaIndex.microseconds
diff --git a/python/pyspark/pandas/indexes/timedelta.py 
b/python/pyspark/pand

[spark] branch master updated (c1d80bf -> 988381b)

2021-12-14 Thread gurwls223

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c1d80bf  [SPARK-37649][PYTHON] Switch default index to 
distributed-sequence by default in pandas API on Spark
 add 988381b  [SPARK-37638][PYTHON] Use existing active Spark session 
instead of SparkSession.getOrCreate in pandas API on Spark

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/utils.py | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)


[spark] branch master updated (0a6be8c -> c1d80bf)

2021-12-14 Thread gurwls223

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 0a6be8c  [SPARK-37629][SQL] Speed up Expression.canonicalized
 add c1d80bf  [SPARK-37649][PYTHON] Switch default index to 
distributed-sequence by default in pandas API on Spark

No new revisions were added by this update.

Summary of changes:
 .../user_guide/pandas_on_spark/best_practices.rst  |   2 +-
 .../source/user_guide/pandas_on_spark/options.rst  | 128 ++---
 python/pyspark/pandas/config.py|  15 +--
 3 files changed, 73 insertions(+), 72 deletions(-)


[spark] branch master updated (b31f7bf -> 0a6be8c)

2021-12-14 Thread gurwls223

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from b31f7bf  [SPARK-37146][PYTHON] Inline type hints for 
python/pyspark/__init__.py
 add 0a6be8c  [SPARK-37629][SQL] Speed up Expression.canonicalized

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/unresolved.scala   |  2 +-
 .../sql/catalyst/expressions/Canonicalize.scala| 67 +++---
 .../spark/sql/catalyst/expressions/Cast.scala  |  9 +++
 .../sql/catalyst/expressions/DynamicPruning.scala  |  6 +-
 .../sql/catalyst/expressions/Expression.scala  | 35 +--
 .../spark/sql/catalyst/expressions/PythonUDF.scala |  4 +-
 .../spark/sql/catalyst/expressions/ScalaUDF.scala  |  4 +-
 .../expressions/aggregate/interfaces.scala |  6 +-
 .../expressions/complexTypeExtractors.scala|  4 ++
 .../catalyst/expressions/decimalExpressions.scala  |  2 +-
 .../expressions/higherOrderFunctions.scala |  6 +-
 .../catalyst/expressions/namedExpressions.scala|  4 ++
 .../sql/catalyst/expressions/predicates.scala  | 34 +++
 .../spark/sql/catalyst/expressions/subquery.scala  | 26 -
 .../catalyst/expressions/ExpressionSetSuite.scala  | 40 ++---
 .../org/apache/spark/sql/execution/subquery.scala  |  6 +-
 16 files changed, 136 insertions(+), 119 deletions(-)


[spark] branch master updated: [SPARK-37146][PYTHON] Inline type hints for python/pyspark/__init__.py

2021-12-14 Thread zero323

This is an automated email from the ASF dual-hosted git repository.

zero323 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b31f7bf  [SPARK-37146][PYTHON] Inline type hints for 
python/pyspark/__init__.py
b31f7bf is described below

commit b31f7bf4c102368b63b08cfe3d36cc1d8d128738
Author: dch nguyen 
AuthorDate: Tue Dec 14 23:10:02 2021 +0100

[SPARK-37146][PYTHON] Inline type hints for python/pyspark/__init__.py

### What changes were proposed in this pull request?
Inline type hints for python/pyspark/\_\_init\_\_.py

### Why are the changes needed?
We can take advantage of static type checking within the functions by 
inlining the type hints.
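
As a rough illustration of what inlined hints provide, the sketch below types a decorator so that a type checker still sees the decorated function's own signature. It follows the `TypeVar` bound to `Callable` plus `cast(F, ...)` pattern used in this patch; the decorator `log_calls`, the `calls` list, and the function `add` are hypothetical names used only for this example.

```python
from functools import wraps
from typing import Any, Callable, List, TypeVar, cast

# F stands for "whatever callable type was passed in".
F = TypeVar("F", bound=Callable[..., Any])

calls: List[str] = []  # hypothetical call log, only for this example


def log_calls(func: F) -> F:
    """Record the name of each call while keeping func's type for checkers."""

    @wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        calls.append(func.__name__)
        return func(*args, **kwargs)

    # wrapper is inferred as a generic callable; cast restores func's type.
    return cast(F, wrapper)


@log_calls
def add(x: int, y: int) -> int:
    return x + y


result = add(1, 2)
print(result, calls)
```

With the hints inline, a checker such as mypy can also verify the body of `log_calls` itself, which a separate `.pyi` stub file does not cover.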

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #34433 from dchvn/SPARK-37146.

Authored-by: dch nguyen 
Signed-off-by: zero323 
---
 python/pyspark/__init__.py  | 33 +-
 python/pyspark/__init__.pyi | 77 -
 python/pyspark/sql/conf.py  |  5 +--
 python/pyspark/sql/context.py   |  3 +-
 python/pyspark/sql/dataframe.py | 11 --
 5 files changed, 38 insertions(+), 91 deletions(-)

diff --git a/python/pyspark/__init__.py b/python/pyspark/__init__.py
index aab95ad..e82817f 100644
--- a/python/pyspark/__init__.py
+++ b/python/pyspark/__init__.py
@@ -48,6 +48,7 @@ Public classes:
 
 from functools import wraps
 import types
+from typing import cast, Any, Callable, Optional, TypeVar, Union
 
 from pyspark.conf import SparkConf
 from pyspark.rdd import RDD, RDDBarrier
@@ -63,8 +64,11 @@ from pyspark.profiler import Profiler, BasicProfiler
 from pyspark.version import __version__
 from pyspark._globals import _NoValue  # noqa: F401
 
+T = TypeVar("T")
+F = TypeVar("F", bound=Callable)
 
-def since(version):
+
+def since(version: Union[str, float]) -> Callable[[F], F]:
 """
 A decorator that annotates a function to append the version of Spark the 
function was added.
 """
@@ -72,7 +76,9 @@ def since(version):
 
 indent_p = re.compile(r"\n( +)")
 
-def deco(f):
+def deco(f: F) -> F:
+assert f.__doc__ is not None
+
 indents = indent_p.findall(f.__doc__)
 indent = " " * (min(len(m) for m in indents) if indents else 0)
 f.__doc__ = f.__doc__.rstrip() + "\n\n%s.. versionadded:: %s" % 
(indent, version)
@@ -81,15 +87,26 @@ def since(version):
 return deco
 
 
-def copy_func(f, name=None, sinceversion=None, doc=None):
+def copy_func(
+f: F,
+name: Optional[str] = None,
+sinceversion: Optional[Union[str, float]] = None,
+doc: Optional[str] = None,
+) -> F:
 """
 Returns a function with same code, globals, defaults, closure, and
 name (or provide a new name).
 """
 # See
 # 
http://stackoverflow.com/questions/6527633/how-can-i-make-a-deepcopy-of-a-function-in-python
+assert isinstance(f, types.FunctionType)
+
 fn = types.FunctionType(
-f.__code__, f.__globals__, name or f.__name__, f.__defaults__, 
f.__closure__
+f.__code__,
+f.__globals__,
+name or f.__name__,
+f.__defaults__,
+f.__closure__,
 )
 # in case f was given attrs (note this dict is a shallow copy):
 fn.__dict__.update(f.__dict__)
@@ -97,10 +114,10 @@ def copy_func(f, name=None, sinceversion=None, doc=None):
 fn.__doc__ = doc
 if sinceversion is not None:
 fn = since(sinceversion)(fn)
-return fn
+return cast(F, fn)
 
 
-def keyword_only(func):
+def keyword_only(func: F) -> F:
 """
 A decorator that forces keyword arguments in the wrapped method
 and saves actual input keyword arguments in `_input_kwargs`.
@@ -111,13 +128,13 @@ def keyword_only(func):
 """
 
 @wraps(func)
-def wrapper(self, *args, **kwargs):
+def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
 if len(args) > 0:
 raise TypeError("Method %s forces keyword arguments." % 
func.__name__)
 self._input_kwargs = kwargs
 return func(self, **kwargs)
 
-return wrapper
+return cast(F, wrapper)
 
 
 # To avoid circular dependencies
diff --git a/python/pyspark/__init__.pyi b/python/pyspark/__init__.pyi
deleted file mode 100644
index fb045f2..000
--- a/python/pyspark/__init__.pyi
+++ /dev/null
@@ -1,77 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements.  See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership.  The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License.  You may obtain a copy of the License at
-#
-#   http://www.apache.org/licenses/LICENSE-2

[spark] branch master updated (77b164a -> 51ee425)

2021-12-14 Thread dongjoon

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 77b164a  [SPARK-37592][SQL] Improve performance of `JoinSelection`
 add 51ee425  [SPARK-37645][K8S][TESTS] Rename a variable 
`labledConfigMaps` to `labeledConfigMaps`

No new revisions were added by this update.

Summary of changes:
 .../cluster/k8s/KubernetesClusterSchedulerBackendSuite.scala   | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)


[spark] branch branch-3.2 updated: [SPARK-37217][SQL][3.2] The number of dynamic partitions should early check when writing to external tables

2021-12-14 Thread sunchao

This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 9cd64f8  [SPARK-37217][SQL][3.2] The number of dynamic partitions 
should early check when writing to external tables
9cd64f8 is described below

commit 9cd64f8fed396d38a4d2f64559ad00c261dad47e
Author: sychen 
AuthorDate: Tue Dec 14 10:18:53 2021 -0800

[SPARK-37217][SQL][3.2] The number of dynamic partitions should early check 
when writing to external tables

### What changes were proposed in this pull request?
SPARK-29295 introduced a mechanism for overwriting external tables with 
dynamic partitions: the data in the target partitions is deleted first.

Assume that 1001 partitions are written. The data of those 1001 partitions 
is deleted first, but because `hive.exec.max.dynamic.partitions` is 1000 by 
default, loadDynamicPartitions then fails, after the data of the 1001 
partitions has already been deleted.

So we check whether the number of dynamic partitions is greater than 
`hive.exec.max.dynamic.partitions` before deleting anything, so that the job 
fails fast.
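
The fail-fast ordering described above can be sketched as follows. This is a simplified Python model, not Spark's Scala implementation: `overwrite_partitions`, `DynamicPartitionLimitError`, and the in-memory `storage` dict are hypothetical stand-ins. Only the config key name and its default of 1000 come from the description.

```python
from typing import Dict, List

MAX_DYNAMIC_PARTITIONS_KEY = "hive.exec.max.dynamic.partitions"
DEFAULT_MAX_DYNAMIC_PARTITIONS = 1000


class DynamicPartitionLimitError(Exception):
    """Hypothetical error for writing more dynamic partitions than allowed."""


def overwrite_partitions(
    storage: Dict[str, List[str]],
    written_parts: Dict[str, List[str]],
    max_dynamic_partitions: int = DEFAULT_MAX_DYNAMIC_PARTITIONS,
) -> None:
    """Overwrite partitions in storage, validating the limit before deleting."""
    num_written = len(written_parts)
    # Validate first: nothing destructive has happened yet.
    if num_written > max_dynamic_partitions:
        raise DynamicPartitionLimitError(
            f"Number of dynamic partitions created is {num_written}, "
            f"which is more than {max_dynamic_partitions}. "
            f"To solve this try to set {MAX_DYNAMIC_PARTITIONS_KEY} "
            f"to at least {num_written}."
        )
    # Only now replace the old partition data with the new data.
    for part, rows in written_parts.items():
        storage[part] = list(rows)


storage = {f"p={i}": ["old"] for i in range(1001)}
new_data = {f"p={i}": ["new"] for i in range(1001)}

try:
    overwrite_partitions(storage, new_data)
    failed = False
except DynamicPartitionLimitError as e:
    failed = True
    print(e)
```

Because the check runs before any partition is touched, a job that exceeds the limit leaves the existing data intact.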

### Why are the changes needed?
Avoid unrecoverable data loss when the job fails.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add UT

Closes #34889 from cxzl25/SPARK-37217-3.2.

Authored-by: sychen 
Signed-off-by: Chao Sun 
---
 .../spark/sql/errors/QueryExecutionErrors.scala| 11 +
 .../sql/hive/execution/InsertIntoHiveTable.scala   |  9 +++
 .../spark/sql/hive/execution/SQLQuerySuite.scala   | 28 ++
 3 files changed, 48 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
index 7f77243..d4fbd38 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
@@ -1804,4 +1804,15 @@ object QueryExecutionErrors {
   def pivotNotAfterGroupByUnsupportedError(): Throwable = {
 new UnsupportedOperationException("pivot is only supported after a 
groupBy")
   }
+
+  def writePartitionExceedConfigSizeWhenDynamicPartitionError(
+  numWrittenParts: Int,
+  maxDynamicPartitions: Int,
+  maxDynamicPartitionsKey: String): Throwable = {
+new SparkException(
+  s"Number of dynamic partitions created is $numWrittenParts" +
+s", which is more than $maxDynamicPartitions" +
+s". To solve this try to set $maxDynamicPartitionsKey" +
+s" to at least $numWrittenParts.")
+  }
 }
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
index 108401c..4a678f7 100644
--- 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
+++ 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
@@ -21,6 +21,7 @@ import java.util.Locale
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.conf.HiveConf
 import org.apache.hadoop.hive.ql.ErrorMsg
 import org.apache.hadoop.hive.ql.plan.TableDesc
 
@@ -212,6 +213,14 @@ case class InsertIntoHiveTable(
 if (partition.nonEmpty) {
   if (numDynamicPartitions > 0) {
 if (overwrite && table.tableType == CatalogTableType.EXTERNAL) {
+  val numWrittenParts = writtenParts.size
+  val maxDynamicPartitionsKey = 
HiveConf.ConfVars.DYNAMICPARTITIONMAXPARTS.varname
+  val maxDynamicPartitions = hadoopConf.getInt(maxDynamicPartitionsKey,
+HiveConf.ConfVars.DYNAMICPARTITIONMAXPARTS.defaultIntVal)
+  if (numWrittenParts > maxDynamicPartitions) {
+throw 
QueryExecutionErrors.writePartitionExceedConfigSizeWhenDynamicPartitionError(
+  numWrittenParts, maxDynamicPartitions, maxDynamicPartitionsKey)
+  }
   // SPARK-29295: When insert overwrite to a Hive external table 
partition, if the
   // partition does not exist, Hive will not check if the external 
partition directory
   // exists or not before copying files. So if users drop the 
partition, and then do
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
index 8d248bb..ba362d9 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.sca

[spark] branch master updated (d270d40 -> 77b164a)

2021-12-14 Thread wenchen

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from d270d40  [SPARK-37635][SQL] SHOW TBLPROPERTIES should print the fully 
qualified table name
 add 77b164a  [SPARK-37592][SQL] Improve performance of `JoinSelection`

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/plans/logical/hints.scala   |  2 ++
 .../org/apache/spark/sql/execution/Columnar.scala  |  6 ++
 .../spark/sql/execution/SparkStrategies.scala  | 25 ++
 3 files changed, 20 insertions(+), 13 deletions(-)


[spark] branch master updated (ae8940c -> d270d40)

2021-12-14 Thread sarutak

This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from ae8940c  [SPARK-37310][SQL] Migrate ALTER NAMESPACE ... SET PROPERTIES 
to use V2 command by default
 add d270d40  [SPARK-37635][SQL] SHOW TBLPROPERTIES should print the fully 
qualified table name

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala  | 2 +-
 .../spark/sql/execution/datasources/v2/ShowTablePropertiesExec.scala   | 3 ++-
 .../src/test/resources/sql-tests/results/show-tblproperties.sql.out| 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)


[spark] branch master updated (6a59fba -> ae8940c)

2021-12-14 Thread sarutak

This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 6a59fba  [SPARK-37575][SQL] null values should be saved as nothing 
rather than quoted empty Strings "" by default settings
 add ae8940c  [SPARK-37310][SQL] Migrate ALTER NAMESPACE ... SET PROPERTIES 
to use V2 command by default

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


[spark] branch branch-3.2 updated: [SPARK-37575][SQL] null values should be saved as nothing rather than quoted empty Strings "" by default settings

2021-12-14 Thread maxgekk

This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 62e4202  [SPARK-37575][SQL] null values should be saved as nothing 
rather than quoted empty Strings "" by default settings
62e4202 is described below

commit 62e4202b65d76b05f9f9a15819a631524c6e7985
Author: wayneguow 
AuthorDate: Tue Dec 14 11:26:34 2021 +0300

[SPARK-37575][SQL] null values should be saved as nothing rather than 
quoted empty Strings "" by default settings

### What changes were proposed in this pull request?
Fix the bug that, since Spark 2.4, null values are saved as quoted empty 
strings "" (the same as empty strings) rather than as nothing under the 
default CSV settings.

### Why are the changes needed?

This is an unexpected bug. Unless it is fixed, we cannot distinguish null 
values from empty strings in saved CSV files.

As mentioned in the [Spark SQL migration guide](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24) 
(2.3 => 2.4), since Spark 2.4 empty strings are saved as quoted empty 
strings "" and null values are saved as nothing.

> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
version 2.3 and earlier, empty strings are equal to null values and do not 
reflect to any characters in saved CSV files. For example, the row of "a", 
null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
empty (not quoted) string.

In practice, however, we found that null values are also saved as quoted 
empty strings "", the same as empty strings.

For the following code:
```scala
Seq(("Tesla", null.asInstanceOf[String], ""))
  .toDF("make", "comment", "blank")
  .coalesce(1)
  .write.csv(path)
```

actual results:
>Tesla,"",""

expected results:
>Tesla,,""

### Does this PR introduce _any_ user-facing change?

Yes. With this fix, null values are written as nothing rather than as quoted 
empty strings "".

Users can still set nullValue to "\\"\\"" (the same as emptyValueInWrite's 
default value) to restore the behavior that has been in place since 2.4.
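
The difference between the two outputs can be modeled with a small formatter. This is a sketch of the described semantics, not Spark's CSV writer: `format_csv_row` and its parameters `null_value` and `empty_value` are hypothetical names chosen to echo the `nullValue` and `emptyValue` options, and quoting of fields that contain delimiters is deliberately ignored.

```python
from typing import Optional, Sequence


def format_csv_row(
    values: Sequence[Optional[str]],
    null_value: str = "",
    empty_value: str = '""',
) -> str:
    """Render one row, writing None and "" with separate placeholders."""
    fields = []
    for value in values:
        if value is None:
            fields.append(null_value)   # null: nothing by default
        elif value == "":
            fields.append(empty_value)  # empty string: quoted by default
        else:
            fields.append(value)
    return ",".join(fields)


row = ("Tesla", None, "")
fixed = format_csv_row(row)                     # behavior after the fix
legacy = format_csv_row(row, null_value='""')   # previous behavior restored
print(fixed)
print(legacy)
```

With the defaults, the null and the empty string stay distinguishable in the output; passing a quoted empty string as the null placeholder reproduces the earlier output in which both look the same.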

### How was this patch tested?

Adding a test case.

Closes #34853 from wayneguow/SPARK-37575.

Lead-authored-by: wayneguow 
Co-authored-by: Wayne Guo 
Signed-off-by: Max Gekk 
(cherry picked from commit 6a59fba248359fb2614837fe8781dc63ac8fdc4c)
Signed-off-by: Max Gekk 
---
 .../apache/spark/sql/catalyst/csv/UnivocityGenerator.scala  |  2 --
 .../spark/sql/execution/datasources/csv/CSVSuite.scala  | 13 -
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
index 2abf7bf..8504877 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
@@ -84,8 +84,6 @@ class UnivocityGenerator(
 while (i < row.numFields) {
   if (!row.isNullAt(i)) {
 values(i) = valueConverters(i).apply(row, i)
-  } else {
-values(i) = options.nullValue
   }
   i += 1
 }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index 7efdf7c..a472221 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -804,6 +804,17 @@ abstract class CSVSuite
 }
   }
 
+  test("SPARK-37575: null values should be saved as nothing rather than " +
+"quoted empty Strings \"\" with default settings") {
+withTempPath { path =>
+  Seq(("Tesla", null: String, ""))
+.toDF("make", "comment", "blank")
+.write
+.csv(path.getCanonicalPath)
+  checkAnswer(spark.read.text(path.getCanonicalPath), Row("Tesla,,\"\""))
+}
+  }
+
   test("save csv with compression codec option") {
 withTempDir { dir =>
   val csvDir = new File(dir, "csv").getCanonicalPath
@@ -1574,7 +1585,7 @@ abstract class CSVSuite
 (1, "John Doe"),
 (2, "-"),
 (3, "-"),
-(4, "-")
+(4, null)
   ).toDF("id", "name")
 
   checkAnswer(computed, expected)


[spark] branch master updated: [SPARK-37575][SQL] null values should be saved as nothing rather than quoted empty Strings "" by default settings

2021-12-14 Thread maxgekk

This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6a59fba  [SPARK-37575][SQL] null values should be saved as nothing 
rather than quoted empty Strings "" by default settings
6a59fba is described below

commit 6a59fba248359fb2614837fe8781dc63ac8fdc4c
Author: wayneguow 
AuthorDate: Tue Dec 14 11:26:34 2021 +0300

[SPARK-37575][SQL] null values should be saved as nothing rather than 
quoted empty Strings "" by default settings

### What changes were proposed in this pull request?
Fix the bug where, since Spark 2.4, null values are saved as quoted empty 
strings "" (the same as empty strings) rather than as nothing under the 
default CSV settings.

### Why are the changes needed?

This is an unexpected bug. If it is not fixed, we still cannot distinguish 
null values from empty strings in saved CSV files.

As mentioned in the [Spark SQL migration 
guide](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24) (2.3 => 2.4),
since Spark 2.4 empty strings are saved as quoted empty strings "" and null 
values are saved as nothing.

> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
version 2.3 and earlier, empty strings are equal to null values and do not 
reflect to any characters in saved CSV files. For example, the row of "a", 
null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
empty (not quoted) string.

In practice, however, we found that null values are also saved as quoted 
empty strings "", the same as empty strings.

For the following code:
```scala
Seq(("Tesla", null.asInstanceOf[String], ""))
  .toDF("make", "comment", "blank")
  .coalesce(1)
  .write.csv(path)
```

actual results:
>Tesla,"",""

expected results:
>Tesla,,""

### Does this PR introduce _any_ user-facing change?

Yes. With this fix, null values are written as nothing rather than as 
quoted empty strings "".

However, users can set nullValue to "\\"\\"" (the same as emptyValueInWrite's 
default value) to restore the behavior that has existed since 2.4.
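
The behavior described above can be illustrated with a simplified, Spark-free model. This is only a sketch, not Spark's actual implementation: `CsvNullSketch`, `Settings`, `render`, `writeBefore` and `writeAfter` are hypothetical names, and the model assumes that the underlying CSV writer substitutes `nullValue` for null fields and `emptyValue` for empty strings.

```scala
// Simplified model of CSV field rendering; NOT Spark's actual code.
object CsvNullSketch {
  // Hypothetical stand-in for the writer options discussed above.
  final case class Settings(nullValue: String = "", emptyValue: String = "\"\"")

  // Assumed writer behaviour: null -> nullValue, empty string -> emptyValue.
  def render(field: String, s: Settings): String =
    if (field == null) s.nullValue
    else if (field.isEmpty) s.emptyValue
    else field

  // Before the fix: nulls were replaced by nullValue before reaching the
  // writer, so the default "" was then handled like an empty string.
  def writeBefore(row: Seq[String], s: Settings): String =
    row.map(f => if (f == null) s.nullValue else f).map(render(_, s)).mkString(",")

  // After the fix: nulls reach the writer unchanged.
  def writeAfter(row: Seq[String], s: Settings): String =
    row.map(render(_, s)).mkString(",")

  def main(args: Array[String]): Unit = {
    val row = Seq("Tesla", null, "")
    println(writeBefore(row, Settings()))                  // Tesla,"",""
    println(writeAfter(row, Settings()))                   // Tesla,,""
    println(writeAfter(row, Settings(nullValue = "\"\""))) // Tesla,"",""
  }
}
```

In this model, passing `nullValue = "\"\""` after the fix reproduces the pre-fix output, which mirrors the workaround of setting the `nullValue` CSV option on the writer.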

### How was this patch tested?

Added a test case.

Closes #34853 from wayneguow/SPARK-37575.

Lead-authored-by: wayneguow 
Co-authored-by: Wayne Guo 
Signed-off-by: Max Gekk 
---
 .../apache/spark/sql/catalyst/csv/UnivocityGenerator.scala  |  2 --
 .../spark/sql/execution/datasources/csv/CSVSuite.scala  | 13 ++++++++++++-
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
index 10cccd5..9d65824 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
@@ -94,8 +94,6 @@ class UnivocityGenerator(
     while (i < row.numFields) {
       if (!row.isNullAt(i)) {
         values(i) = valueConverters(i).apply(row, i)
-      } else {
-        values(i) = options.nullValue
       }
       i += 1
     }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index 8c8079f..c7328d9 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -805,6 +805,17 @@ abstract class CSVSuite
 }
   }
 
+  test("SPARK-37575: null values should be saved as nothing rather than " +
+    "quoted empty Strings \"\" with default settings") {
+    withTempPath { path =>
+      Seq(("Tesla", null: String, ""))
+        .toDF("make", "comment", "blank")
+        .write
+        .csv(path.getCanonicalPath)
+      checkAnswer(spark.read.text(path.getCanonicalPath), Row("Tesla,,\"\""))
+    }
+  }
+
   test("save csv with compression codec option") {
     withTempDir { dir =>
       val csvDir = new File(dir, "csv").getCanonicalPath
@@ -1769,7 +1780,7 @@ abstract class CSVSuite
 (1, "John Doe"),
 (2, "-"),
 (3, "-"),
-(4, "-")
+(4, null)
   ).toDF("id", "name")
 
   checkAnswer(computed, expected)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
