[GitHub] [spark-website] LuciferYang commented on pull request #462: Add Jie Yang to committers

2023-05-15 Thread via GitHub


LuciferYang commented on PR #462:
URL: https://github.com/apache/spark-website/pull/462#issuecomment-1548985598

   Thanks all ~





[spark-website] branch asf-site updated: Add Jie Yang to committers

2023-05-15 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new aa3fc86513 Add Jie Yang to committers
aa3fc86513 is described below

commit aa3fc86513f430f5aa9630ccf056a222c68886f1
Author: yangjie01 
AuthorDate: Tue May 16 12:46:59 2023 +0800

Add Jie Yang to committers



Author: yangjie01 

Closes #462 from LuciferYang/add-yangjie.
---
 committers.md| 1 +
 site/committers.html | 4 
 2 files changed, 5 insertions(+)

diff --git a/committers.md b/committers.md
index 827073a0d6..ce4ebefbdf 100644
--- a/committers.md
+++ b/committers.md
@@ -92,6 +92,7 @@ navigation:
 |Reynold Xin|Databricks|
 |Weichen Xu|Databricks|
 |Takeshi Yamamuro|NTT|
+|Jie Yang|Baidu|
 |Kent Yao|NetEase|
 |Burak Yavuz|Databricks|
 |Matei Zaharia|Databricks, Stanford|
diff --git a/site/committers.html b/site/committers.html
index 47a29b7ee2..c06c053ed1 100644
--- a/site/committers.html
+++ b/site/committers.html
@@ -460,6 +460,10 @@
   <td>Takeshi Yamamuro</td>
   <td>NTT</td>
 </tr>
+<tr>
+  <td>Jie Yang</td>
+  <td>Baidu</td>
+</tr>
 <tr>
   <td>Kent Yao</td>
   <td>NetEase</td>





[GitHub] [spark-website] LuciferYang closed pull request #462: Add Jie Yang to committers

2023-05-15 Thread via GitHub


LuciferYang closed pull request #462: Add Jie Yang to committers
URL: https://github.com/apache/spark-website/pull/462





[spark] branch master updated: [SPARK-43502][PYTHON][CONNECT] `DataFrame.drop` should accept empty column

2023-05-15 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0df4c01b7c4 [SPARK-43502][PYTHON][CONNECT] `DataFrame.drop` should accept empty column
0df4c01b7c4 is described below

commit 0df4c01b7c4d4476fe0de9dccb3425cc1295fc85
Author: Ruifeng Zheng 
AuthorDate: Tue May 16 12:38:08 2023 +0800

[SPARK-43502][PYTHON][CONNECT] `DataFrame.drop` should accept empty column

### What changes were proposed in this pull request?
Make `DataFrame.drop` accept an empty list of columns.

### Why are the changes needed?
To be consistent with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
yes

```
In [1]: df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

In [2]: df.drop()
```

Before:
```
In [2]: df.drop()
---
PySparkValueError Traceback (most recent call last)
Cell In[2], line 1
> 1 df.drop()

File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:449, in 
DataFrame.drop(self, *cols)
444 raise PySparkTypeError(
445 error_class="NOT_COLUMN_OR_STR",
446 message_parameters={"arg_name": "cols", "arg_type": 
type(cols).__name__},
447 )
448 if len(_cols) == 0:
--> 449 raise PySparkValueError(
450 error_class="CANNOT_BE_EMPTY",
451 message_parameters={"item": "cols"},
452 )
454 return DataFrame.withPlan(
455 plan.Drop(
456 child=self._plan,
   (...)
459 session=self._session,
460 )

PySparkValueError: [CANNOT_BE_EMPTY] At least one cols must be specified.
```

After:
```
In [2]: df.drop()
Out[2]: DataFrame[id: bigint, age: bigint]
```
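
A quick way to see the new parity (a hypothetical check, not one of the patch's tests, assuming an active `spark` session as in the shell transcript above): calling `drop()` with no arguments is now a no-op on both vanilla PySpark and Spark Connect.

```
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
# drop() with an empty column list leaves the DataFrame unchanged
assert df.drop().schema == df.schema
assert df.drop().count() == df.count()
```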

### How was this patch tested?
Enabled the previously skipped unit test.

Closes #41180 from zhengruifeng/connect_drop_empty_col.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/dataframe.py   | 5 -
 python/pyspark/sql/connect/plan.py| 3 ++-
 python/pyspark/sql/tests/connect/test_parity_dataframe.py | 5 -
 3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index ffd52cf0cec..7a5ba50b3c6 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -445,11 +445,6 @@ class DataFrame:
 error_class="NOT_COLUMN_OR_STR",
 message_parameters={"arg_name": "cols", "arg_type": 
type(cols).__name__},
 )
-if len(_cols) == 0:
-raise PySparkValueError(
-error_class="CANNOT_BE_EMPTY",
-message_parameters={"item": "cols"},
-)
 
 return DataFrame.withPlan(
 plan.Drop(
diff --git a/python/pyspark/sql/connect/plan.py 
b/python/pyspark/sql/connect/plan.py
index 03aca4896be..eb4765cbd4b 100644
--- a/python/pyspark/sql/connect/plan.py
+++ b/python/pyspark/sql/connect/plan.py
@@ -664,7 +664,8 @@ class Drop(LogicalPlan):
 columns: List[Union[Column, str]],
 ) -> None:
 super().__init__(child)
-assert len(columns) > 0 and all(isinstance(c, (Column, str)) for c in 
columns)
+if len(columns) > 0:
+assert all(isinstance(c, (Column, str)) for c in columns)
 self._columns = columns
 
 def plan(self, session: "SparkConnectClient") -> proto.Relation:
diff --git a/python/pyspark/sql/tests/connect/test_parity_dataframe.py 
b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
index 34f63c1410e..a74afc4d504 100644
--- a/python/pyspark/sql/tests/connect/test_parity_dataframe.py
+++ b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
@@ -84,11 +84,6 @@ class DataFrameParityTests(DataFrameTestsMixin, 
ReusedConnectTestCase):
 def test_to_pandas_from_mixed_dataframe(self):
 self.check_to_pandas_from_mixed_dataframe()
 
-# TODO(SPARK-43502): DataFrame.drop should support empty column
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_drop_empty_column(self):
-super().test_drop_empty_column()
-
 
 if __name__ == "__main__":
 import unittest





[GitHub] [spark-website] Hisoka-X commented on pull request #462: Add Jie Yang to committers

2023-05-15 Thread via GitHub


Hisoka-X commented on PR #462:
URL: https://github.com/apache/spark-website/pull/462#issuecomment-1548923570

   Congrats!





[spark] branch master updated (23c072d2a0e -> 4bf979c969e)

2023-05-15 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 23c072d2a0e [SPARK-43517][PYTHON][DOCS] Add a migration guide for 
namedtuple monkey patch
 add 4bf979c969e [SPARK-43482][SS] Expand QueryTerminatedEvent to contain 
error class if it exists in exception

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/streaming/listener.py| 17 +
 .../sql/tests/streaming/test_streaming_listener.py  |  3 ++-
 .../spark/sql/execution/streaming/StreamExecution.scala | 11 +--
 .../spark/sql/streaming/StreamingQueryListener.scala| 13 -
 .../sql/streaming/StreamingQueryListenerSuite.scala | 10 --
 .../ui/StreamingQueryStatusListenerSuite.scala  | 10 +-
 6 files changed, 53 insertions(+), 11 deletions(-)





[spark] branch master updated: [SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey patch

2023-05-15 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 23c072d2a0e [SPARK-43517][PYTHON][DOCS] Add a migration guide for 
namedtuple monkey patch
23c072d2a0e is described below

commit 23c072d2a0ef046f45893d9a13f5788e6ec09ea5
Author: Hyukjin Kwon 
AuthorDate: Tue May 16 11:16:27 2023 +0900

[SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey 
patch

### What changes were proposed in this pull request?

This PR proposes to add a migration guide for 
https://github.com/apache/spark/pull/38700.

### Why are the changes needed?

To guide users about the workaround of bringing the namedtuple patch back.
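
For readers looking for the workaround itself, here is a minimal sketch (an illustration added here, not part of the patch) of restoring the legacy behavior via the environment variable documented below, assuming the flag is read when PySpark starts:

```
import os

# Set the flag before PySpark is imported/started so the legacy
# collections.namedtuple monkey patch is re-enabled.
os.environ["PYSPARK_ENABLE_NAMEDTUPLE_PATCH"] = "1"

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
```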

### Does this PR introduce _any_ user-facing change?

Yes, it adds the migration guides for end-users.

### How was this patch tested?

CI in this PR will test it out.

Closes #41177 from HyukjinKwon/update-migration-namedtuple.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/migration_guide/pyspark_upgrade.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst 
b/python/docs/source/migration_guide/pyspark_upgrade.rst
index d06475f9b36..7513d64ef6c 100644
--- a/python/docs/source/migration_guide/pyspark_upgrade.rst
+++ b/python/docs/source/migration_guide/pyspark_upgrade.rst
@@ -34,6 +34,7 @@ Upgrading from PySpark 3.3 to 3.4
 * In Spark 3.4, the ``DataFrame.__setitem__`` will make a copy and replace 
pre-existing arrays, which will NOT be over-written to follow pandas 1.4 
behaviors.
 * In Spark 3.4, the ``SparkSession.sql`` and the Pandas on Spark API ``sql`` 
have got new parameter ``args`` which provides binding of named parameters to 
their SQL literals.
 * In Spark 3.4, Pandas API on Spark follows for the pandas 2.0, and some APIs 
were deprecated or removed in Spark 3.4 according to the changes made in pandas 
2.0. Please refer to the [release notes of 
pandas](https://pandas.pydata.org/docs/dev/whatsnew/) for more details.
+* In Spark 3.4, the custom monkey-patch of ``collections.namedtuple`` was 
removed, and ``cloudpickle`` was used by default. To restore the previous 
behavior for any relevant pickling issue of ``collections.namedtuple``, set 
``PYSPARK_ENABLE_NAMEDTUPLE_PATCH`` environment variable to ``1``.
 
 
 Upgrading from PySpark 3.2 to 3.3





[spark] branch branch-3.4 updated: [SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey patch

2023-05-15 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 792680ea73a [SPARK-43517][PYTHON][DOCS] Add a migration guide for 
namedtuple monkey patch
792680ea73a is described below

commit 792680ea73a19e91d2a15d672bd27bc252bc1c91
Author: Hyukjin Kwon 
AuthorDate: Tue May 16 11:16:27 2023 +0900

[SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey 
patch

### What changes were proposed in this pull request?

This PR proposes to add a migration guide for 
https://github.com/apache/spark/pull/38700.

### Why are the changes needed?

To guide users about the workaround of bringing the namedtuple patch back.

### Does this PR introduce _any_ user-facing change?

Yes, it adds the migration guides for end-users.

### How was this patch tested?

CI in this PR will test it out.

Closes #41177 from HyukjinKwon/update-migration-namedtuple.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 23c072d2a0ef046f45893d9a13f5788e6ec09ea5)
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/migration_guide/pyspark_upgrade.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst 
b/python/docs/source/migration_guide/pyspark_upgrade.rst
index d06475f9b36..7513d64ef6c 100644
--- a/python/docs/source/migration_guide/pyspark_upgrade.rst
+++ b/python/docs/source/migration_guide/pyspark_upgrade.rst
@@ -34,6 +34,7 @@ Upgrading from PySpark 3.3 to 3.4
 * In Spark 3.4, the ``DataFrame.__setitem__`` will make a copy and replace 
pre-existing arrays, which will NOT be over-written to follow pandas 1.4 
behaviors.
 * In Spark 3.4, the ``SparkSession.sql`` and the Pandas on Spark API ``sql`` 
have got new parameter ``args`` which provides binding of named parameters to 
their SQL literals.
 * In Spark 3.4, Pandas API on Spark follows for the pandas 2.0, and some APIs 
were deprecated or removed in Spark 3.4 according to the changes made in pandas 
2.0. Please refer to the [release notes of 
pandas](https://pandas.pydata.org/docs/dev/whatsnew/) for more details.
+* In Spark 3.4, the custom monkey-patch of ``collections.namedtuple`` was 
removed, and ``cloudpickle`` was used by default. To restore the previous 
behavior for any relevant pickling issue of ``collections.namedtuple``, set 
``PYSPARK_ENABLE_NAMEDTUPLE_PATCH`` environment variable to ``1``.
 
 
 Upgrading from PySpark 3.2 to 3.3





[spark] branch master updated (d53ddbe00fe -> 6221995f67b)

2023-05-15 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from d53ddbe00fe [SPARK-43300][CORE] NonFateSharingCache wrapper for Guava 
Cache
 add 6221995f67b [SPARK-43473][PYTHON] Support struct type in 
createDataFrame from pandas DataFrame

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/session.py|   4 +-
 python/pyspark/sql/pandas/conversion.py  |   3 +-
 python/pyspark/sql/pandas/serializers.py | 203 +++
 python/pyspark/sql/tests/test_arrow.py   |  26 
 4 files changed, 154 insertions(+), 82 deletions(-)





[spark] branch master updated: [SPARK-43300][CORE] NonFateSharingCache wrapper for Guava Cache

2023-05-15 Thread joshrosen
This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d53ddbe00fe [SPARK-43300][CORE] NonFateSharingCache wrapper for Guava 
Cache
d53ddbe00fe is described below

commit d53ddbe00fe73a703f870b0297278f3870148fc4
Author: Ziqi Liu 
AuthorDate: Mon May 15 18:47:29 2023 -0700

[SPARK-43300][CORE] NonFateSharingCache wrapper for Guava Cache

### What changes were proposed in this pull request?

Create `NonFateSharingCache` to wrap a Guava cache with a `KeyLock` that synchronizes all requests for the same key, so they run individually and fail as if they had arrived one at a time.

Wrap the cache in `CodeGenerator` with `NonFateSharingCache` to protect it from unexpected cascade failures caused by cancellations from irrelevant queries that are loading the same key. Feel free to use this in other places where we use a Guava cache and don't want fate-sharing behavior.

Also, instead of implementing the Guava `Cache` and `LoadingCache` interfaces, I define a subset of them so that we can control at compile time which cache operations are allowed and make sure all cache-loading actions go through our narrow-waist code path with the key lock. Feel free to add new APIs when needed.

### Why are the changes needed?

Guava cache is widely used in Spark; however, it suffers from fate-sharing behavior: if multiple requests try to access the same key at the same time while the key is not yet in the cache, Guava cache blocks all of the requests and creates the object only once. If the creation fails, all of the requests fail immediately without retry, so we might see task failures caused by irrelevant failures in other queries due to fate sharing.

This fate-sharing behavior leads to unexpected results in some situations (for example, in codegen).
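
To make the idea concrete, here is a minimal, purely illustrative sketch of the per-key-lock pattern in Python (the real implementation is the Scala `NonFateSharingCache` added in the diff below): threads asking for the same missing key load it one at a time, so a failed or cancelled load only affects its own caller and later callers simply retry.

```
import threading
from collections import defaultdict

class KeyLockedCache:
    """Illustrative only: serialize loads per key instead of sharing one in-flight load."""

    def __init__(self):
        self._data = {}
        self._meta_lock = threading.Lock()
        self._key_locks = defaultdict(threading.Lock)

    def get(self, key, loader):
        with self._meta_lock:
            key_lock = self._key_locks[key]
        with key_lock:                       # one loader per key at a time
            if key not in self._data:
                self._data[key] = loader()   # a failure here fails only this caller
            return self._data[key]
```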

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
UT

Closes #40982 from liuzqt/SPARK-43300.

Authored-by: Ziqi Liu 
Signed-off-by: Josh Rosen 
---
 .../apache/spark/util/NonFateSharingCache.scala|  78 
 .../spark/util/NonFateSharingCacheSuite.scala  | 140 +
 .../expressions/codegen/CodeGenerator.scala|  10 +-
 3 files changed, 225 insertions(+), 3 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/util/NonFateSharingCache.scala 
b/core/src/main/scala/org/apache/spark/util/NonFateSharingCache.scala
new file mode 100644
index 000..d9847313304
--- /dev/null
+++ b/core/src/main/scala/org/apache/spark/util/NonFateSharingCache.scala
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import java.util.concurrent.Callable
+
+import com.google.common.cache.Cache
+import com.google.common.cache.LoadingCache
+
+/**
+ * SPARK-43300: Guava cache fate-sharing behavior might lead to unexpected 
cascade failure:
+ * when multiple threads access the same key in the cache at the same time 
when the key is not in
+ * the cache, Guava cache will block all requests and load the data only once. 
If the loading fails,
+ * all requests will fail immediately without retry. Therefore individual 
failure will also fail
+ * other irrelevant queries who are waiting for the same key. Given that spark 
can cancel tasks at
+ * arbitrary times for many different reasons, fate sharing means that a task 
which gets canceled
+ * while populating a cache entry can cause spurious failures in tasks from 
unrelated jobs -- even
+ * though those tasks would have successfully populated the cache if they had 
been allowed to try.
+ *
+ * This util Cache wrapper with KeyLock to synchronize threads looking for the 
same key
+ * so that they should run individually and fail as if they had arrived one at 
a time.
+ *
+ * There are so many ways to add cache entries in Guava Cache, instead of 
implementing Guava Cache
+ * and LoadingCache interface, we expose a subset of APIs so that we can 
control at compile time
+ * what 

[spark] branch branch-3.4 updated: [SPARK-43281][SQL] Fix concurrent writer does not update file metrics

2023-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new ee12c81cf33 [SPARK-43281][SQL] Fix concurrent writer does not update 
file metrics
ee12c81cf33 is described below

commit ee12c81cf336e5e8408970e8a65b6890dcc0c63e
Author: ulysses-you 
AuthorDate: Tue May 16 09:42:20 2023 +0800

[SPARK-43281][SQL] Fix concurrent writer does not update file metrics

### What changes were proposed in this pull request?

`DynamicPartitionDataConcurrentWriter` uses the temp file path to get the file status after the commit task. However, the temp file has already been moved to a new path during the commit task.

This PR calls `closeFile` before the commit task.

### Why are the changes needed?

Fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes, after this PR the metrics are correct.

### How was this patch tested?

Added a test.

Closes #40952 from ulysses-you/SPARK-43281.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 592e92262246a6345096655270e2ca114934d0eb)
Signed-off-by: Wenchen Fan 
---
 .../datasources/BasicWriteStatsTracker.scala   |  3 --
 .../datasources/FileFormatDataWriter.scala |  1 +
 .../BasicWriteTaskStatsTrackerSuite.scala  | 16 ++-
 .../sql/test/DataFrameReaderWriterSuite.scala  | 53 ++
 4 files changed, 68 insertions(+), 5 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
index 47685899784..8a9fbd15e2e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
@@ -159,9 +159,6 @@ class BasicWriteTaskStatsTracker(
   }
 
   override def getFinalStats(taskCommitTime: Long): WriteTaskStats = {
-submittedFiles.foreach(updateFileStats)
-submittedFiles.clear()
-
 // Reports bytesWritten and recordsWritten to the Spark output metrics.
 Option(TaskContext.get()).map(_.taskMetrics().outputMetrics).foreach { 
outputMetrics =>
   outputMetrics.setBytesWritten(numBytes)
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
index 0b1b616bd83..a3d2d2ef0f4 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
@@ -427,6 +427,7 @@ class DynamicPartitionDataConcurrentWriter(
   if (status.outputWriter != null) {
 try {
   status.outputWriter.close()
+  statsTrackers.foreach(_.closeFile(status.outputWriter.path()))
 } finally {
   status.outputWriter = null
 }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala
index 96c36dd3c37..e8b9dcf172a 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala
@@ -85,6 +85,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite {
 val missing = new Path(tempDirPath, "missing")
 val tracker = new BasicWriteTaskStatsTracker(conf)
 tracker.newFile(missing.toString)
+tracker.closeFile(missing.toString)
 assertStats(tracker, 0, 0)
   }
 
@@ -92,7 +93,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite {
 val tracker = new BasicWriteTaskStatsTracker(conf)
 tracker.newFile("")
 intercept[IllegalArgumentException] {
-  finalStatus(tracker)
+  tracker.closeFile("")
 }
   }
 
@@ -100,7 +101,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite 
{
 val tracker = new BasicWriteTaskStatsTracker(conf)
 tracker.newFile(null)
 intercept[IllegalArgumentException] {
-  finalStatus(tracker)
+  tracker.closeFile(null)
 }
   }
 
@@ -109,6 +110,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite 
{
 val tracker = new BasicWriteTaskStatsTracker(conf)
 tracker.newFile(file.toString)
 touch(file)
+tracker.closeFile(file.toString)
 assertStats(tracker, 1, 0)
   }
 
@@ -117,6 +119,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite 
{
 val tracker = new 

[spark] branch master updated: [SPARK-43281][SQL] Fix concurrent writer does not update file metrics

2023-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 592e9226224 [SPARK-43281][SQL] Fix concurrent writer does not update 
file metrics
592e9226224 is described below

commit 592e92262246a6345096655270e2ca114934d0eb
Author: ulysses-you 
AuthorDate: Tue May 16 09:42:20 2023 +0800

[SPARK-43281][SQL] Fix concurrent writer does not update file metrics

### What changes were proposed in this pull request?

`DynamicPartitionDataConcurrentWriter` uses the temp file path to get the file status after the commit task. However, the temp file has already been moved to a new path during the commit task.

This PR calls `closeFile` before the commit task.

### Why are the changes needed?

Fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes, after this PR the metrics are correct.

### How was this patch tested?

Added a test.

Closes #40952 from ulysses-you/SPARK-43281.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
---
 .../datasources/BasicWriteStatsTracker.scala   |  3 --
 .../datasources/FileFormatDataWriter.scala |  1 +
 .../BasicWriteTaskStatsTrackerSuite.scala  | 16 ++-
 .../sql/test/DataFrameReaderWriterSuite.scala  | 53 ++
 4 files changed, 68 insertions(+), 5 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
index 47685899784..8a9fbd15e2e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
@@ -159,9 +159,6 @@ class BasicWriteTaskStatsTracker(
   }
 
   override def getFinalStats(taskCommitTime: Long): WriteTaskStats = {
-submittedFiles.foreach(updateFileStats)
-submittedFiles.clear()
-
 // Reports bytesWritten and recordsWritten to the Spark output metrics.
 Option(TaskContext.get()).map(_.taskMetrics().outputMetrics).foreach { 
outputMetrics =>
   outputMetrics.setBytesWritten(numBytes)
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
index 0b1b616bd83..a3d2d2ef0f4 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
@@ -427,6 +427,7 @@ class DynamicPartitionDataConcurrentWriter(
   if (status.outputWriter != null) {
 try {
   status.outputWriter.close()
+  statsTrackers.foreach(_.closeFile(status.outputWriter.path()))
 } finally {
   status.outputWriter = null
 }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala
index 96c36dd3c37..e8b9dcf172a 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala
@@ -85,6 +85,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite {
 val missing = new Path(tempDirPath, "missing")
 val tracker = new BasicWriteTaskStatsTracker(conf)
 tracker.newFile(missing.toString)
+tracker.closeFile(missing.toString)
 assertStats(tracker, 0, 0)
   }
 
@@ -92,7 +93,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite {
 val tracker = new BasicWriteTaskStatsTracker(conf)
 tracker.newFile("")
 intercept[IllegalArgumentException] {
-  finalStatus(tracker)
+  tracker.closeFile("")
 }
   }
 
@@ -100,7 +101,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite 
{
 val tracker = new BasicWriteTaskStatsTracker(conf)
 tracker.newFile(null)
 intercept[IllegalArgumentException] {
-  finalStatus(tracker)
+  tracker.closeFile(null)
 }
   }
 
@@ -109,6 +110,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite 
{
 val tracker = new BasicWriteTaskStatsTracker(conf)
 tracker.newFile(file.toString)
 touch(file)
+tracker.closeFile(file.toString)
 assertStats(tracker, 1, 0)
   }
 
@@ -117,6 +119,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite 
{
 val tracker = new BasicWriteTaskStatsTracker(conf)
 tracker.newFile(file.toString)
 write1(file)
+

[spark] branch master updated: [SPARK-43413][SQL] Fix IN subquery ListQuery nullability

2023-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2e568218300 [SPARK-43413][SQL] Fix IN subquery ListQuery nullability
2e568218300 is described below

commit 2e56821830019765bf8530e0e6a8a5abd6125293
Author: Jack Chen 
AuthorDate: Tue May 16 09:40:54 2023 +0800

[SPARK-43413][SQL] Fix IN subquery ListQuery nullability

### What changes were proposed in this pull request?
Before this PR, IN subquery expressions are incorrectly marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand side, but the right-hand side of an IN subquery, the `ListQuery`, is always defined with nullability = false. This is incorrect and can lead to incorrect query transformations.

Example: `(non_nullable_col IN (select nullable_col)) <=> TRUE`. Here the 
IN expression returns NULL when the nullable_col is null, but our code marks it 
as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> 
TRUE, transforming the expression to `non_nullable_col IN (select 
nullable_col)`, which is an incorrect transformation because NULL values of 
nullable_col now cause the expression to yield NULL instead of FALSE.

Fix this by calculating nullability correctly from the ListQuery child 
output expressions.

This bug can potentially lead to wrong results, but in most cases this 
doesn't directly cause wrong results end-to-end, because IN subqueries are 
almost always transformed to semi/anti/existence joins in 
RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, 
which is another bug. But we can observe it causing wrong behavior in unit 
tests at least.

This is a long-standing bug that has existed at least since 2016, as long 
as the ListQuery class has existed.
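
A tiny, hypothetical illustration (not from the patch, assuming an active `spark` session) of the three-valued logic at play: when the right-hand side contains NULL and the value is not found, `IN` yields NULL, and `NULL <=> TRUE` is false, which is exactly why discarding the `<=> TRUE` wrapper changes the expression's meaning.

```
spark.sql(
    "SELECT (1 IN (NULL, 2)) AS in_result, ((1 IN (NULL, 2)) <=> TRUE) AS wrapped"
).show()
# in_result is NULL, wrapped is false
```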

### Why are the changes needed?
Fix correctness bug.

### Does this PR introduce _any_ user-facing change?
May change query results to fix correctness bug.

### How was this patch tested?
Unit tests

Closes #41094 from jchen5/listquery-nullable.

Authored-by: Jack Chen 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/predicates.scala  |  10 +-
 .../spark/sql/catalyst/expressions/subquery.scala  |  10 +-
 .../org/apache/spark/sql/internal/SQLConf.scala|  10 ++
 .../BinaryComparisonSimplificationSuite.scala  |  32 +
 .../subquery/in-subquery/in-nullability.sql.out| 141 +
 .../inputs/subquery/in-subquery/in-nullability.sql |  14 ++
 .../subquery/in-subquery/in-nullability.sql.out|  71 +++
 7 files changed, 286 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
index 38005e78653..ee2ba7c73d1 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
@@ -399,7 +399,15 @@ case class InSubquery(values: Seq[Expression], query: 
ListQuery)
   }
 
   override def children: Seq[Expression] = values :+ query
-  override def nullable: Boolean = children.exists(_.nullable)
+  override def nullable: Boolean = {
+if (!SQLConf.get.getConf(SQLConf.LEGACY_IN_SUBQUERY_NULLABILITY)) {
+  values.exists(_.nullable) || query.childOutputs.exists(_.nullable)
+} else {
+  // Legacy (incorrect) behavior checked only the nullability of the 
left-hand side
+  // (see SPARK-43413).
+  values.exists(_.nullable)
+}
+  }
   override def toString: String = s"$value IN ($query)"
   override def sql: String = s"(${value.sql} IN (${query.sql}))"
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
index 1e957466308..b0f10895c17 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
@@ -24,6 +24,7 @@ import org.apache.spark.sql.catalyst.plans.QueryPlan
 import org.apache.spark.sql.catalyst.plans.logical.{Filter, HintInfo, 
LogicalPlan}
 import org.apache.spark.sql.catalyst.trees.TreePattern._
 import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
 import org.apache.spark.util.collection.BitSet
 
@@ -367,7 +368,14 @@ case class ListQuery(
 plan.output.head.dataType
   }
   override lazy val resolved: Boolean = childrenResolved && plan.resolved && 
numCols != 

[GitHub] [spark-website] HyukjinKwon commented on pull request #462: Add Jie Yang to committers

2023-05-15 Thread via GitHub


HyukjinKwon commented on PR #462:
URL: https://github.com/apache/spark-website/pull/462#issuecomment-1548802646

   @LuciferYang feel free to merge it by yourself!





[spark] branch master updated: [MINOR] Remove redundant character escape "\\" and add UT

2023-05-15 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b8f22f33308 [MINOR] Remove redundant character escape "\\" and add UT
b8f22f33308 is described below

commit b8f22f33308ab51b93052457dba17b04c2daeb4a
Author: panbingkun 
AuthorDate: Mon May 15 18:04:31 2023 -0500

[MINOR] Remove redundant character escape "\\" and add UT

### What changes were proposed in this pull request?
The PR aims to remove redundant character escapes ("\\") and add a UT for `SparkHadoopUtil.substituteHadoopVariables`.

### Why are the changes needed?
Make the code cleaner and remove a warning.
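
The same point can be sanity-checked outside Scala. A small illustrative Python snippet (an addition here, not part of the patch; the escaping rules for `}` outside a character class are analogous in Java and Python regexes) shows that the escaped and unescaped forms match the same token:

```
import re

text = "www${hadoopconf-xxx}zzz"
with_escape = re.compile(r"(\$\{hadoopconf-[^}$\s]+\})")      # redundant \}
without_escape = re.compile(r"(\$\{hadoopconf-[^}$\s]+})")    # form kept by this change

assert with_escape.search(text).group(1) == "${hadoopconf-xxx}"
assert without_escape.search(text).group(1) == "${hadoopconf-xxx}"
```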

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA & Add new UT.

Closes #41170 from panbingkun/SparkHadoopUtil_fix.

Authored-by: panbingkun 
Signed-off-by: Sean Owen 
---
 .../org/apache/spark/deploy/SparkHadoopUtil.scala  |  4 +-
 .../apache/spark/deploy/SparkHadoopUtilSuite.scala | 52 ++
 2 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
index 4908a081367..9ff2621b791 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
@@ -174,7 +174,7 @@ private[spark] class SparkHadoopUtil extends Logging {
  * So we need a map to track the bytes read from the child threads and 
parent thread,
  * summing them together to get the bytes read of this task.
  */
-new Function0[Long] {
+new (() => Long) {
   private val bytesReadMap = new mutable.HashMap[Long, Long]()
 
   override def apply(): Long = {
@@ -248,7 +248,7 @@ private[spark] class SparkHadoopUtil extends Logging {
 if (isGlobPath(pattern)) globPath(fs, pattern) else Seq(pattern)
   }
 
-  private val HADOOP_CONF_PATTERN = 
"(\\$\\{hadoopconf-[^\\}\\$\\s]+\\})".r.unanchored
+  private val HADOOP_CONF_PATTERN = 
"(\\$\\{hadoopconf-[^}$\\s]+})".r.unanchored
 
   /**
* Substitute variables by looking them up in Hadoop configs. Only variables 
that match the
diff --git 
a/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala 
b/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala
index 17f1476cd8d..6250b7d0ed2 100644
--- a/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala
+++ b/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala
@@ -123,6 +123,58 @@ class SparkHadoopUtilSuite extends SparkFunSuite {
 assertConfigValue(hadoopConf, "fs.s3a.session.token", null)
   }
 
+  test("substituteHadoopVariables") {
+val hadoopConf = new Configuration(false)
+hadoopConf.set("xxx", "yyy")
+
+val text1 = "${hadoopconf-xxx}"
+val result1 = new SparkHadoopUtil().substituteHadoopVariables(text1, 
hadoopConf)
+assert(result1 == "yyy")
+
+val text2 = "${hadoopconf-xxx"
+val result2 = new SparkHadoopUtil().substituteHadoopVariables(text2, 
hadoopConf)
+assert(result2 == "${hadoopconf-xxx")
+
+val text3 = "${hadoopconf-xxx}zzz"
+val result3 = new SparkHadoopUtil().substituteHadoopVariables(text3, 
hadoopConf)
+assert(result3 == "yyyzzz")
+
+val text4 = "www${hadoopconf-xxx}zzz"
+val result4 = new SparkHadoopUtil().substituteHadoopVariables(text4, 
hadoopConf)
+assert(result4 == "wwwyyyzzz")
+
+val text5 = "www${hadoopconf-xxx}"
+val result5 = new SparkHadoopUtil().substituteHadoopVariables(text5, 
hadoopConf)
+assert(result5 == "wwwyyy")
+
+val text6 = "www${hadoopconf-xxx"
+val result6 = new SparkHadoopUtil().substituteHadoopVariables(text6, 
hadoopConf)
+assert(result6 == "www${hadoopconf-xxx")
+
+val text7 = "www$hadoopconf-xxx}"
+val result7 = new SparkHadoopUtil().substituteHadoopVariables(text7, 
hadoopConf)
+assert(result7 == "www$hadoopconf-xxx}")
+
+val text8 = "www{hadoopconf-xxx}"
+val result8 = new SparkHadoopUtil().substituteHadoopVariables(text8, 
hadoopConf)
+assert(result8 == "www{hadoopconf-xxx}")
+  }
+
+  test("Redundant character escape '\\}' in RegExp ") {
+val HADOOP_CONF_PATTERN_1 = "(\\$\\{hadoopconf-[^}$\\s]+})".r.unanchored
+val HADOOP_CONF_PATTERN_2 = "(\\$\\{hadoopconf-[^}$\\s]+\\})".r.unanchored
+
+val text = "www${hadoopconf-xxx}zzz"
+val target1 = text match {
+  case HADOOP_CONF_PATTERN_1(matched) => text.replace(matched, "yyy")
+}
+val target2 = text match {
+  case HADOOP_CONF_PATTERN_2(matched) => text.replace(matched, "yyy")
+}
+assert(target1 == "wwwyyyzzz")
+assert(target2 == "wwwyyyzzz")
+  }
+
   /**
* Assert that a hadoop configuration option has the expected 

[spark] branch master updated: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

2023-05-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c8e85eab3fc [SPARK-43442][PS][CONNECT][TESTS] Split test module 
`pyspark_pandas_connect`
c8e85eab3fc is described below

commit c8e85eab3fca0e4e5f4bdf9d1d6d1702ecf3fd07
Author: Ruifeng Zheng 
AuthorDate: Mon May 15 10:56:27 2023 -0700

[SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

### What changes were proposed in this pull request?
Split test module `pyspark_pandas_connect`.

Add a new module `pyspark_pandas_slow_connect`, which should be kept in line with `pyspark_pandas_slow`.

### Why are the changes needed?
`pyspark_pandas_connect` may take 3~4 hours

### Does this PR introduce _any_ user-facing change?
No, test-only

### How was this patch tested?
updated CI

Closes #41127 from zhengruifeng/test_split_pandas_connect.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml |  2 ++
 dev/sparktestsupport/modules.py  | 17 +
 dev/sparktestsupport/utils.py| 14 +++---
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 4aff1bc9753..580415540a6 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -341,6 +341,8 @@ jobs:
 pyspark-connect, pyspark-errors
   - >-
 pyspark-pandas-connect
+  - >-
+pyspark-pandas-slow-connect
 env:
   MODULES_TO_TEST: ${{ matrix.modules }}
   HADOOP_PROFILE: ${{ inputs.hadoop }}
diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index b24cdbddbf6..5e374ebba97 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -852,6 +852,23 @@ pyspark_pandas_connect = Module(
 "pyspark.pandas.tests.connect.test_parity_utils",
 "pyspark.pandas.tests.connect.test_parity_window",
 "pyspark.pandas.tests.connect.indexes.test_parity_base",
+],
+excluded_python_implementations=[
+"PyPy"  # Skip these tests under PyPy since they require numpy, 
pandas, and pyarrow and
+# they aren't available there
+],
+)
+
+
+# This module should contain the same test list with 'pyspark_pandas_slow' for 
maintenance.
+pyspark_pandas_slow_connect = Module(
+name="pyspark-pandas-slow-connect",
+dependencies=[pyspark_connect, pyspark_pandas_slow],
+source_file_regexes=[
+"python/pyspark/pandas",
+],
+python_test_goals=[
+# pandas-on-Spark unittests
 "pyspark.pandas.tests.connect.indexes.test_parity_datetime",
 "pyspark.pandas.tests.connect.test_parity_dataframe",
 "pyspark.pandas.tests.connect.test_parity_dataframe_slow",
diff --git a/dev/sparktestsupport/utils.py b/dev/sparktestsupport/utils.py
index ebb69be2841..d07fc936f8f 100755
--- a/dev/sparktestsupport/utils.py
+++ b/dev/sparktestsupport/utils.py
@@ -113,24 +113,24 @@ def determine_modules_to_test(changed_modules, 
deduplicated=True):
 ... # doctest: +NORMALIZE_WHITESPACE
 ['avro', 'connect', 'docker-integration-tests', 'examples', 'hive', 
'hive-thriftserver',
  'mllib', 'protobuf', 'pyspark-connect', 'pyspark-ml', 'pyspark-mllib', 
'pyspark-pandas',
- 'pyspark-pandas-connect', 'pyspark-pandas-slow', 'pyspark-sql', 'repl', 
'sparkr', 'sql',
- 'sql-kafka-0-10']
+ 'pyspark-pandas-connect', 'pyspark-pandas-slow', 
'pyspark-pandas-slow-connect', 'pyspark-sql',
+ 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
 >>> sorted([x.name for x in determine_modules_to_test(
 ... [modules.sparkr, modules.sql], deduplicated=False)])
 ... # doctest: +NORMALIZE_WHITESPACE
 ['avro', 'connect', 'docker-integration-tests', 'examples', 'hive', 
'hive-thriftserver',
  'mllib', 'protobuf', 'pyspark-connect', 'pyspark-ml', 'pyspark-mllib', 
'pyspark-pandas',
- 'pyspark-pandas-connect', 'pyspark-pandas-slow', 'pyspark-sql', 'repl', 
'sparkr', 'sql',
- 'sql-kafka-0-10']
+ 'pyspark-pandas-connect', 'pyspark-pandas-slow', 
'pyspark-pandas-slow-connect', 'pyspark-sql',
+ 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
 >>> sorted([x.name for x in determine_modules_to_test(
 ... [modules.sql, modules.core], deduplicated=False)])
 ... # doctest: +NORMALIZE_WHITESPACE
 ['avro', 'catalyst', 'connect', 'core', 'docker-integration-tests', 
'examples', 'graphx',
  'hive', 'hive-thriftserver', 'mllib', 'mllib-local', 'protobuf', 
'pyspark-connect',
  'pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas', 
'pyspark-pandas-connect',
- 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql', 

[spark] branch master updated: [SPARK-43494][CORE] Directly call `replicate()` for `HdfsDataOutputStreamBuilder` instead of reflection in `SparkHadoopUtil#createFile`

2023-05-15 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 98da24869b6 [SPARK-43494][CORE] Directly call `replicate()` for 
`HdfsDataOutputStreamBuilder` instead of reflection in 
`SparkHadoopUtil#createFile`
98da24869b6 is described below

commit 98da24869b6510435619d0554da24c31c14da97f
Author: yangjie01 
AuthorDate: Mon May 15 08:44:39 2023 -0700

[SPARK-43494][CORE] Directly call `replicate()` for 
`HdfsDataOutputStreamBuilder` instead of reflection in 
`SparkHadoopUtil#createFile`

### What changes were proposed in this pull request?
As discussed in https://github.com/apache/spark/pull/40945#discussion_r1191804004, `replicate()` is a method specific to `HdfsDataOutputStreamBuilder`, so this PR uses a case match to further remove the use of reflection in `SparkHadoopUtil#createFile`.

### Why are the changes needed?
Code simplification: remove reflection calls that were only needed for compatibility with Hadoop 2.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #41164 from LuciferYang/SPARK-43272-FOLLOWUP.

Authored-by: yangjie01 
Signed-off-by: Chao Sun 
---
 .../org/apache/spark/deploy/SparkHadoopUtil.scala  | 29 --
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
index 842d3556112..4908a081367 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
@@ -30,6 +30,7 @@ import scala.language.existentials
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs._
+import org.apache.hadoop.hdfs.DistributedFileSystem.HdfsDataOutputStreamBuilder
 import org.apache.hadoop.mapred.JobConf
 import org.apache.hadoop.security.{Credentials, UserGroupInformation}
 import org.apache.hadoop.security.token.{Token, TokenIdentifier}
@@ -566,24 +567,16 @@ private[spark] object SparkHadoopUtil extends Logging {
 if (allowEC) {
   fs.create(path)
 } else {
-  try {
-// the builder api does not resolve relative paths, nor does it create 
parent dirs, while
-// the old api does.
-if (!fs.mkdirs(path.getParent())) {
-  throw new IOException(s"Failed to create parents of $path")
-}
-val qualifiedPath = fs.makeQualified(path)
-val builder = fs.createFile(qualifiedPath)
-val builderCls = builder.getClass()
-// this may throw a NoSuchMethodException if the path is not on hdfs
-val replicateMethod = builderCls.getMethod("replicate")
-val buildMethod = builderCls.getMethod("build")
-val b2 = replicateMethod.invoke(builder)
-buildMethod.invoke(b2).asInstanceOf[FSDataOutputStream]
-  } catch {
-case  _: NoSuchMethodException =>
-  // No replicate() method, so just create a file with old apis.
-  fs.create(path)
+  // the builder api does not resolve relative paths, nor does it create 
parent dirs, while
+  // the old api does.
+  if (!fs.mkdirs(path.getParent())) {
+throw new IOException(s"Failed to create parents of $path")
+  }
+  val qualifiedPath = fs.makeQualified(path)
+  val builder = fs.createFile(qualifiedPath)
+  builder match {
+case hb: HdfsDataOutputStreamBuilder => hb.replicate().build()
+case _ => fs.create(path)
   }
 }
   }





[spark] branch master updated: [SPARK-43223][CONNECT] Typed agg, reduce functions, RelationalGroupedDataset#as

2023-05-15 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ba8cae2031f [SPARK-43223][CONNECT] Typed agg, reduce functions, 
RelationalGroupedDataset#as
ba8cae2031f is described below

commit ba8cae2031f81dc326d386cbe7d19c1f0a8f239e
Author: Zhen Li 
AuthorDate: Mon May 15 11:05:33 2023 -0400

[SPARK-43223][CONNECT] Typed agg, reduce functions, 
RelationalGroupedDataset#as

### What changes were proposed in this pull request?
Added the agg, reduce support in `KeyValueGroupedDataset`.
Added `Dataset#reduce`
Added `RelationalGroupedDataset#as`.

Summary:
* `KVGDS#agg`: `KVGDS#agg` and the `RelationalGroupedDS#agg` shares the 
exact same proto. The only difference is that the KVGDS always passing a UDF as 
the first grouping expression. That's also how we tell them apart in this PR.
* `KVGDS#reduce`: Reduce is a special aggregation. The client uses an UnresolvedFunc "reduce" to mark that the agg operator is a `ReduceAggregator` and calls `KVGDS#agg` directly. The server can pick this func up directly and reuse the agg code path by sending in a `ReduceAggregator`.
* `Dataset#reduce`: This is free after `KVGDS#reduce`.
* `RelationalGroupedDS#as`: The only difference between a `KVGDS` created via `ds#groupByKey` and one created via `ds#agg#as` is the grouping expressions. The former requires one grouping func as the grouping expression; the latter uses a dummy func (to pass encoders/types to the server) plus grouping expressions. Thus the server can count how many grouping expressions it received and decide whether the `KVGDS` should be created as `ds#groupByKey` or as `ds#agg#as`.

Followups:
* [SPARK-43415] Support mapValues in the Agg functions.
* [SPARK-43416] The tupled ProductEncoder does not pick up the field names from the server.

### Why are the changes needed?
Missing APIs in Scala Client

### Does this PR introduce _any_ user-facing change?
Added the `KeyValueGroupedDataset#agg`, `KeyValueGroupedDataset#reduce`, `Dataset#reduce`, and `RelationalGroupedDataset#as` methods to the Scala client.

### How was this patch tested?
E2E tests

Closes #40796 from zhenlineo/typed-agg.

Authored-by: Zhen Li 
Signed-off-by: Herman van Hovell 
---
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  66 +++--
 .../apache/spark/sql/KeyValueGroupedDataset.scala  | 255 --
 .../spark/sql/RelationalGroupedDataset.scala   |  14 +-
 .../sql/KeyValueGroupedDatasetE2ETestSuite.scala   | 290 ++---
 .../sql/UserDefinedFunctionE2ETestSuite.scala  |  18 ++
 .../CheckConnectJvmClientCompatibility.scala   |   8 -
 .../spark/sql/connect/client/util/QueryTest.scala  |  36 ++-
 .../apache/spark/sql/connect/common/UdfUtils.scala |   4 +
 .../sql/connect/planner/SparkConnectPlanner.scala  | 209 +++
 .../spark/sql/catalyst/plans/logical/object.scala  |  16 ++
 .../main/scala/org/apache/spark/sql/Column.scala   |  13 +-
 .../apache/spark/sql/KeyValueGroupedDataset.scala  |  15 +-
 .../spark/sql/RelationalGroupedDataset.scala   |  53 ++--
 .../spark/sql/expressions/ReduceAggregator.scala   |   6 +
 .../apache/spark/sql/internal/TypedAggUtils.scala  |  62 +
 15 files changed, 883 insertions(+), 182 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala
index 555f6c312c5..7a680bde7d3 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1242,10 +1242,7 @@ class Dataset[T] private[sql] (
*/
   @scala.annotation.varargs
   def groupBy(cols: Column*): RelationalGroupedDataset = {
-new RelationalGroupedDataset(
-  toDF(),
-  cols.map(_.expr),
-  proto.Aggregate.GroupType.GROUP_TYPE_GROUPBY)
+new RelationalGroupedDataset(toDF(), cols, 
proto.Aggregate.GroupType.GROUP_TYPE_GROUPBY)
   }
 
   /**
@@ -1273,10 +1270,45 @@ class Dataset[T] private[sql] (
 val colNames: Seq[String] = col1 +: cols
 new RelationalGroupedDataset(
   toDF(),
-  colNames.map(colName => Column(colName).expr),
+  colNames.map(colName => Column(colName)),
   proto.Aggregate.GroupType.GROUP_TYPE_GROUPBY)
   }
 
+  /**
+   * (Scala-specific) Reduces the elements of this Dataset using the specified 
binary function.
+   * The given `func` must be commutative and associative or the result may be 
non-deterministic.
+   *
+   * @group action
+   * @since 3.5.0
+   */
+  def reduce(func: (T, T) => T): T = {
+val udf = ScalarUserDefinedFunction(
+  function = func,
+  inputEncoders = encoder :: encoder :: Nil,
+  outputEncoder = 

[GitHub] [spark-website] huaxingao commented on pull request #462: Add Jie Yang to committers

2023-05-15 Thread via GitHub


huaxingao commented on PR #462:
URL: https://github.com/apache/spark-website/pull/462#issuecomment-1548006578

   Congratulations!





[spark] branch master updated: [SPARK-43508][DOC] Replace the link related to hadoop version 2 with hadoop version 3

2023-05-15 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cadfef6f807 [SPARK-43508][DOC] Replace the link related to hadoop 
version 2 with hadoop version 3
cadfef6f807 is described below

commit cadfef6f807a75ff403f6dd9234a3996ec7c691c
Author: panbingkun 
AuthorDate: Mon May 15 09:44:03 2023 -0500

[SPARK-43508][DOC] Replace the link related to hadoop version 2 with hadoop 
version 3

### What changes were proposed in this pull request?
This PR replaces a documentation link that points to Hadoop 2 with its Hadoop 3 equivalent.

### Why are the changes needed?
Because [SPARK-40651](https://issues.apache.org/jira/browse/SPARK-40651) dropped the Hadoop 2 binary distribution from the release process and [SPARK-42447](https://issues.apache.org/jira/browse/SPARK-42447) removed the Hadoop 2 GitHub Actions job.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #41171 from panbingkun/SPARK-43508.

Authored-by: panbingkun 
Signed-off-by: Sean Owen 
---
 docs/streaming-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/streaming-programming-guide.md 
b/docs/streaming-programming-guide.md
index 5ed66eab348..f8f98ca5442 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -748,7 +748,7 @@ of the store is consistent with that expected by Spark 
Streaming. It may be
 that writing directly into a destination directory is the appropriate strategy 
for
 streaming data via the chosen object store.
 
-For more details on this topic, consult the [Hadoop Filesystem 
Specification](https://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/filesystem/introduction.html).
+For more details on this topic, consult the [Hadoop Filesystem 
Specification](https://hadoop.apache.org/docs/stable3/hadoop-project-dist/hadoop-common/filesystem/introduction.html).
 
  Streams based on Custom Receivers
 {:.no_toc}





[spark] branch master updated: [SPARK-43495][BUILD] Upgrade RoaringBitmap to 0.9.44

2023-05-15 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bee8187d731 [SPARK-43495][BUILD] Upgrade RoaringBitmap to 0.9.44
bee8187d731 is described below

commit bee8187d7319ededf82701b4fd2a2928cd56c7f8
Author: yangjie01 
AuthorDate: Mon May 15 08:46:20 2023 -0500

[SPARK-43495][BUILD] Upgrade RoaringBitmap to 0.9.44

### What changes were proposed in this pull request?
This PR aims to upgrade `RoaringBitmap` from 0.9.39 to 0.9.44.

### Why are the changes needed?
The new version brings 2 bug fixes:
- https://github.com/RoaringBitmap/RoaringBitmap/issues/619 | 
https://github.com/RoaringBitmap/RoaringBitmap/pull/620
- https://github.com/RoaringBitmap/RoaringBitmap/issues/623 | 
https://github.com/RoaringBitmap/RoaringBitmap/pull/624

The full release notes are as follows:

- https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.40
- https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.41
- https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.44

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #41165 from LuciferYang/SPARK-43495.

Authored-by: yangjie01 
Signed-off-by: Sean Owen 
---
 core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt | 10 +-
 core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt | 10 +-
 core/benchmarks/MapStatusesConvertBenchmark-results.txt   | 10 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3 |  4 ++--
 pom.xml   |  2 +-
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt 
b/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt
index ef9dd139ff2..f42b95e8d4c 100644
--- a/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt
+++ b/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt
@@ -2,12 +2,12 @@
 MapStatuses Convert Benchmark
 

 
-OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
-Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
+OpenJDK 64-Bit Server VM 11.0.19+7 on Linux 5.15.0-1037-azure
+Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 MapStatuses Convert:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Num Maps: 5 Fetch partitions:500   1288   1317 
 38  0.0  1288194389.0   1.0X
-Num Maps: 5 Fetch partitions:1000  2608   2671 
 65  0.0  2607771122.0   0.5X
-Num Maps: 5 Fetch partitions:1500  3985   4026 
 64  0.0  3984885770.0   0.3X
+Num Maps: 5 Fetch partitions:500   1346   1367 
 28  0.0  1345826909.0   1.0X
+Num Maps: 5 Fetch partitions:1000  2807   2818 
 11  0.0  2806866333.0   0.5X
+Num Maps: 5 Fetch partitions:1500  4287   4308 
 19  0.0  4286688536.0   0.3X
 
 
diff --git a/core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt 
b/core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt
index 12af87d9689..b0b61cc11ef 100644
--- a/core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt
+++ b/core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt
@@ -2,12 +2,12 @@
 MapStatuses Convert Benchmark
 

 
-OpenJDK 64-Bit Server VM 17.0.6+10 on Linux 5.15.0-1031-azure
-Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
+OpenJDK 64-Bit Server VM 17.0.7+7 on Linux 5.15.0-1037-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 MapStatuses Convert:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Num Maps: 5 Fetch partitions:500   1052   1061 
 12  0.0  1051946292.0   1.0X
-Num Maps: 5 Fetch partitions:1000  1888   2007 
109  0.0  1888235523.0   0.6X
-Num Maps: 5 Fetch partitions:1500  3070   3149 
 81  0.0  3070386448.0   0.3X
+Num Maps: 5 Fetch partitions:500   1041   1050 
 16  0.0  

[spark] branch branch-3.4 updated: [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause

2023-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new cacaed8e36f [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET 
clause
cacaed8e36f is described below

commit cacaed8e36f512354dd889c30bb63d911b573342
Author: Jiaan Geng 
AuthorDate: Mon May 15 20:54:55 2023 +0800

[SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause

### What changes were proposed in this pull request?
Spark 3.4.0 released the new `OFFSET` clause syntax, but the SQL reference was
missing a description of it.

### Why are the changes needed?
Adds a SQL reference page for the `OFFSET` clause.
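
As a quick illustration (a sketch only, assuming a `SparkSession` named `spark`
on Spark 3.4 or later), the clause documented here can be used as follows:

```scala
// Register a tiny table to query against.
spark.range(1, 11).createOrReplaceTempView("ids")

// OFFSET skips a number of rows; pair it with ORDER BY for a stable result.
spark.sql("SELECT id FROM ids ORDER BY id OFFSET 3").show()

// Combined with LIMIT for simple pagination: skip 3 rows, return the next 5.
spark.sql("SELECT id FROM ids ORDER BY id LIMIT 5 OFFSET 3").show()
```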

### Does this PR introduce _any_ user-facing change?
Yes. Users can now find the SQL reference for the `OFFSET` clause.

### How was this patch tested?
Manually verified.

![image](https://github.com/apache/spark/assets/8486025/55398194-5193-45eb-ac04-10f5f0793f7f)

![image](https://github.com/apache/spark/assets/8486025/fef0abc1-7dfa-44e2-b2e0-a56fa82a0817)

![image](https://github.com/apache/spark/assets/8486025/5ab9dc39-6812-45b4-a758-85668ab040f1)

![image](https://github.com/apache/spark/assets/8486025/b726abd4-daae-4de4-a78e-45120573e699)

Closes #41151 from beliefer/SPARK-43483.

Authored-by: Jiaan Geng 
Signed-off-by: Wenchen Fan 
---
 docs/sql-ref-syntax-qry-select-case.md |  1 +
 docs/sql-ref-syntax-qry-select-clusterby.md|  1 +
 docs/sql-ref-syntax-qry-select-distribute-by.md|  1 +
 docs/sql-ref-syntax-qry-select-groupby.md  |  1 +
 docs/sql-ref-syntax-qry-select-having.md   |  1 +
 docs/sql-ref-syntax-qry-select-lateral-view.md |  1 +
 docs/sql-ref-syntax-qry-select-limit.md|  1 +
 ...imit.md => sql-ref-syntax-qry-select-offset.md} | 51 +-
 docs/sql-ref-syntax-qry-select-orderby.md  |  1 +
 docs/sql-ref-syntax-qry-select-pivot.md|  1 +
 docs/sql-ref-syntax-qry-select-sortby.md   |  1 +
 docs/sql-ref-syntax-qry-select-transform.md|  1 +
 docs/sql-ref-syntax-qry-select-unpivot.md  |  1 +
 docs/sql-ref-syntax-qry-select-where.md|  1 +
 docs/sql-ref-syntax-qry-select.md  |  1 +
 docs/sql-ref-syntax.md |  1 +
 16 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/docs/sql-ref-syntax-qry-select-case.md 
b/docs/sql-ref-syntax-qry-select-case.md
index 5d0d055919e..d9725f001ae 100644
--- a/docs/sql-ref-syntax-qry-select-case.md
+++ b/docs/sql-ref-syntax-qry-select-case.md
@@ -105,6 +105,7 @@ SELECT * FROM person
 * [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
 * [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
 * [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
+* [OFFSET Clause](sql-ref-syntax-qry-select-offset.html)
 * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
 * [UNPIVOT Clause](sql-ref-syntax-qry-select-unpivot.html)
 * [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
diff --git a/docs/sql-ref-syntax-qry-select-clusterby.md 
b/docs/sql-ref-syntax-qry-select-clusterby.md
index 8f495143b2d..79d72ca438b 100644
--- a/docs/sql-ref-syntax-qry-select-clusterby.md
+++ b/docs/sql-ref-syntax-qry-select-clusterby.md
@@ -99,6 +99,7 @@ SELECT age, name FROM person CLUSTER BY age;
 * [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
 * [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
 * [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
+* [OFFSET Clause](sql-ref-syntax-qry-select-offset.html)
 * [CASE Clause](sql-ref-syntax-qry-select-case.html)
 * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
 * [UNPIVOT Clause](sql-ref-syntax-qry-select-unpivot.html)
diff --git a/docs/sql-ref-syntax-qry-select-distribute-by.md 
b/docs/sql-ref-syntax-qry-select-distribute-by.md
index f686f032ab3..91c75f61b97 100644
--- a/docs/sql-ref-syntax-qry-select-distribute-by.md
+++ b/docs/sql-ref-syntax-qry-select-distribute-by.md
@@ -94,6 +94,7 @@ SELECT age, name FROM person DISTRIBUTE BY age;
 * [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
 * [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html)
 * [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
+* [OFFSET Clause](sql-ref-syntax-qry-select-offset.html)
 * [CASE Clause](sql-ref-syntax-qry-select-case.html)
 * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
 * [UNPIVOT Clause](sql-ref-syntax-qry-select-unpivot.html)
diff --git a/docs/sql-ref-syntax-qry-select-groupby.md 
b/docs/sql-ref-syntax-qry-select-groupby.md
index f3157f514b4..72ccfcce099 100644
--- a/docs/sql-ref-syntax-qry-select-groupby.md
+++ b/docs/sql-ref-syntax-qry-select-groupby.md
@@ -314,6 +314,7 @@ SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM 

[spark] branch master updated (e06275c6f14 -> e1114e86194)

2023-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e06275c6f14 [SPARK-43485][SQL] Fix the error message for the `unit` 
argument of the datetime add/diff functions
 add e1114e86194 [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET 
clause

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-qry-select-case.md |  1 +
 docs/sql-ref-syntax-qry-select-clusterby.md|  1 +
 docs/sql-ref-syntax-qry-select-distribute-by.md|  1 +
 docs/sql-ref-syntax-qry-select-groupby.md  |  1 +
 docs/sql-ref-syntax-qry-select-having.md   |  1 +
 docs/sql-ref-syntax-qry-select-lateral-view.md |  1 +
 docs/sql-ref-syntax-qry-select-limit.md|  1 +
 ...imit.md => sql-ref-syntax-qry-select-offset.md} | 51 +-
 docs/sql-ref-syntax-qry-select-orderby.md  |  1 +
 docs/sql-ref-syntax-qry-select-pivot.md|  1 +
 docs/sql-ref-syntax-qry-select-sortby.md   |  1 +
 docs/sql-ref-syntax-qry-select-transform.md|  1 +
 docs/sql-ref-syntax-qry-select-unpivot.md  |  1 +
 docs/sql-ref-syntax-qry-select-where.md|  1 +
 docs/sql-ref-syntax-qry-select.md  |  1 +
 docs/sql-ref-syntax.md |  1 +
 16 files changed, 36 insertions(+), 30 deletions(-)
 copy docs/{sql-ref-syntax-qry-select-limit.md => 
sql-ref-syntax-qry-select-offset.md} (68%)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43485][SQL] Fix the error message for the `unit` argument of the datetime add/diff functions

2023-05-15 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e06275c6f14 [SPARK-43485][SQL] Fix the error message for the `unit` 
argument of the datetime add/diff functions
e06275c6f14 is described below

commit e06275c6f14e88ba583ffb3aac1159718a8cae83
Author: Max Gekk 
AuthorDate: Mon May 15 13:55:18 2023 +0300

[SPARK-43485][SQL] Fix the error message for the `unit` argument of the 
datetime add/diff functions

### What changes were proposed in this pull request?
In this PR, I propose to extend the grammar rules for `DATEADD`/`TIMESTAMPADD`
and `DATEDIFF`/`TIMESTAMPDIFF`, and to catch a wrong type of the first argument
`unit` when a user passes a string instead of an identifier like `YEAR`, ...,
`MICROSECOND`. In that case, Spark raises an error of the new error class
`INVALID_PARAMETER_VALUE.DATETIME_UNIT`.

### Why are the changes needed?
To make the error message clearer for the case when a literal string instead
of an identifier is passed to the datetime `ADD`/`DIFF` functions:
```sql
spark-sql (default)> select dateadd('MONTH', 1, date'2023-05-11');
[WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `dateadd` requires 2 parameters but 
the actual number is 3. Please, refer to 
'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix.; line 
1 pos 7
```

### Does this PR introduce _any_ user-facing change?
Yes, it changes the error class.

After the changes:
```sql
spark-sql (default)> select dateadd('MONTH', 1, date'2023-05-11');

[INVALID_PARAMETER_VALUE.DATETIME_UNIT] The value of parameter(s) `unit` in 
`dateadd` is invalid: expects one of the units without quotes YEAR, QUARTER, 
MONTH, WEEK, DAY, DAYOFYEAR, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, 
but got the string literal 'MONTH'.(line 1, pos 7)

== SQL ==
select dateadd('MONTH', 1, date'2023-05-11')
---^^^
```
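
For comparison, a hedged sketch of the accepted form, where the unit is an
unquoted identifier (assumes a `SparkSession` named `spark` that includes this
change):

```scala
// The unit is an identifier, not a string literal.
spark.sql("SELECT dateadd(MONTH, 1, date'2023-05-11')").show()                // one month later
spark.sql("SELECT datediff(DAY, date'2023-05-11', date'2023-05-15')").show()  // 4
```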

### How was this patch tested?
By running the existing test suites:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly 
org.apache.spark.sql.SQLQueryTestSuite"
```

Closes #41143 from MaxGekk/dateadd-unit-error.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json   |   5 +
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |   4 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala |  18 +-
 .../spark/sql/errors/QueryParsingErrors.scala  |  14 ++
 .../sql-tests/analyzer-results/ansi/date.sql.out   |  88 ++
 .../analyzer-results/ansi/timestamp.sql.out|  88 ++
 .../sql-tests/analyzer-results/date.sql.out|  88 ++
 .../analyzer-results/datetime-legacy.sql.out   | 176 +++
 .../sql-tests/analyzer-results/timestamp.sql.out   |  88 ++
 .../timestampNTZ/timestamp-ansi.sql.out|  88 ++
 .../timestampNTZ/timestamp.sql.out |  88 ++
 .../src/test/resources/sql-tests/inputs/date.sql   |   6 +
 .../test/resources/sql-tests/inputs/timestamp.sql  |   6 +
 .../resources/sql-tests/results/ansi/date.sql.out  |  96 +++
 .../sql-tests/results/ansi/timestamp.sql.out   |  96 +++
 .../test/resources/sql-tests/results/date.sql.out  |  96 +++
 .../sql-tests/results/datetime-legacy.sql.out  | 192 +
 .../resources/sql-tests/results/timestamp.sql.out  |  96 +++
 .../results/timestampNTZ/timestamp-ansi.sql.out|  96 +++
 .../results/timestampNTZ/timestamp.sql.out |  96 +++
 20 files changed, 1521 insertions(+), 4 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index dde165e5fa9..fa838a6da76 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -1051,6 +1051,11 @@
   "expects a binary value with 16, 24 or 32 bytes, but got 
 bytes."
 ]
   },
+  "DATETIME_UNIT" : {
+"message" : [
+  "expects one of the units without quotes YEAR, QUARTER, MONTH, WEEK, 
DAY, DAYOFYEAR, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, but got the 
string literal ."
+]
+  },
   "PATTERN" : {
 "message" : [
   "."
diff --git 
a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
 
b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index 2bc79430343..591b0839ac7 100644
--- 
a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ 
b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -892,8 +892,8 @@ datetimeUnit
 
 primaryExpression
 : 

[GitHub] [spark-website] jerqi commented on pull request #462: Add Jie Yang to committers

2023-05-15 Thread via GitHub


jerqi commented on PR #462:
URL: https://github.com/apache/spark-website/pull/462#issuecomment-1547493795

   Congrats!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43500][PYTHON][TESTS] Test `DataFrame.drop` with empty column list and names containing dot

2023-05-15 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ffd64a32eb2 [SPARK-43500][PYTHON][TESTS] Test `DataFrame.drop` with 
empty column list and names containing dot
ffd64a32eb2 is described below

commit ffd64a32eb2609ecfb68d252723671ec5cca3ffb
Author: Ruifeng Zheng 
AuthorDate: Mon May 15 16:47:15 2023 +0800

[SPARK-43500][PYTHON][TESTS] Test `DataFrame.drop` with empty column list 
and names containing dot

### What changes were proposed in this pull request?
Add tests for:
1. `DataFrame.drop` with an empty column list;
2. `DataFrame.drop` with column names containing a dot.

### Why are the changes needed?
For better test coverage: these two cases were once broken by
[SPARK-39895](https://issues.apache.org/jira/browse/SPARK-39895) and then
fixed in [SPARK-42444](https://issues.apache.org/jira/browse/SPARK-42444).

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Added UTs.
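
For reference, a hedged Scala sketch of the behavior these tests pin down,
assuming the Scala `DataFrame.drop` mirrors the PySpark behavior exercised in
the diff below (assumes a `SparkSession` named `spark`):

```scala
import org.apache.spark.sql.functions.lit

// A top-level column whose name literally contains a dot.
val df = spark.range(1, 3)
  .withColumn("first.name", lit("Peter"))
  .withColumn("state", lit("nc"))

df.drop()                   // no columns given: keeps id, first.name, state
df.drop("first.name")       // drops the literally-named column, leaving id and state
df.drop("unknown.unknown")  // unknown names are silently ignored
```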

Closes #41167 from zhengruifeng/py_test_drop_more.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../sql/tests/connect/test_parity_dataframe.py |  5 +
 python/pyspark/sql/tests/test_dataframe.py | 24 ++
 2 files changed, 29 insertions(+)

diff --git a/python/pyspark/sql/tests/connect/test_parity_dataframe.py 
b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
index a74afc4d504..34f63c1410e 100644
--- a/python/pyspark/sql/tests/connect/test_parity_dataframe.py
+++ b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
@@ -84,6 +84,11 @@ class DataFrameParityTests(DataFrameTestsMixin, 
ReusedConnectTestCase):
 def test_to_pandas_from_mixed_dataframe(self):
 self.check_to_pandas_from_mixed_dataframe()
 
+# TODO(SPARK-43502): DataFrame.drop should support empty column
+@unittest.skip("Fails in Spark Connect, should enable.")
+def test_drop_empty_column(self):
+super().test_drop_empty_column()
+
 
 if __name__ == "__main__":
 import unittest
diff --git a/python/pyspark/sql/tests/test_dataframe.py 
b/python/pyspark/sql/tests/test_dataframe.py
index 715cd1d142c..527a51cc239 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -156,6 +156,30 @@ class DataFrameTestsMixin:
 self.assertEqual(df3.drop("name", df3.age, "unknown").columns, 
["height"])
 self.assertEqual(df3.drop("name", "age", df3.height).columns, [])
 
+def test_drop_empty_column(self):
+df = self.spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, 
"Bob")], ["age", "name"])
+
+self.assertEqual(df.drop().columns, ["age", "name"])
+self.assertEqual(df.drop(*[]).columns, ["age", "name"])
+
+def test_drop_column_name_with_dot(self):
+df = (
+self.spark.range(1, 3)
+.withColumn("first.name", lit("Peter"))
+.withColumn("city.name", lit("raleigh"))
+.withColumn("state", lit("nc"))
+)
+
+self.assertEqual(df.drop("first.name").columns, ["id", "city.name", 
"state"])
+self.assertEqual(df.drop("city.name").columns, ["id", "first.name", 
"state"])
+self.assertEqual(df.drop("first.name", "city.name").columns, ["id", 
"state"])
+self.assertEqual(
+df.drop("first.name", "city.name", "unknown.unknown").columns, 
["id", "state"]
+)
+self.assertEqual(
+df.drop("unknown.unknown").columns, ["id", "first.name", 
"city.name", "state"]
+)
+
 def test_dropna(self):
 schema = StructType(
 [


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] mridulm commented on pull request #462: Add Jie Yang to committers

2023-05-15 Thread via GitHub


mridulm commented on PR #462:
URL: https://github.com/apache/spark-website/pull/462#issuecomment-1547339193

   Congratulations!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org