[spark] branch master updated: [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show

2023-10-11 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5a00631e5805 [SPARK-45442][PYTHON][DOCS] Refine docstring of 
DataFrame.show
5a00631e5805 is described below

commit 5a00631e5805f3c1bc9d8e4827e2cf30ee312274
Author: allisonwang-db 
AuthorDate: Thu Oct 12 13:05:40 2023 +0800

[SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show

### What changes were proposed in this pull request?

This PR refines the docstring of `DataFrame.show` by adding more examples.
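For quick reference, a minimal sketch of the parameters this docstring documents (a hedged example: it assumes a local `SparkSession` named `spark` and toy data, not anything taken from this commit):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(14, "Tom"), (23, "Alice"), (19, "This is a super long name")], ["age", "name"]
)

df.show()                     # first 20 rows, long strings truncated to 20 chars
df.show(n=2, truncate=False)  # only 2 rows, full column content
df.show(truncate=3)           # truncate strings to 3 characters
df.show(vertical=True)        # one line per column value
```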

### Why are the changes needed?

To improve PySpark documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

doctest

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43252 from allisonwang-db/spark-45442-refine-show.

Authored-by: allisonwang-db 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/dataframe.py | 49 -
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index c44838c0ee11..637787ceb660 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -887,7 +887,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         return self._jdf.isEmpty()
 
     def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None:
-        """Prints the first ``n`` rows to the console.
+        """
+        Prints the first ``n`` rows of the DataFrame to the console.
 
         .. versionadded:: 1.3.0
 
@@ -896,20 +897,32 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 
         Parameters
         ----------
-        n : int, optional
+        n : int, optional, default 20
             Number of rows to show.
-        truncate : bool or int, optional
-            If set to ``True``, truncate strings longer than 20 chars by default.
+        truncate : bool or int, optional, default True
+            If set to ``True``, truncate strings longer than 20 chars.
             If set to a number greater than one, truncates long strings to length ``truncate``
             and align cells right.
         vertical : bool, optional
-            If set to ``True``, print output rows vertically (one line
-            per column value).
+            If set to ``True``, print output rows vertically (one line per column value).
 
         Examples
         --------
         >>> df = spark.createDataFrame([
-        ...     (14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
+        ...     (14, "Tom"), (23, "Alice"), (16, "Bob"), (19, "This is a super long name")],
+        ...     ["age", "name"])
+
+        Show :class:`DataFrame`
+
+        >>> df.show()
+        +---+--------------------+
+        |age|                name|
+        +---+--------------------+
+        | 14|                 Tom|
+        | 23|               Alice|
+        | 16|                 Bob|
+        | 19|This is a super l...|
+        +---+--------------------+
 
         Show only top 2 rows.
 
@@ -922,6 +935,18 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         +---+-----+
         only showing top 2 rows
 
+        Show full column content without truncation.
+
+        >>> df.show(truncate=False)
+        +---+-------------------------+
+        |age|name                     |
+        +---+-------------------------+
+        |14 |Tom                      |
+        |23 |Alice                    |
+        |16 |Bob                      |
+        |19 |This is a super long name|
+        +---+-------------------------+
+
         Show :class:`DataFrame` where the maximum number of characters is 3.
 
         >>> df.show(truncate=3)
@@ -931,20 +956,24 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         | 14| Tom|
         | 23| Ali|
         | 16| Bob|
+        | 19| Thi|
         +---+----+
 
         Show :class:`DataFrame` vertically.
 
         >>> df.show(vertical=True)
-        -RECORD 0---
+        -RECORD 0------------------
         age  | 14
         name | Tom
-        -RECORD 1---
+        -RECORD 1------------------
         age  | 23
         name | Alice
-        -RECORD 2---
+        -RECORD 2------------------
         age  | 16
         name | Bob
+        -RECORD 3------------------
+        age  | 19
+        name | This is a super l...
         """
 
 if not isinstance(n, int) or isinstance(n, bool):





[spark] branch master updated: [SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 after get_json_object supports codegen

2023-10-11 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 95653904a116 [SPARK-42881][SQL][FOLLOWUP] Update the results of 
JsonBenchmark-jdk21 after get_json_object supports codegen
95653904a116 is described below

commit 95653904a116a8220972108a94d70a15827f3c66
Author: panbingkun 
AuthorDate: Thu Oct 12 11:08:43 2023 +0800

[SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 
after get_json_object supports codegen

### What changes were proposed in this pull request?
The pr aims to follow up https://github.com/apache/spark/pull/40506 and update JsonBenchmark-jdk21-results.txt for it.

### Why are the changes needed?
Update JsonBenchmark-jdk21-results.txt.
https://github.com/panbingkun/spark/actions/runs/6489918873

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only updates the results of the benchmark.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43346 from panbingkun/get_json_object_followup.

Authored-by: panbingkun 
Signed-off-by: yangjie01 
---
 .../benchmarks/JsonBenchmark-jdk21-results.txt | 153 +++--
 1 file changed, 77 insertions(+), 76 deletions(-)

diff --git a/sql/core/benchmarks/JsonBenchmark-jdk21-results.txt 
b/sql/core/benchmarks/JsonBenchmark-jdk21-results.txt
index 3b48a59e660a..f0e19c0ecf9a 100644
--- a/sql/core/benchmarks/JsonBenchmark-jdk21-results.txt
+++ b/sql/core/benchmarks/JsonBenchmark-jdk21-results.txt
@@ -3,127 +3,128 @@ Benchmark for performance of JSON parsing
 

 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 21+35 on Linux 5.15.0-1046-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 21+35-LTS on Linux 5.15.0-1047-azure
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 JSON schema inferring:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------------------
-No encoding                                        2855           2912          65        1.8         571.0       1.0X
-UTF-8 is set                                       4699           4723          31        1.1         939.9       0.6X
+No encoding                                        2944           3061         191        1.7         588.8       1.0X
+UTF-8 is set                                       4437           4465          26        1.1         887.5       0.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 21+35 on Linux 5.15.0-1046-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 21+35-LTS on Linux 5.15.0-1047-azure
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 count a short column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------------------
-No encoding                                        2946           2952          10        1.7         589.1       1.0X
-UTF-8 is set                                       4557           4580          32        1.1         911.4       0.6X
+No encoding                                        2545           2567          31        2.0         509.0       1.0X
+UTF-8 is set                                       4020           4028           9        1.2         804.1       0.6X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 21+35 on Linux 5.15.0-1046-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 21+35-LTS on Linux 5.15.0-1047-azure
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 count a wide column:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------------------
-No encoding                                        6977           7229         433        0.1        6977.2       1.0X
-UTF-8 is set                                       6373           6394          25        0.2        6372.9       1.1X
+No encoding                                        6786           6939         264        0.1        6785.7       1.0X
+UTF-8 is set                                       5668           5680          11        0.2        5668.1       1.2X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 21+35 on Linux 5.15.0-1046-azure

[spark] branch master updated: [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result

2023-10-11 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 69cf80d25f0e [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 
'terminate' methods to consume previous 'analyze' result
69cf80d25f0e is described below

commit 69cf80d25f0e4ed46ec38a63e063471988c31732
Author: Daniel Tenedorio 
AuthorDate: Wed Oct 11 18:52:06 2023 -0700

[SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods 
to consume previous 'analyze' result

### What changes were proposed in this pull request?

This PR adds a Python UDTF API for the `eval` and `terminate` methods to 
consume the previous `analyze` result.

This also works for subclasses of the `AnalyzeResult` class, allowing the 
UDTF to return custom state from `analyze` to be consumed later.

For example, we can now define a UDTF that performs complex initialization in the `analyze` method and then returns the result of that in the `terminate` method:

```
def MyUDTF(self):
    @dataclass
    class AnalyzeResultWithBuffer(AnalyzeResult):
        buffer: str

    @udtf
    class TestUDTF:
        def __init__(self, analyze_result):
            self._total = 0
            self._buffer = do_complex_initialization(analyze_result.buffer)

        @staticmethod
        def analyze(argument, _):
            return AnalyzeResultWithBuffer(
                schema=StructType()
                    .add("total", IntegerType())
                    .add("buffer", StringType()),
                with_single_partition=True,
                buffer=argument.value,
            )

        def eval(self, argument, row: Row):
            self._total += 1

        def terminate(self):
            yield self._total, self._buffer

    self.spark.udtf.register("my_ddtf", MyUDTF)
```

Then the results might look like:

```
sql(
"""
WITH t AS (
  SELECT id FROM range(1, 21)
)
SELECT total, buffer
FROM test_udtf("abc", TABLE(t))
"""
).collect()

> 20, "complex_initialization_result"
```

### Why are the changes needed?

In this way, the UDTF can perform potentially expensive initialization logic in the `analyze` method just once and return the result of that initialization, rather than repeating the initialization in `eval`.

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds new unit test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43204 from dtenedor/prepare-string.

Authored-by: Daniel Tenedorio 
Signed-off-by: Takuya UESHIN 
---
 python/docs/source/user_guide/sql/python_udtf.rst  | 124 -
 python/pyspark/sql/tests/test_udtf.py  |  53 +
 python/pyspark/sql/udtf.py |   5 +-
 python/pyspark/sql/worker/analyze_udtf.py  |   2 +
 python/pyspark/worker.py   |  34 +-
 .../spark/sql/catalyst/analysis/Analyzer.scala |   5 +-
 .../spark/sql/catalyst/expressions/PythonUDF.scala |  20 +++-
 .../execution/python/BatchEvalPythonUDTFExec.scala |   8 ++
 .../python/UserDefinedPythonFunction.scala |   7 +-
 .../sql-tests/analyzer-results/udtf/udtf.sql.out   |  26 +++--
 .../test/resources/sql-tests/inputs/udtf/udtf.sql  |   9 +-
 .../resources/sql-tests/results/udtf/udtf.sql.out  |  28 +++--
 .../apache/spark/sql/IntegratedUDFTestUtils.scala  |  64 ++-
 .../sql/execution/python/PythonUDTFSuite.scala |  42 +--
 14 files changed, 374 insertions(+), 53 deletions(-)

diff --git a/python/docs/source/user_guide/sql/python_udtf.rst 
b/python/docs/source/user_guide/sql/python_udtf.rst
index 74d8eb889861..fb42644dc702 100644
--- a/python/docs/source/user_guide/sql/python_udtf.rst
+++ b/python/docs/source/user_guide/sql/python_udtf.rst
@@ -50,10 +50,108 @@ To implement a Python UDTF, you first need to define a 
class implementing the me
 
 Notes
 -
-- This method does not accept any extra arguments. Only the default
-  constructor is supported.
 - You cannot create or reference the Spark session within the 
UDTF. Any
   attempt to do so will result in a serialization error.
+- If the below `analyze` method is implemented, it is also 
possible to define this
+  method as: `__init__(self, analyze_result: AnalyzeResult)`. In 
this case, the result
+  of the `analyze` method is passed into all future instantiations 
of this UDTF class.
+

[spark] branch master updated: [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic

2023-10-11 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 045eb2d6dade [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests 
deterministic
045eb2d6dade is described below

commit 045eb2d6dadec905f5c8f249fe19be6001107668
Author: Ruifeng Zheng 
AuthorDate: Thu Oct 12 09:20:15 2023 +0800

[SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic

### What changes were proposed in this pull request?
sort before show

### Why are the changes needed?
The order of rows is non-deterministic after groupBy, so the tests fail in some environments.
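For illustration, a minimal sketch of the pattern applied here (an explicit `orderBy` before `show`; the session and toy data are assumptions mirroring the doctest in the diff below):

```python
from pyspark.sql import SparkSession, functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name"))

# Row order after a groupBy is not guaranteed, so sort explicitly before
# showing to keep doctest output deterministic across environments.
grouped = df.groupBy("name").agg(sf.sort_array(sf.collect_list("id")).alias("sorted_list"))
grouped.orderBy(sf.desc("name")).show()
```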

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #43331 from zhengruifeng/py_collect_groupby.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/functions.py | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 04968440e394..25958bdf15da 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -3751,7 +3751,8 @@ def collect_list(col: "ColumnOrName") -> Column:
 
 >>> from pyspark.sql import functions as sf
 >>> df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], 
("id", "name"))
->>> df.groupBy("name").agg(sf.sort_array(sf.collect_list('id')).alias('sorted_list')).show()
+>>> df = df.groupBy("name").agg(sf.sort_array(sf.collect_list('id')).alias('sorted_list'))
+>>> df.orderBy(sf.desc("name")).show()
 +----+-----------+
 |name|sorted_list|
 +----+-----------+
@@ -3842,7 +3843,8 @@ def collect_set(col: "ColumnOrName") -> Column:
 
 >>> from pyspark.sql import functions as sf
 >>> df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], 
("id", "name"))
->>> df.groupBy("name").agg(sf.sort_array(sf.collect_set('id')).alias('sorted_set')).show()
+>>> df = df.groupBy("name").agg(sf.sort_array(sf.collect_set('id')).alias('sorted_set'))
+>>> df.orderBy(sf.desc("name")).show()
 +----+----------+
 |name|sorted_set|
 +----+----------+





[spark] branch master updated: [SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet

2023-10-11 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4027474cc744 [SPARK-45221][PYTHON][DOCS] Refine docstring of 
DataFrameReader.parquet
4027474cc744 is described below

commit 4027474cc74438b29b0eae38f07ab03aeab99f5a
Author: Hyukjin Kwon 
AuthorDate: Thu Oct 12 09:24:27 2023 +0900

[SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet

### What changes were proposed in this pull request?

This PR refines the docstring of DataFrameReader.parquet by adding more 
examples.
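As context, a small hedged sketch of the reader being documented (the paths, data, and the `pathGlobFilter` option shown are illustrative assumptions, not part of this commit):

```python
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(10, "Alice"), (15, "Bob")], ["age", "name"])

with tempfile.TemporaryDirectory() as d:
    df.write.mode("overwrite").parquet(d)
    # pathGlobFilter restricts which files in the directory are read back.
    spark.read.option("pathGlobFilter", "*.parquet").parquet(d).show()
```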

### Why are the changes needed?

To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

doctest

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43301 from allisonwang-db/spark-45221-refine-parquet.

Lead-authored-by: Hyukjin Kwon 
Co-authored-by: allisonwang-db 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/readwriter.py | 68 ++--
 1 file changed, 58 insertions(+), 10 deletions(-)

diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index cfac8fdbc68b..ea429a75e157 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -495,6 +495,7 @@ class DataFrameReader(OptionUtils):
 Parameters
 --
 paths : str
+One or more file paths to read the Parquet files from.
 
 Other Parameters
 
@@ -505,24 +506,71 @@ class DataFrameReader(OptionUtils):
 
 .. # noqa
 
+Returns
+---
+:class:`DataFrame`
+A DataFrame containing the data from the Parquet files.
+
 Examples
 
+Create sample dataframes.
+
+>>> df = spark.createDataFrame(
+...     [(10, "Alice"), (15, "Bob"), (20, "Tom")], schema=["age", "name"])
+>>> df2 = spark.createDataFrame([(70, "Alice"), (80, "Bob")], schema=["height", "name"])
+
 Write a DataFrame into a Parquet file and read it back.
 
 >>> import tempfile
 >>> with tempfile.TemporaryDirectory() as d:
-... # Write a DataFrame into a Parquet file
-... spark.createDataFrame(
-... [{"age": 100, "name": "Hyukjin Kwon"}]
-... ).write.mode("overwrite").format("parquet").save(d)
+... # Write a DataFrame into a Parquet file.
+... df.write.mode("overwrite").format("parquet").save(d)
 ...
 ... # Read the Parquet file as a DataFrame.
-... spark.read.parquet(d).show()
-+---+------------+
-|age|        name|
-+---+------------+
-|100|Hyukjin Kwon|
-+---+------------+
+... spark.read.parquet(d).orderBy("name").show()
++---+-----+
+|age| name|
++---+-----+
+| 10|Alice|
+| 15|  Bob|
+| 20|  Tom|
++---+-----+
+
+Read a Parquet file with a specific column.
+
+>>> with tempfile.TemporaryDirectory() as d:
+... df.write.mode("overwrite").format("parquet").save(d)
+...
+... # Read the Parquet file with only the 'name' column.
+... spark.read.schema("name string").parquet(d).orderBy("name").show()
++-----+
+| name|
++-----+
+|Alice|
+|  Bob|
+|  Tom|
++-----+
+
+Read multiple Parquet files and merge schema.
+
+>>> with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
+... df.write.mode("overwrite").format("parquet").save(d1)
+... df2.write.mode("overwrite").format("parquet").save(d2)
+...
+... spark.read.option(
+... "mergeSchema", "true"
+... ).parquet(d1, d2).select(
+... "name", "age", "height"
+... ).orderBy("name", "age").show()
++-----+----+------+
+| name| age|height|
++-----+----+------+
+|Alice|NULL|    70|
+|Alice|  10|  NULL|
+|  Bob|NULL|    80|
+|  Bob|  15|  NULL|
+|  Tom|  20|  NULL|
++-----+----+------+
 """
 mergeSchema = options.get("mergeSchema", None)
 pathGlobFilter = options.get("pathGlobFilter", None)





[spark] branch dependabot/maven/connector/kafka-0-10-sql/org.apache.zookeeper-zookeeper-3.7.2 deleted (was b4b69957bd66)

2023-10-11 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/connector/kafka-0-10-sql/org.apache.zookeeper-zookeeper-3.7.2
in repository https://gitbox.apache.org/repos/asf/spark.git


 was b4b69957bd66 Bump org.apache.zookeeper:zookeeper in 
/connector/kafka-0-10-sql

The revisions that were on this branch are still contained in
other references; therefore, this change does not discard any commits
from the repository.





[spark] branch dependabot/maven/connector/kafka-0-10/org.apache.zookeeper-zookeeper-3.7.2 deleted (was 18ddbd8d5893)

2023-10-11 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/connector/kafka-0-10/org.apache.zookeeper-zookeeper-3.7.2
in repository https://gitbox.apache.org/repos/asf/spark.git


 was 18ddbd8d5893 Bump org.apache.zookeeper:zookeeper in 
/connector/kafka-0-10

The revisions that were on this branch are still contained in
other references; therefore, this change does not discard any commits
from the repository.





[spark] branch dependabot/maven/org.apache.zookeeper-zookeeper-3.7.2 deleted (was f083a40d537e)

2023-10-11 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/org.apache.zookeeper-zookeeper-3.7.2
in repository https://gitbox.apache.org/repos/asf/spark.git


 was f083a40d537e Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2

The revisions that were on this branch are still contained in
other references; therefore, this change does not discard any commits
from the repository.





[spark-connect-go] branch dependabot/go_modules/golang.org/x/net-0.17.0 deleted (was d3523ef)

2023-10-11 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/go_modules/golang.org/x/net-0.17.0
in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git


 was d3523ef  Bump golang.org/x/net from 0.10.0 to 0.17.0

The revisions that were on this branch are still contained in
other references; therefore, this change does not discard any commits
from the repository.





[spark-connect-go] branch master updated: Bump golang.org/x/net from 0.10.0 to 0.17.0

2023-10-11 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git


The following commit(s) were added to refs/heads/master by this push:
 new 3007733  Bump golang.org/x/net from 0.10.0 to 0.17.0
3007733 is described below

commit 3007733a4f6e18a6e67c6302bee5941d7f888414
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
AuthorDate: Thu Oct 12 09:12:52 2023 +0900

Bump golang.org/x/net from 0.10.0 to 0.17.0

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.10.0 to 
0.17.0.

Commits
- b225e7c http2: limit maximum handler goroutines to MaxConcurrentStreams
  (https://github.com/golang/net/commit/b225e7ca6dde1ef5a5ae5ce922861bda011cfabd)
- 88194ad go.mod: update golang.org/x dependencies
  (https://github.com/golang/net/commit/88194ad8ab44a02ea952c169883c3f57db6cf9f4)
- 2b60a61 quic: fix several bugs in flow control accounting
  (https://github.com/golang/net/commit/2b60a61f1e4cf3a5ecded0bd7e77ea168289e6de)
- 73d82ef quic: handle DATA_BLOCKED frames
  (https://github.com/golang/net/commit/73d82efb96cacc0c378bc150b56675fc191894b9)
- 5d5a036 quic: handle streams moving from the data queue to the meta queue
  (https://github.com/golang/net/commit/5d5a036a503f8accd748f7453c0162115187be13)
- 350aad2 quic: correctly extend peer's flow control window after MAX_DATA
  (https://github.com/golang/net/commit/350aad2603e57013fafb1a9e2089a382fe67dc80)
- 21814e7 quic: validate connection id transport parameters
  (https://github.com/golang/net/commit/21814e71db756f39b69fb1a3e06350fa555a79b1)
- a600b35 quic: avoid redundant MAX_DATA updates
  (https://github.com/golang/net/commit/a600b3518eed7a9a4e24380b4b249cb986d9b64d)
- ea63359 http2: check stream body is present on read timeout
  (https://github.com/golang/net/commit/ea633599b58dc6a50d33c7f5438edfaa8bc313df)
- ddd8598 quic: version negotiation
  (https://github.com/golang/net/commit/ddd8598e5694aa5e966e44573a53e895f6fa5eb2)
- Additional commits viewable in the compare view:
  https://github.com/golang/net/compare/v0.10.0...v0.17.0




[![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=golang.org/x/net=go_modules=0.10.0=0.17.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---


Dependabot commands and options


You can trigger Dependabot actions by commenting on this PR:
- `dependabot rebase` will rebase this PR
- `dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
- `dependabot merge` will merge this PR after your CI passes on it
- `dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
- `dependabot cancel merge` will cancel a previously requested merge and 
block automerging
- `dependabot reopen` will reopen this PR if it is closed
- `dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
- `dependabot show  ignore conditions` will show all of 
the ignore conditions of the specified dependency
- `dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
- `dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
- `dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security 
Alerts page](https://github.com/apache/spark-connect-go/network/alerts).



Closes #16 from 
dependabot[bot]/dependabot/go_modules/golang.org/x/net-0.17.0.

Authored-by: dependabot[bot] 
<49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
---
 go.mod |  6 +++---
 go.sum | 12 ++--
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/go.mod b/go.mod
index 6b4c0a0..d518fc5 100644
--- a/go.mod
+++ b/go.mod
@@ -49,10 +49,10 @@ require (
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/zeebo/xxh3 v1.0.2 // indirect
golang.org/x/mod v0.10.0 // indirect
-   golang.org/x/net v0.10.0 // indirect
+   golang.org/x/net v0.17.0 // indirect

[spark] branch master updated: [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore

2023-10-11 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a5f019554991 [SPARK-45415] Allow selective disabling of "fallocate" in 
RocksDB statestore
a5f019554991 is described below

commit a5f01955499141c53c619ddf81d6846a72ad789a
Author: Scott Schenkein 
AuthorDate: Thu Oct 12 08:44:13 2023 +0900

[SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore

### What changes were proposed in this pull request?
Our spark environment features a number of parallel structured streaming jobs, many of which use the state store. Most use the state store for dropDuplicates and work with a tiny amount of information, but a few have a substantially large state store requiring use of RocksDB. In such a configuration, spark allocates a minimum of `spark.sql.shuffle.partitions * queryCount` partitions, each of which pre-allocates about 74 MB (observed on EMR/Hadoop) of disk storage for RocksDB. This allocati [...]

This PR provides users with the option to simply disable fallocate so 
RocksDB uses far less space for the smaller state stores, reducing complexity 
and disk storage at the expense of performance.

### Why are the changes needed?

As previously mentioned, these changes allow a spark context to support 
many parallel structured streaming jobs when using RocksDB state stores without 
the need to allocate a glut of excess storage.

### Does this PR introduce _any_ user-facing change?

Users disable the fallocate rocksdb performance optimization by configuring 
`spark.sql.streaming.stateStore.rocksdb.allowFAllocate=false`

### How was this patch tested?

1) A few test cases were added
2) The state store size was validated by running this script with and 
without fallocate disabled

```
from pyspark.sql.types import StructType, StructField, StringType, 
TimestampType
import datetime

if disable_fallocate:
    spark.conf.set("spark.sql.streaming.stateStore.rocksdb.allowFAllocate", "false")

spark.conf.set(
"spark.sql.streaming.stateStore.providerClass",

"org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)

schema = StructType(
[
StructField("one", TimestampType(), False),
StructField("two", StringType(), True),
]
)

now = datetime.datetime.now()
data = [(now, y) for y in range(300)]
init_df = spark.createDataFrame(data, schema)

path = "/tmp/stream_try/test"
init_df.write.format("parquet").mode("append").save(path)

stream_df = spark.readStream.schema(schema).format("parquet").load(path)

stream_df = stream_df.dropDuplicates(["one"])

def foreach_batch_function(batch_df, epoch_id):
    batch_df.write.format("parquet").mode("append").option("path", path + "_out").save()

stream_df.writeStream.foreachBatch(foreach_batch_function).option(
"checkpointLocation", path + "_checkpoint"
).start()
```

With these results (local run, docker container with small FS)
```
allowFAllocate=True (current default)
-
root@0ef384f699e0:/tmp# du -sh spark-d43a2964-c92a-4d94-9fdd-f3557a651fd9
808M    spark-d43a2964-c92a-4d94-9fdd-f3557a651fd9
|
|-->4.1M
StateStoreId(opId=0,partId=0,name=default)-d59b907c-8004-47f9-a8a1-dec131f73505
|--> 
|-->4.1M
StateStoreId(opId=0,partId=199,name=default)-b49a93fe-1007-4e92-8f8f-5767aef41e5c

allowFAllocate=False (new feature)
--
root@0ef384f699e0:/tmp# du -sh spark-00cb768d-2659-453c-8670-4aaf70148041
7.9M    spark-00cb768d-2659-453c-8670-4aaf70148041
|
|-->40K StateStoreId(opId=0,partId=0,name=default)-45b38d9c-737b-49b1-bb82-
|--> 
|-->40K 
StateStoreId(opId=0,partId=199,name=default)-28a6cc02-2693-4360-b47a-1f1ab0d54a61
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43202 from schenksj/feature/rocksdb_allow_fallocate.

Authored-by: Scott Schenkein 
Signed-off-by: Jungtaek Lim 
---
 docs/structured-streaming-programming-guide.md|  5 +
 .../spark/sql/execution/streaming/state/RocksDB.scala | 15 +--
 .../streaming/state/RocksDBStateStoreSuite.scala  |  2 ++
 .../sql/execution/streaming/state/RocksDBSuite.scala  |  6 ++
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 774422a9cd9d..9fb823abaa3a 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -2385,6 +2385,11 

[spark-connect-go] branch dependabot/go_modules/golang.org/x/net-0.17.0 created (now d3523ef)

2023-10-11 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/go_modules/golang.org/x/net-0.17.0
in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git


  at d3523ef  Bump golang.org/x/net from 0.10.0 to 0.17.0

No new revisions were added by this update.





[spark] branch dependabot/maven/org.apache.zookeeper-zookeeper-3.7.2 created (now f083a40d537e)

2023-10-11 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/org.apache.zookeeper-zookeeper-3.7.2
in repository https://gitbox.apache.org/repos/asf/spark.git


  at f083a40d537e Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2

No new revisions were added by this update.





[spark] branch dependabot/maven/connector/kafka-0-10/org.apache.zookeeper-zookeeper-3.7.2 created (now 18ddbd8d5893)

2023-10-11 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/connector/kafka-0-10/org.apache.zookeeper-zookeeper-3.7.2
in repository https://gitbox.apache.org/repos/asf/spark.git


  at 18ddbd8d5893 Bump org.apache.zookeeper:zookeeper in 
/connector/kafka-0-10

No new revisions were added by this update.





[spark] branch dependabot/maven/connector/kafka-0-10-sql/org.apache.zookeeper-zookeeper-3.7.2 created (now b4b69957bd66)

2023-10-11 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/connector/kafka-0-10-sql/org.apache.zookeeper-zookeeper-3.7.2
in repository https://gitbox.apache.org/repos/asf/spark.git


  at b4b69957bd66 Bump org.apache.zookeeper:zookeeper in 
/connector/kafka-0-10-sql

No new revisions were added by this update.





[spark] branch master updated: [SPARK-44855][CONNECT] Small tweaks to attaching ExecuteGrpcResponseSender to ExecuteResponseObserver

2023-10-11 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 292a1131b542 [SPARK-44855][CONNECT] Small tweaks to attaching 
ExecuteGrpcResponseSender to ExecuteResponseObserver
292a1131b542 is described below

commit 292a1131b542ddc7b227a7e51e4f4233f3d2f9d8
Author: Juliusz Sompolski 
AuthorDate: Wed Oct 11 15:01:20 2023 -0400

[SPARK-44855][CONNECT] Small tweaks to attaching ExecuteGrpcResponseSender 
to ExecuteResponseObserver

### What changes were proposed in this pull request?

Small improvements can be made to the way new ExecuteGrpcResponseSender is 
attached to observer.
* Since now we have addGrpcResponseSender in ExecuteHolder, it should be ExecuteHolder's responsibility to interrupt the old sender and ensure there is only one at a time, rather than ExecuteResponseObserver's responsibility.
* executeObserver is used as a lock for synchronization. An explicit lock 
object could be better.

Fix a small bug where ExecuteGrpcResponseSender would not be woken up by interrupt if it was sleeping on the grpcCallObserverReadySignal. This would result in the sender potentially sleeping until the deadline (2 minutes) and only then being removed, which could delay timing the execution out by these 2 minutes. It should **not** cause any hang or wait on the client side, because if ExecuteGrpcResponseSender is interrupted, it means that the client has already come back with a ne [...]
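As a loose illustration of the fixed wake-up pattern (a hypothetical Python sketch only; the real code is Scala in the Connect server and uses the observer's `responseLock`):

```python
import threading
import time

# Producer and consumer share one explicit condition object, so an interrupt
# wakes a sleeping consumer immediately instead of letting it wait out the
# full deadline.
response_lock = threading.Condition()
responses = []
interrupted = False


def interrupt():
    global interrupted
    with response_lock:
        interrupted = True
        response_lock.notify_all()  # wake any waiter right away


def next_response(deadline_s=120.0):
    deadline = time.monotonic() + deadline_s
    with response_lock:
        while not responses and not interrupted:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None  # timed out
            response_lock.wait(timeout=remaining)
        return responses.pop(0) if responses else None
```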

### Why are the changes needed?

Minor cleanup of previous work.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests in ReattachableExecuteSuite.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43181 from juliuszsompolski/SPARK-44855.

Authored-by: Juliusz Sompolski 
Signed-off-by: Herman van Hovell 
---
 .../execution/ExecuteGrpcResponseSender.scala  | 26 -
 .../execution/ExecuteResponseObserver.scala| 44 ++
 .../spark/sql/connect/service/ExecuteHolder.scala  |  4 ++
 3 files changed, 40 insertions(+), 34 deletions(-)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala
index 08496c36b28a..ba5ecc7a045a 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala
@@ -63,15 +63,15 @@ private[connect] class ExecuteGrpcResponseSender[T <: 
Message](
   /**
* Interrupt this sender and make it exit.
*/
-  def interrupt(): Unit = executionObserver.synchronized {
+  def interrupt(): Unit = {
 interrupted = true
-executionObserver.notifyAll()
+wakeUp()
   }
 
   // For testing
-  private[connect] def setDeadline(deadlineMs: Long) = 
executionObserver.synchronized {
+  private[connect] def setDeadline(deadlineMs: Long) = {
 deadlineTimeMillis = deadlineMs
-executionObserver.notifyAll()
+wakeUp()
   }
 
   def run(lastConsumedStreamIndex: Long): Unit = {
@@ -152,9 +152,6 @@ private[connect] class ExecuteGrpcResponseSender[T <: 
Message](
 s"lastConsumedStreamIndex=$lastConsumedStreamIndex")
 val startTime = System.nanoTime()
 
-// register to be notified about available responses.
-executionObserver.attachConsumer(this)
-
 var nextIndex = lastConsumedStreamIndex + 1
 var finished = false
 
@@ -191,7 +188,7 @@ private[connect] class ExecuteGrpcResponseSender[T <: 
Message](
 sentResponsesSize > maximumResponseSize || deadlineTimeMillis < 
System.currentTimeMillis()
 
   logTrace(s"Trying to get next response with index=$nextIndex.")
-  executionObserver.synchronized {
+  executionObserver.responseLock.synchronized {
 logTrace(s"Acquired executionObserver lock.")
 val sleepStart = System.nanoTime()
 var sleepEnd = 0L
@@ -208,7 +205,7 @@ private[connect] class ExecuteGrpcResponseSender[T <: 
Message](
   if (response.isEmpty) {
 val timeout = Math.max(1, deadlineTimeMillis - 
System.currentTimeMillis())
 logTrace(s"Wait for response to become available with 
timeout=$timeout ms.")
-executionObserver.wait(timeout)
+executionObserver.responseLock.wait(timeout)
 logTrace(s"Reacquired executionObserver lock after waiting.")
 sleepEnd = System.nanoTime()
   }
@@ -339,4 +336,15 @@ private[connect] class ExecuteGrpcResponseSender[T <: 
Message](
   

[spark] branch master updated (eae5c0e1efc -> 5ad57a70e51)

2023-10-11 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from eae5c0e1efc [SPARK-45433][SQL] Fix CSV/JSON schema inference when 
timestamps do not match specified timestampFormat
 add 5ad57a70e51 [SPARK-45204][CONNECT] Add optional ExecuteHolder to 
SparkConnectPlanner

No new revisions were added by this update.

Summary of changes:
 .../connect/execution/ExecuteThreadRunner.scala|  7 +-
 .../execution/SparkConnectPlanExecution.scala  |  2 +-
 .../sql/connect/planner/SparkConnectPlanner.scala  | 84 ++
 .../connect/planner/SparkConnectPlannerSuite.scala | 11 +--
 .../plugin/SparkConnectPluginRegistrySuite.scala   |  4 +-
 5 files changed, 50 insertions(+), 58 deletions(-)





[spark] branch branch-3.5 updated: [SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat

2023-10-11 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 7e3ddc1e582 [SPARK-45433][SQL] Fix CSV/JSON schema inference when 
timestamps do not match specified timestampFormat
7e3ddc1e582 is described below

commit 7e3ddc1e582a6e4fa96bab608c4c2bbc2c93b449
Author: Jia Fan 
AuthorDate: Wed Oct 11 19:33:23 2023 +0300

[SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not 
match specified timestampFormat

### What changes were proposed in this pull request?
This PR fixes CSV/JSON schema inference, which reported an error when timestamps did not match the specified timestampFormat.
```scala
//eg
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show()
//error
Caused by: java.time.format.DateTimeParseException: Text 
'2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
```
This bug only happened when a partition had one row. The data type should be `StringType`, not `TimestampType`, because the value does not match `timestampFormat`.

Using CSV as an example: in `CSVInferSchema::tryParseTimestampNTZ`, `timestampNTZFormatter.parseWithoutTimeZoneOptional` is used first and infers `TimestampType`. If the same partition has another row, `tryParseTimestamp` then parses that row with the user-defined `timestampFormat`, finds it cannot be converted to a timestamp with `timestampFormat`, and finally returns `StringType`. But when there is only one row, we use `timestampNTZFormatter.parseWithoutTimeZoneOptional` to parse a normal timestamp not r [...]
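As a hedged illustration of the post-fix behavior (a PySpark sketch with an assumed local session; the Scala snippet above is the authoritative repro):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single record whose value does not fully match the given timestampFormat.
rdd = spark.sparkContext.parallelize(["2884-06-24T02:45:51.138"])
df = (
    spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
    .option("inferSchema", True)
    .csv(rdd)
)
df.printSchema()         # with the fix, the lone column falls back to string
df.show(truncate=False)
```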

### Why are the changes needed?
Fix schema inference bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add new test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43243 from 
Hisoka-X/SPARK-45433-inference-mismatch-timestamp-one-row.

Authored-by: Jia Fan 
Signed-off-by: Max Gekk 
(cherry picked from commit eae5c0e1efce83c2bb08754784db070be285285a)
Signed-off-by: Max Gekk 
---
 .../org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala |  9 ++---
 .../org/apache/spark/sql/catalyst/json/JsonInferSchema.scala   |  8 +---
 .../apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala| 10 ++
 .../apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala  |  8 
 4 files changed, 29 insertions(+), 6 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
index 51586a0065e..ec01b56f9eb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
@@ -27,7 +27,7 @@ import org.apache.spark.sql.catalyst.expressions.ExprUtils
 import org.apache.spark.sql.catalyst.util.{DateFormatter, TimestampFormatter}
 import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
 import org.apache.spark.sql.errors.QueryExecutionErrors
-import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
 import org.apache.spark.sql.types._
 
 class CSVInferSchema(val options: CSVOptions) extends Serializable {
@@ -202,8 +202,11 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 // We can only parse the value as TimestampNTZType if it does not have 
zone-offset or
 // time-zone component and can be parsed with the timestamp formatter.
 // Otherwise, it is likely to be a timestamp with timezone.
-if (timestampNTZFormatter.parseWithoutTimeZoneOptional(field, 
false).isDefined) {
-  SQLConf.get.timestampType
+val timestampType = SQLConf.get.timestampType
+if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
+timestampType == TimestampNTZType) &&
+timestampNTZFormatter.parseWithoutTimeZoneOptional(field, 
false).isDefined) {
+  timestampType
 } else {
   tryParseTimestamp(field)
 }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
index 5385afe8c93..4123c5290b6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
@@ -32,7 +32,7 @@ import 
org.apache.spark.sql.catalyst.json.JacksonUtils.nextUntil
 import org.apache.spark.sql.catalyst.util._
 import 

[spark] branch master updated: [SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat

2023-10-11 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new eae5c0e1efc [SPARK-45433][SQL] Fix CSV/JSON schema inference when 
timestamps do not match specified timestampFormat
eae5c0e1efc is described below

commit eae5c0e1efce83c2bb08754784db070be285285a
Author: Jia Fan 
AuthorDate: Wed Oct 11 19:33:23 2023 +0300

[SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not 
match specified timestampFormat

### What changes were proposed in this pull request?
This PR fixes CSV/JSON schema inference, which reported an error when timestamps did not match the specified timestampFormat.
```scala
//eg
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show()
//error
Caused by: java.time.format.DateTimeParseException: Text 
'2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
```
This bug only happened when a partition had one row. The data type should be `StringType`, not `TimestampType`, because the value does not match `timestampFormat`.

Using CSV as an example: in `CSVInferSchema::tryParseTimestampNTZ`, `timestampNTZFormatter.parseWithoutTimeZoneOptional` is used first and infers `TimestampType`. If the same partition has another row, `tryParseTimestamp` then parses that row with the user-defined `timestampFormat`, finds it cannot be converted to a timestamp with `timestampFormat`, and finally returns `StringType`. But when there is only one row, we use `timestampNTZFormatter.parseWithoutTimeZoneOptional` to parse a normal timestamp not r [...]

### Why are the changes needed?
Fix schema inference bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add new test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43243 from 
Hisoka-X/SPARK-45433-inference-mismatch-timestamp-one-row.

Authored-by: Jia Fan 
Signed-off-by: Max Gekk 
---
 .../org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala |  9 ++---
 .../org/apache/spark/sql/catalyst/json/JsonInferSchema.scala   |  8 +---
 .../apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala| 10 ++
 .../apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala  |  8 
 4 files changed, 29 insertions(+), 6 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
index 51586a0065e..ec01b56f9eb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
@@ -27,7 +27,7 @@ import org.apache.spark.sql.catalyst.expressions.ExprUtils
 import org.apache.spark.sql.catalyst.util.{DateFormatter, TimestampFormatter}
 import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
 import org.apache.spark.sql.errors.QueryExecutionErrors
-import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
 import org.apache.spark.sql.types._
 
 class CSVInferSchema(val options: CSVOptions) extends Serializable {
@@ -202,8 +202,11 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 // We can only parse the value as TimestampNTZType if it does not have 
zone-offset or
 // time-zone component and can be parsed with the timestamp formatter.
 // Otherwise, it is likely to be a timestamp with timezone.
-if (timestampNTZFormatter.parseWithoutTimeZoneOptional(field, 
false).isDefined) {
-  SQLConf.get.timestampType
+val timestampType = SQLConf.get.timestampType
+if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
+timestampType == TimestampNTZType) &&
+timestampNTZFormatter.parseWithoutTimeZoneOptional(field, 
false).isDefined) {
+  timestampType
 } else {
   tryParseTimestamp(field)
 }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
index 5385afe8c93..4123c5290b6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
@@ -32,7 +32,7 @@ import 
org.apache.spark.sql.catalyst.json.JacksonUtils.nextUntil
 import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
 import 

[spark] branch master updated: [SPARK-45483][CONNECT] Correct the function groups in connect.functions

2023-10-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8e70c39650f [SPARK-45483][CONNECT] Correct the function groups in 
connect.functions
8e70c39650f is described below

commit 8e70c39650f9abd329062a3651306da2d335aeb9
Author: Ruifeng Zheng 
AuthorDate: Wed Oct 11 08:37:48 2023 -0700

[SPARK-45483][CONNECT] Correct the function groups in connect.functions

### What changes were proposed in this pull request?
Correct the function groups in connect.functions

### Why are the changes needed?
to be consistent with 
https://github.com/apache/spark/commit/17da43803fd4c405fda00ffc2c7f4ff835ab24aa

### Does this PR introduce _any_ user-facing change?
yes, it will change the scaladoc (when it is available)

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #43309 from zhengruifeng/connect_function_scaladoc.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/sql/functions.scala | 318 +++--
 1 file changed, 166 insertions(+), 152 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
index 2adba11923d..9c5adca7e28 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
@@ -55,16 +55,28 @@ import org.apache.spark.util.SparkClassUtils
  * only `Column` but also other types such as a native string. The other 
variants currently exist
  * for historical reasons.
  *
- * @groupname udf_funcs UDF functions
+ * @groupname udf_funcs UDF, UDAF and UDT
  * @groupname agg_funcs Aggregate functions
- * @groupname datetime_funcs Date time functions
- * @groupname sort_funcs Sorting functions
- * @groupname normal_funcs Non-aggregate functions
- * @groupname math_funcs Math functions
+ * @groupname datetime_funcs Date and Timestamp functions
+ * @groupname sort_funcs Sort functions
+ * @groupname normal_funcs Normal functions
+ * @groupname math_funcs Mathematical functions
+ * @groupname bitwise_funcs Bitwise functions
+ * @groupname predicate_funcs Predicate functions
+ * @groupname conditional_funcs Conditional functions
+ * @groupname hash_funcs Hash functions
  * @groupname misc_funcs Misc functions
  * @groupname window_funcs Window functions
+ * @groupname generator_funcs Generator functions
  * @groupname string_funcs String functions
  * @groupname collection_funcs Collection functions
+ * @groupname array_funcs Array functions
+ * @groupname map_funcs Map functions
+ * @groupname struct_funcs Struct functions
+ * @groupname csv_funcs CSV functions
+ * @groupname json_funcs JSON functions
+ * @groupname xml_funcs XML functions
+ * @groupname url_funcs URL functions
  * @groupname partition_transforms Partition transform functions
  * @groupname Ungrouped Support functions for DataFrames
  *
@@ -101,6 +113,7 @@ object functions {
* Scala Symbol, it is converted into a [[Column]] also. Otherwise, a new 
[[Column]] is created
* to represent the literal value.
*
+   * @group normal_funcs
* @since 3.4.0
*/
   def lit(literal: Any): Column = {
@@ -145,7 +158,7 @@ object functions {
   /**
* Creates a struct with the given field names and values.
*
-   * @group normal_funcs
+   * @group struct_funcs
* @since 3.5.0
*/
   def named_struct(cols: Column*): Column = Column.fn("named_struct", cols: _*)
@@ -1610,7 +1623,7 @@ object functions {
   /**
* Creates a new array column. The input columns must all have the same data 
type.
*
-   * @group normal_funcs
+   * @group array_funcs
* @since 3.4.0
*/
   @scala.annotation.varargs
@@ -1619,7 +1632,7 @@ object functions {
   /**
* Creates a new array column. The input columns must all have the same data 
type.
*
-   * @group normal_funcs
+   * @group array_funcs
* @since 3.4.0
*/
   @scala.annotation.varargs
@@ -1632,7 +1645,7 @@ object functions {
* value1, key2, value2, ...). The key columns must all have the same data 
type, and can't be
* null. The value columns must all have the same data type.
*
-   * @group normal_funcs
+   * @group map_funcs
* @since 3.4.0
*/
   @scala.annotation.varargs
@@ -1642,7 +1655,7 @@ object functions {
* Creates a new map column. The array in the first column is used for keys. 
The array in the
* second column is used for values. All elements in the array for key 
should not be null.
*
-   * @group normal_funcs
+   * @group map_funcs
* @since 3.4.0
*/
   def map_from_arrays(keys: 

[spark] branch master updated: [SPARK-45499][CORE][TESTS] Replace `Reference#isEnqueued` with `Reference#refersTo(null)`

2023-10-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d1aff011fe1 [SPARK-45499][CORE][TESTS] Replace `Reference#isEnqueued` 
with `Reference#refersTo(null)`
d1aff011fe1 is described below

commit d1aff011fe1e78788fb5cb00d41a28e5925e4572
Author: yangjie01 
AuthorDate: Wed Oct 11 08:11:52 2023 -0700

[SPARK-45499][CORE][TESTS] Replace `Reference#isEnqueued` with 
`Reference#refersTo(null)`

### What changes were proposed in this pull request?
This pr just replaces `Reference#isEnqueued` with `Reference#refersTo` in `CompletionIteratorSuite`; the solution refers to


https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/lang/ref/Reference.java#L436-L454

```
 * @deprecated
 * This method was originally specified to test if a reference object has
 * been cleared and enqueued but was never implemented to do this test.
 * This method could be misused due to the inherent race condition
 * or without an associated {@code ReferenceQueue}.
 * An application relying on this method to release critical resources
 * could cause serious performance issue.
 * An application should use {@link ReferenceQueue} to reliably determine
 * what reference objects that have been enqueued or
 * {@link #refersTo(Object) refersTo(null)} to determine if this reference
 * object has been cleared.
 *
 * @return   {@code true} if and only if this reference object is
 *           in its associated queue (if any).
 */
@Deprecated(since="16")
public boolean isEnqueued() {
    return (this.queue == ReferenceQueue.ENQUEUED);
}
```

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43325 from LuciferYang/SPARK-45499.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 .../test/scala/org/apache/spark/util/CompletionIteratorSuite.scala  | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git 
a/core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala 
b/core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala
index 29421f7aa9e..297e4fd53ab 100644
--- a/core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala
@@ -57,13 +57,13 @@ class CompletionIteratorSuite extends SparkFunSuite {
 sub = null
 iter.toArray
 
-for (_ <- 1 to 100 if !ref.isEnqueued) {
+for (_ <- 1 to 100 if !ref.refersTo(null)) {
   System.gc()
-  if (!ref.isEnqueued) {
+  if (!ref.refersTo(null)) {
 Thread.sleep(10)
   }
 }
-assert(ref.isEnqueued)
+assert(ref.refersTo(null))
 assert(refQueue.poll() === ref)
   }
 }





[spark] branch master updated: [SPARK-42881][SQL] Codegen Support for get_json_object

2023-10-11 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c2525308330 [SPARK-42881][SQL] Codegen Support for get_json_object
c2525308330 is described below

commit c252530833097759b1f943ff89b05f22025f0dd0
Author: panbingkun 
AuthorDate: Wed Oct 11 17:42:48 2023 +0300

[SPARK-42881][SQL] Codegen Support for get_json_object

### What changes were proposed in this pull request?
This PR adds codegen support for `get_json_object`.

### Why are the changes needed?
Improve codegen coverage and performance.
GitHub benchmark data (https://github.com/panbingkun/spark/actions/runs/4497396473/jobs/7912952710):
https://user-images.githubusercontent.com/15246973/227117793-bab38c42-dcc1-46de-a689-25a87b8f3561.png

Local benchmark data:
https://user-images.githubusercontent.com/15246973/227098745-9b360e60-fe84-4419-8b7d-073a0530816a.png
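
For context, a small usage sketch of the expression that now benefits from codegen (a hypothetical local session, not taken from the patch; `get_json_object` itself is an existing Spark SQL function):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

object GetJsonObjectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("get_json_object").getOrCreate()
    import spark.implicits._

    val df = Seq("""{"name":"Alice","age":23}""", """{"name":"Bob","age":16}""").toDF("js")

    // Both forms go through GetJsonObject; with this change the expression is
    // compiled by whole-stage codegen instead of falling back to interpreted eval.
    df.select(get_json_object($"js", "$.name").alias("name")).show()
    spark.sql("""SELECT get_json_object('{"a": 1}', '$.a')""").show()

    spark.stop()
  }
}
```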

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add new UT.
Pass GA.

Closes #40506 from panbingkun/json_code_gen.

Authored-by: panbingkun 
Signed-off-by: Max Gekk 
---
 .../sql/catalyst/expressions/jsonExpressions.scala | 121 +---
 sql/core/benchmarks/JsonBenchmark-results.txt  | 127 +++--
 .../org/apache/spark/sql/JsonFunctionsSuite.scala  |  28 +
 .../execution/datasources/json/JsonBenchmark.scala |  15 ++-
 4 files changed, 208 insertions(+), 83 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index e7df542ddab..04bc457b66a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -28,7 +28,8 @@ import com.fasterxml.jackson.core.json.JsonReadFeature
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
 import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.DataTypeMismatch
-import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, 
CodeGenerator, CodegenFallback, ExprCode}
+import org.apache.spark.sql.catalyst.expressions.codegen.Block.BlockHelper
 import org.apache.spark.sql.catalyst.json._
 import org.apache.spark.sql.catalyst.trees.TreePattern.{JSON_TO_STRUCT, 
TreePattern}
 import org.apache.spark.sql.catalyst.util._
@@ -125,13 +126,7 @@ private[this] object SharedFactory {
   group = "json_funcs",
   since = "1.5.0")
 case class GetJsonObject(json: Expression, path: Expression)
-  extends BinaryExpression with ExpectsInputTypes with CodegenFallback {
-
-  import com.fasterxml.jackson.core.JsonToken._
-
-  import PathInstruction._
-  import SharedFactory._
-  import WriteStyle._
+  extends BinaryExpression with ExpectsInputTypes {
 
   override def left: Expression = json
   override def right: Expression = path
@@ -140,18 +135,114 @@ case class GetJsonObject(json: Expression, path: 
Expression)
   override def nullable: Boolean = true
   override def prettyName: String = "get_json_object"
 
-  @transient private lazy val parsedPath = 
parsePath(path.eval().asInstanceOf[UTF8String])
+  @transient
+  private lazy val evaluator = if (path.foldable) {
+new GetJsonObjectEvaluator(path.eval().asInstanceOf[UTF8String])
+  } else {
+new GetJsonObjectEvaluator()
+  }
 
   override def eval(input: InternalRow): Any = {
-val jsonStr = json.eval(input).asInstanceOf[UTF8String]
+evaluator.setJson(json.eval(input).asInstanceOf[UTF8String])
+if (!path.foldable) {
+  evaluator.setPath(path.eval(input).asInstanceOf[UTF8String])
+}
+evaluator.evaluate()
+  }
+
+  protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+val evaluatorClass = classOf[GetJsonObjectEvaluator].getName
+val initEvaluator = path.foldable match {
+  case true if path.eval() != null =>
+val cachedPath = path.eval().asInstanceOf[UTF8String]
+val refCachedPath = ctx.addReferenceObj("cachedPath", cachedPath)
+s"new $evaluatorClass($refCachedPath)"
+  case _ => s"new $evaluatorClass()"
+}
+val evaluator = ctx.addMutableState(evaluatorClass, "evaluator",
+  v => s"""$v = $initEvaluator;""", forceInline = true)
+
+val jsonEval = json.genCode(ctx)
+val pathEval = path.genCode(ctx)
+
+val setJson =
+  s"""
+ |if (${jsonEval.isNull}) {
+ |  $evaluator.setJson(null);
+ |} else {
+ |  $evaluator.setJson(${jsonEval.value});
+ |}
+ 

[spark] branch master updated: [SPARK-45467][CORE] Replace `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass`

2023-10-11 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new acd5dc499d1 [SPARK-45467][CORE] Replace `Proxy.getProxyClass()` with 
`Proxy.newProxyInstance().getClass`
acd5dc499d1 is described below

commit acd5dc499d139ce8b2571a69beab0f971947adb4
Author: YangJie 
AuthorDate: Wed Oct 11 08:49:09 2023 -0500

[SPARK-45467][CORE] Replace `Proxy.getProxyClass()` with 
`Proxy.newProxyInstance().getClass`

### What changes were proposed in this pull request?
This PR replaces `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass` to clean up deprecated API usage, referring to


https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/lang/reflect/Proxy.java#L376-L391

```
 * @deprecated Proxy classes generated in a named module are encapsulated
 *      and not accessible to code outside its module.
 *      {@link Constructor#newInstance(Object...) Constructor.newInstance}
 *      will throw {@code IllegalAccessException} when it is called on
 *      an inaccessible proxy class.
 *      Use {@link #newProxyInstance(ClassLoader, Class[], InvocationHandler)}
 *      to create a proxy instance instead.
 *
 * @see Package and Module Membership of Proxy Class
 * @revised 9
 */
@Deprecated
@CallerSensitive
public static Class<?> getProxyClass(ClassLoader loader,
                                     Class<?>... interfaces)
    throws IllegalArgumentException
```

Since the `invoke` method doesn't actually need to be called in the current scenario, but `newProxyInstance` does not accept a null `InvocationHandler`, a new `DummyInvocationHandler` has been added as follows:

```
private[spark] object DummyInvocationHandler extends InvocationHandler {
  override def invoke(proxy: Any, method: Method, args: Array[AnyRef]): 
AnyRef = {
throw new UnsupportedOperationException("Not implemented")
  }
}
```
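
A minimal sketch of the replacement pattern outside Spark (the object and method names below are illustrative, not the actual `JavaDeserializationStream` code):

```
import java.lang.reflect.{InvocationHandler, Method, Proxy}

object ProxyClassSketch {
  // The handler is never invoked; it only satisfies newProxyInstance's signature.
  private object NoOpHandler extends InvocationHandler {
    override def invoke(proxy: Any, method: Method, args: Array[AnyRef]): AnyRef =
      throw new UnsupportedOperationException("Not implemented")
  }

  def resolveProxyClass(loader: ClassLoader, interfaceNames: Seq[String]): Class[_] = {
    val resolved = interfaceNames.map(Class.forName(_, false, loader)).toArray
    // Previously: Proxy.getProxyClass(loader, resolved: _*)  (deprecated since Java 9)
    Proxy.newProxyInstance(loader, resolved, NoOpHandler).getClass
  }

  def main(args: Array[String]): Unit = {
    val cls = resolveProxyClass(getClass.getClassLoader, Seq("java.lang.Runnable"))
    println(cls.getName) // e.g. jdk.proxy1.$Proxy0
  }
}
```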

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43291 from LuciferYang/SPARK-45467.

Lead-authored-by: YangJie 
Co-authored-by: yangjie01 
Signed-off-by: Sean Owen 
---
 .../main/scala/org/apache/spark/serializer/JavaSerializer.scala  | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git 
a/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala 
b/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
index 95d2bdc39e1..856e639fcd9 100644
--- a/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
+++ b/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
@@ -18,6 +18,7 @@
 package org.apache.spark.serializer
 
 import java.io._
+import java.lang.reflect.{InvocationHandler, Method, Proxy}
 import java.nio.ByteBuffer
 
 import scala.reflect.ClassTag
@@ -79,7 +80,7 @@ private[spark] class JavaDeserializationStream(in: 
InputStream, loader: ClassLoa
   // scalastyle:off classforname
   val resolved = ifaces.map(iface => Class.forName(iface, false, loader))
   // scalastyle:on classforname
-  java.lang.reflect.Proxy.getProxyClass(loader, resolved: _*)
+  Proxy.newProxyInstance(loader, resolved, DummyInvocationHandler).getClass
 }
 
   }
@@ -88,6 +89,12 @@ private[spark] class JavaDeserializationStream(in: 
InputStream, loader: ClassLoa
   def close(): Unit = { objIn.close() }
 }
 
+private[spark] object DummyInvocationHandler extends InvocationHandler {
+  override def invoke(proxy: Any, method: Method, args: Array[AnyRef]): AnyRef 
= {
+throw new UnsupportedOperationException("Not implemented")
+  }
+}
+
 private object JavaDeserializationStream {
 
   val primitiveMappings = Map[String, Class[_]](


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (11af786b35c -> 97218051308)

2023-10-11 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 11af786b35c [SPARK-45451][SQL] Make the default storage level of 
dataset cache configurable
 add 97218051308 [SPARK-45496][CORE][DSTREAM] Fix the compilation warning 
related to `other-pure-statement`

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala | 2 +-
 pom.xml   | 3 ---
 project/SparkBuild.scala  | 4 
 .../org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala  | 2 +-
 4 files changed, 2 insertions(+), 9 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-45451][SQL] Make the default storage level of dataset cache configurable

2023-10-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 11af786b35c [SPARK-45451][SQL] Make the default storage level of 
dataset cache configurable
11af786b35c is described below

commit 11af786b35cabe6d139dd9763ccf1af9ceb7eb9f
Author: ulysses-you 
AuthorDate: Wed Oct 11 20:51:22 2023 +0800

[SPARK-45451][SQL] Make the default storage level of dataset cache 
configurable

### What changes were proposed in this pull request?

This PR adds a new config `spark.sql.defaultCacheStorageLevel`, so that people can use `set spark.sql.defaultCacheStorageLevel=xxx` to change the default storage level of `dataset.cache`.
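
A hedged usage sketch (a hypothetical local session; the config name and its allowed values, the `StorageLevelMapper` names such as `MEMORY_AND_DISK` or `DISK_ONLY`, come from this patch):

```
import org.apache.spark.sql.SparkSession

object DefaultCacheLevelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cache-level").getOrCreate()

    // Applies to dataset.cache(), catalog.cacheTable() and `CACHE TABLE t`.
    spark.conf.set("spark.sql.defaultCacheStorageLevel", "DISK_ONLY")
    // Equivalent SQL form: spark.sql("SET spark.sql.defaultCacheStorageLevel=DISK_ONLY")

    val df = spark.range(1000).toDF("id")
    df.cache()   // now persisted with DISK_ONLY instead of the default MEMORY_AND_DISK
    df.count()   // materializes the cache

    spark.stop()
  }
}
```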

### Why are the changes needed?

Most people use the default storage level, so this PR makes it easy to change the storage level without touching code.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

add test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43259 from ulysses-you/cache.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/internal/SQLConf.scala| 13 
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  5 +--
 .../execution/datasources/v2/CacheTableExec.scala  | 22 +
 .../apache/spark/sql/internal/CatalogImpl.scala|  4 +--
 .../org/apache/spark/sql/CachedTableSuite.scala| 37 ++
 5 files changed, 61 insertions(+), 20 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 65d2e6136e9..12ec9e911d3 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -44,6 +44,7 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import 
org.apache.spark.sql.connector.catalog.CatalogManager.SESSION_CATALOG_NAME
 import org.apache.spark.sql.errors.{QueryCompilationErrors, 
QueryExecutionErrors}
 import org.apache.spark.sql.types.{AtomicType, TimestampNTZType, TimestampType}
+import org.apache.spark.storage.{StorageLevel, StorageLevelMapper}
 import org.apache.spark.unsafe.array.ByteArrayMethods
 import org.apache.spark.util.Utils
 
@@ -1563,6 +1564,15 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val DEFAULT_CACHE_STORAGE_LEVEL = 
buildConf("spark.sql.defaultCacheStorageLevel")
+.doc("The default storage level of `dataset.cache()`, 
`catalog.cacheTable()` and " +
+  "sql query `CACHE TABLE t`.")
+.version("4.0.0")
+.stringConf
+.transform(_.toUpperCase(Locale.ROOT))
+.checkValues(StorageLevelMapper.values.map(_.name()).toSet)
+.createWithDefault(StorageLevelMapper.MEMORY_AND_DISK.name())
+
   val CROSS_JOINS_ENABLED = buildConf("spark.sql.crossJoin.enabled")
 .internal()
 .doc("When false, we will throw an error if a query contains a cartesian 
product without " +
@@ -5027,6 +5037,9 @@ class SQLConf extends Serializable with Logging with 
SqlApiConf {
 
   def groupByAliases: Boolean = getConf(GROUP_BY_ALIASES)
 
+  def defaultCacheStorageLevel: StorageLevel =
+StorageLevel.fromString(getConf(DEFAULT_CACHE_STORAGE_LEVEL))
+
   def crossJoinEnabled: Boolean = getConf(SQLConf.CROSS_JOINS_ENABLED)
 
   override def sessionLocalTimeZone: String = 
getConf(SQLConf.SESSION_LOCAL_TIMEZONE)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 0cc037b157e..5079cfcca9d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -3798,10 +3798,7 @@ class Dataset[T] private[sql](
* @group basic
* @since 1.6.0
*/
-  def persist(): this.type = {
-sparkSession.sharedState.cacheManager.cacheQuery(this)
-this
-  }
+  def persist(): this.type = 
persist(sparkSession.sessionState.conf.defaultCacheStorageLevel)
 
   /**
* Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala
index 8c14b5e3707..1744df83033 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala
@@ -38,25 +38,19 @@ trait BaseCacheTableExec extends LeafV2CommandExec {
 
   override def run(): Seq[InternalRow] = {
 val storageLevelKey = 

[spark] branch master updated (e1a7b84f47b -> ae112e4279f)

2023-10-11 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e1a7b84f47b [SPARK-45397][ML][CONNECT] Add array assembler feature 
transformer
 add ae112e4279f [SPARK-45116][SQL] Add some comment for param of 
JdbcDialect `createTable`

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala| 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (8394ebb52b9 -> e1a7b84f47b)

2023-10-11 Thread weichenxu123
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8394ebb52b9 [SPARK-45469][CORE][SQL][CONNECT][PYTHON] Replace 
`toIterator` with `iterator` for `IterableOnce`
 add e1a7b84f47b [SPARK-45397][ML][CONNECT] Add array assembler feature 
transformer

No new revisions were added by this update.

Summary of changes:
 .../docs/source/reference/pyspark.ml.connect.rst   |   1 +
 python/pyspark/ml/connect/base.py  |   6 +-
 python/pyspark/ml/connect/feature.py   | 156 -
 python/pyspark/ml/connect/util.py  |   6 +-
 python/pyspark/ml/param/_shared_params_code_gen.py |   7 +
 python/pyspark/ml/param/shared.py  |  22 +++
 .../ml/tests/connect/test_legacy_mode_feature.py   |  48 +++
 7 files changed, 238 insertions(+), 8 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (0c1e9a5b19c -> 8394ebb52b9)

2023-10-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0c1e9a5b19c [SPARK-45500][CORE][WEBUI] Show the number of abnormally 
completed drivers in MasterPage
 add 8394ebb52b9 [SPARK-45469][CORE][SQL][CONNECT][PYTHON] Replace 
`toIterator` with `iterator` for `IterableOnce`

No new revisions were added by this update.

Summary of changes:
 .../jvm/src/main/scala/org/apache/spark/sql/functions.scala   | 2 +-
 .../src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala  | 2 +-
 core/src/main/scala/org/apache/spark/util/collection/Utils.scala  | 2 +-
 .../apache/spark/sql/catalyst/expressions/objects/objects.scala   | 4 ++--
 .../main/scala/org/apache/spark/sql/execution/GenerateExec.scala  | 8 
 .../org/apache/spark/sql/execution/arrow/ArrowConverters.scala| 2 +-
 .../spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala   | 2 +-
 .../spark/sql/execution/python/BatchEvalPythonUDTFExec.scala  | 2 +-
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala  | 2 +-
 9 files changed, 13 insertions(+), 13 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-45500][CORE][WEBUI] Show the number of abnormally completed drivers in MasterPage

2023-10-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0c1e9a5b19c [SPARK-45500][CORE][WEBUI] Show the number of abnormally 
completed drivers in MasterPage
0c1e9a5b19c is described below

commit 0c1e9a5b19c4cfb77c86f3871998cf2673260b06
Author: Dongjoon Hyun 
AuthorDate: Wed Oct 11 01:46:29 2023 -0700

[SPARK-45500][CORE][WEBUI] Show the number of abnormally completed drivers 
in MasterPage

### What changes were proposed in this pull request?

This PR aims to show the number of abnormally completed drivers in 
MasterPage.

### Why are the changes needed?

In the `Completed Drivers` table, there are various exit states.

https://github.com/apache/spark/assets/9700541/ff0b33f5-c546-42e7-870c-8323e2eefded

We had better show the abnormally completed drivers at the top of the page.

**BEFORE**
```
Drivers: 0 Running (0 Waiting), 7 Completed
```

**AFTER**
```
Drivers: 0 Running (0 Waiting), 7 Completed (1 Killed, 4 Failed, 0 Error)
```
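
The new counts in the summary line simply group completed drivers by their final state; a standalone sketch of that counting pattern (the enumeration and case class below are hypothetical stand-ins, not Spark's own driver model):

```
object DriverState extends Enumeration {
  val SUBMITTED, RUNNING, FINISHED, KILLED, FAILED, ERROR = Value
}
final case class CompletedDriver(id: String, state: DriverState.Value)

object DriverSummarySketch {
  def summarize(completed: Seq[CompletedDriver]): String = {
    val killed = completed.count(_.state == DriverState.KILLED)
    val failed = completed.count(_.state == DriverState.FAILED)
    val error  = completed.count(_.state == DriverState.ERROR)
    s"${completed.length} Completed ($killed Killed, $failed Failed, $error Error)"
  }

  def main(args: Array[String]): Unit = {
    val drivers = Seq(
      CompletedDriver("driver-1", DriverState.FINISHED),
      CompletedDriver("driver-2", DriverState.KILLED),
      CompletedDriver("driver-3", DriverState.FAILED))
    println(summarize(drivers)) // 3 Completed (1 Killed, 1 Failed, 0 Error)
  }
}
```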

https://github.com/apache/spark/assets/9700541/94deab1f-b9f7-4e5b-8284-aaac4f7520df

### Does this PR introduce _any_ user-facing change?

Yes, this is a new UI field. However, since this is UI, there will be no 
technical issues.

### How was this patch tested?

Manual build Spark and check UI.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43328 from dongjoon-hyun/SPARK-45500.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala   | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
index cc4370ad02e..5c1887be5b8 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
@@ -153,7 +153,11 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") {
               <li><strong>Drivers:</strong>
                 {state.activeDrivers.length} Running
                 ({state.activeDrivers.count(_.state == DriverState.SUBMITTED)} Waiting),
-                {state.completedDrivers.length} Completed </li>
+                {state.completedDrivers.length} Completed
+                ({state.completedDrivers.count(_.state == DriverState.KILLED)} Killed,
+                {state.completedDrivers.count(_.state == DriverState.FAILED)} Failed,
+                {state.completedDrivers.count(_.state == DriverState.ERROR)} Error)
+              </li>
               <li><strong>Status:</strong> {state.status}</li>
 
   


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-45480][SQL][UI] Selectable Spark Plan Node on UI

2023-10-11 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 305db078ff2 [SPARK-45480][SQL][UI] Selectable Spark Plan Node on UI
305db078ff2 is described below

commit 305db078ff25c3c29b222f197b3293dd84db3045
Author: Kent Yao 
AuthorDate: Wed Oct 11 15:49:14 2023 +0800

[SPARK-45480][SQL][UI] Selectable Spark Plan Node on UI

### What changes were proposed in this pull request?

This PR introduces a selectable animation for Spark SQL plan nodes on the UI, which lights up the selected node and its linked nodes and edges.

### Why are the changes needed?

Better UX for SQL plan visualization and debugging. Especially for large 
queries, users can now concentrate on the current node and its nearest 
neighbors to get a better understanding of node lineage.

### Does this PR introduce _any_ user-facing change?

Yes, let's see the video.

### How was this patch tested?


https://github.com/apache/spark/assets/8326978/f5ba884c-acce-46b8-8568-3ead55c91d4f

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43307 from yaooqinn/SPARK-45480.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
---
 .../sql/execution/ui/static/spark-sql-viz.css  | 18 +
 .../spark/sql/execution/ui/static/spark-sql-viz.js | 43 ++
 2 files changed, 61 insertions(+)

diff --git 
a/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.css
 
b/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.css
index dbdbf9fbf57..d6a498e9387 100644
--- 
a/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.css
+++ 
b/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.css
@@ -57,3 +57,21 @@
 .job-url {
   word-wrap: break-word;
 }
+
+#plan-viz-graph svg g.node rect.selected {
+  fill: #E25A1CFF;
+  stroke: #317EACFF;
+  stroke-width: 2px;
+}
+
+#plan-viz-graph svg g.node rect.linked {
+  fill: #FFC106FF;
+  stroke: #317EACFF;
+  stroke-width: 2px;
+}
+
+#plan-viz-graph svg path.linked {
+  fill: #317EACFF;
+  stroke: #317EACFF;
+  stroke-width: 2px;
+}
diff --git 
a/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.js
 
b/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.js
index 96a7a7a3cc0..d4cc45a1639 100644
--- 
a/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.js
+++ 
b/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.js
@@ -47,6 +47,7 @@ function renderPlanViz() {
 .attr("ry", "5");
 
   setupLayoutForSparkPlanCluster(g, svg);
+  setupSelectionForSparkPlanNode(g);
   setupTooltipForSparkPlanNode(g);
   resizeSvg(svg);
   postprocessForAdditionalMetrics();
@@ -269,3 +270,45 @@ function togglePlanViz() { // eslint-disable-line 
no-unused-vars
 planVizContainer().style("display", "none");
   }
 }
+
+/*
+ * Light up the selected node and its linked nodes and edges.
+ */
+function setupSelectionForSparkPlanNode(g) {
+  const linkedNodes = new Map();
+  const linkedEdges = new Map();
+
+  g.edges().forEach(function (e) {
+const edge = g.edge(e);
+const from = g.node(e.v);
+const to = g.node(e.w);
+collectLinks(linkedNodes, from.id, to.id);
+collectLinks(linkedNodes, to.id, from.id);
+collectLinks(linkedEdges, from.id, edge.arrowheadId);
+collectLinks(linkedEdges, to.id, edge.arrowheadId);
+  });
+
+  linkedNodes.forEach((linkedNodes, selectNode) => {
+d3.select("#" + selectNode).on("click", () => {
+  planVizContainer().selectAll(".selected").classed("selected", false);
+  planVizContainer().selectAll(".linked").classed("linked", false);
+  d3.select("#" + selectNode + " rect").classed("selected", true);
+  linkedNodes.forEach((linkedNode) => {
+d3.select("#" + linkedNode + " rect").classed("linked", true);
+  });
+  linkedEdges.get(selectNode).forEach((linkedEdge) => {
+const arrowHead = d3.select("#" + linkedEdge + " path");
+arrowHead.classed("linked", true);
+const arrowShaft = 
$(arrowHead.node()).parents("g.edgePath").children("path");
+arrowShaft.addClass("linked");
+  });
+});
+  });
+}
+
+function collectLinks(map, key, value) {
+  if (!map.has(key)) {
+map.set(key, new Set());
+  }
+  map.get(key).add(value);
+}


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-45497][K8S] Add a symbolic link file `spark-examples.jar` in K8s Docker images

2023-10-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 64ac59d7129 [SPARK-45497][K8S] Add a symbolic link file 
`spark-examples.jar` in K8s Docker images
64ac59d7129 is described below

commit 64ac59d71296e631ded97b332660d3db5675623a
Author: Dongjoon Hyun 
AuthorDate: Wed Oct 11 00:35:38 2023 -0700

[SPARK-45497][K8S] Add a symbolic link file `spark-examples.jar` in K8s 
Docker images

### What changes were proposed in this pull request?

This PR aims to add a symbolic link file, `spark-examples.jar`, in the 
example jar directory.

```
$ docker run -it --rm spark:latest ls -al /opt/spark/examples/jars  | tail 
-n6
total 1620
drwxr-xr-x 1 root root4096 Oct 11 04:37 .
drwxr-xr-x 1 root root4096 Sep  9 02:08 ..
-rw-r--r-- 1 root root   78803 Sep  9 02:08 scopt_2.12-3.7.1.jar
-rw-r--r-- 1 root root 1564255 Sep  9 02:08 spark-examples_2.12-3.5.0.jar
lrwxrwxrwx 1 root root  29 Oct 11 04:37 spark-examples.jar -> 
spark-examples_2.12-3.5.0.jar
```

### Why are the changes needed?

Like the PySpark example (`pi.py`), we can submit the example jars without specifying the version numbers, which was painful before.
```
bin/spark-submit \
--master k8s://$K8S_MASTER \
--deploy-mode cluster \
...
--class org.apache.spark.examples.SparkPi \
local:///opt/spark/examples/jars/spark-examples.jar 1
```

The following is the driver pod log.
```
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit ...
--deploy-mode client
--properties-file /opt/spark/conf/spark.properties
--class org.apache.spark.examples.SparkPi
local:///opt/spark/examples/jars/spark-examples.jar 1
Files  local:///opt/spark/examples/jars/spark-examples.jar from 
/opt/spark/examples/jars/spark-examples.jar to 
/opt/spark/work-dir/./spark-examples.jar
```

### Does this PR introduce _any_ user-facing change?

No, this is an additional file.

### How was this patch tested?

Manually build the docker image and do `ls`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43324 from dongjoon-hyun/SPARK-45497.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile  | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
index 02559b8de2d..b80e72c768c 100644
--- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
+++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
@@ -49,6 +49,7 @@ COPY sbin /opt/spark/sbin
 COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
 COPY kubernetes/dockerfiles/spark/decom.sh /opt/
 COPY examples /opt/spark/examples
+RUN ln -s $(basename $(ls /opt/spark/examples/jars/spark-examples_*.jar)) 
/opt/spark/examples/jars/spark-examples.jar
 COPY kubernetes/tests /opt/spark/tests
 COPY data /opt/spark/data
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org