[spark] branch master updated: [MINOR][PYTHON][DOCS] Correct the type hint for `from_csv`

2022-12-11 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 39528336b86 [MINOR][PYTHON][DOCS] Correct the type hint for `from_csv`
39528336b86 is described below

commit 39528336b8672aa544c06a97d95d04d5067c4707
Author: Ruifeng Zheng 
AuthorDate: Mon Dec 12 16:45:36 2022 +0900

[MINOR][PYTHON][DOCS] Correct the type hint for `from_csv`

### What changes were proposed in this pull request?
Correct the type hint for `from_csv`

### Why are the changes needed?
`from_csv` actually does not support `StructType` for now


https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L8084-L8090
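
For reference, a minimal sketch (illustrative data and session setup, assuming a recent PySpark) of the two schema forms `from_csv` does accept, a DDL string or a `Column`:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_csv, schema_of_csv, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1,abc",)], ["value"])

# Schema as a DDL-formatted string; a StructType is not accepted here, hence the corrected hint.
df.select(from_csv(df.value, "a INT, b STRING").alias("csv")).show()

# Schema as a Column, e.g. produced by schema_of_csv.
df.select(from_csv(df.value, schema_of_csv(lit("1,abc"))).alias("csv")).show()
```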

### Does this PR introduce _any_ user-facing change?
the generated doc

### How was this patch tested?
existing UT

Closes #39030 from zhengruifeng/py_from_csv_schema.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/functions.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 4d39424762b..de540c62499 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -8036,7 +8036,7 @@ def sequence(
 
 def from_csv(
 col: "ColumnOrName",
-schema: Union[StructType, Column, str],
+schema: Union[Column, str],
 options: Optional[Dict[str, str]] = None,
 ) -> Column:
 """





[spark] branch master updated (cd2f78657ce -> b8a91da0f87)

2022-12-11 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from cd2f78657ce [SPARK-41463][SQL][TESTS] Ensure error class names contain 
only capital letters, numbers and underscores
 add b8a91da0f87 [SPARK-41484][CONNECT][PYTHON] Implement `collection` 
functions: E~M

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/functions.py| 942 +
 .../sql/tests/connect/test_connect_function.py | 274 ++
 2 files changed, 1216 insertions(+)
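
For context, a hedged sketch of how a couple of DataFrame collection functions in roughly this alphabetical range are used from classic PySpark (the exact set covered by this commit is listed in the changed files above; the data here is illustrative):
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2, 3], {"a": 1})], ["arr", "m"])

df.select(
    F.element_at("arr", 2).alias("second_element"),  # 1-based indexing -> 2
    F.map_keys("m").alias("keys"),                   # -> ["a"]
).show()
```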





[spark] branch master updated (5d52bb36d3b -> cd2f78657ce)

2022-12-11 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5d52bb36d3b [SPARK-41486][SQL][TESTS] Upgrade `MySQL` docker image to 
8.0.31 to support `ARM64` test
 add cd2f78657ce [SPARK-41463][SQL][TESTS] Ensure error class names contain 
only capital letters, numbers and underscores

No new revisions were added by this update.

Summary of changes:
 .../test/scala/org/apache/spark/SparkThrowableSuite.scala| 12 
 1 file changed, 12 insertions(+)





[spark] branch master updated: [SPARK-41486][SQL][TESTS] Upgrade `MySQL` docker image to 8.0.31 to support `ARM64` test

2022-12-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5d52bb36d3b [SPARK-41486][SQL][TESTS] Upgrade `MySQL` docker image to 
8.0.31 to support `ARM64` test
5d52bb36d3b is described below

commit 5d52bb36d3ba7e619169289f8a576db5ea10990f
Author: Dongjoon Hyun 
AuthorDate: Sun Dec 11 22:23:17 2022 -0800

[SPARK-41486][SQL][TESTS] Upgrade `MySQL` docker image to 8.0.31 to support 
`ARM64` test

### What changes were proposed in this pull request?

This PR aims to upgrade the MySQL JDBC integration test Docker image from 5.7.36 to 8.0.31 to support ARM64 testing.

### Why are the changes needed?

Since MySQL 8.0, MySQL provides multi-arch images. After this upgrade, we can run the MySQL JDBC integration tests in an ARM64 environment.

![Screenshot 2022-12-11 at 8 31 48 PM](https://user-images.githubusercontent.com/9700541/206961236-c958e6fc-1a2d-40cf-bf53-6f72a312c844.png)
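
For context, the suites above exercise Spark's JDBC data source against this MySQL image; a minimal PySpark read along those lines might look roughly like the following (URL, table, and credentials are placeholders, and the MySQL Connector/J jar must be on the classpath):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details; adjust to the container started by the integration tests.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/testdb")
    .option("dbtable", "people")
    .option("user", "root")
    .option("password", "secret")
    .load()
)
df.printSchema()
```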

### Does this PR introduce _any_ user-facing change?

No, this is a test code PR.

### How was this patch tested?

- Pass GitHub Action CI for the AMD64 architecture.
- Manually run MySQLIntegrationSuites on the ARM64 architecture.

I manually ran this PR on Apple Silicon.
```
$ ./build/sbt -Pdocker-integration-tests "testOnly *MySQLIntegrationSuite"
[info] MySQLIntegrationSuite:
[info] - SPARK-33034: ALTER TABLE ... add new columns (4 seconds, 766 
milliseconds)
[info] - SPARK-33034: ALTER TABLE ... drop column (389 milliseconds)
[info] - SPARK-33034: ALTER TABLE ... update column type (285 milliseconds)
[info] - SPARK-33034: ALTER TABLE ... rename column (206 milliseconds)
[info] - SPARK-33034: ALTER TABLE ... update column nullability (167 
milliseconds)
[info] - CREATE TABLE with table comment (98 milliseconds)
[info] - CREATE TABLE with table property (113 milliseconds)
[info] - SPARK-36895: Test INDEX Using SQL (559 milliseconds)
[info] - SPARK-37038: Test TABLESAMPLE (0 milliseconds)
[info] - scan with aggregate push-down: VAR_POP without DISTINCT (3 
seconds, 718 milliseconds)
[info] - scan with aggregate push-down: VAR_SAMP without DISTINCT (602 
milliseconds)
[info] - scan with aggregate push-down: STDDEV_POP without DISTINCT (624 
milliseconds)
[info] - scan with aggregate push-down: STDDEV_SAMP without DISTINCT (390 
milliseconds)
[info] MySQLIntegrationSuite:
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for mllib / Test / testOnly
[info] - Basic test (432 milliseconds)
[info] - Numeric types (324 milliseconds)
[info] - Date types (871 milliseconds)
[info] - String types (302 milliseconds)
[info] - Basic write test (1 second, 341 milliseconds)
[info] - query JDBC option (334 milliseconds)
[info] Run completed in 1 minute, 15 seconds.
[info] Total number of tests run: 19
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

$ ./build/sbt -Pdocker-integration-tests "testOnly *MySQLNamespaceSuite"
[info] MySQLNamespaceSuite:
[info] - listNamespaces: basic behavior (1 second, 367 milliseconds)
[info] - Drop namespace (231 milliseconds)
[info] - Create or remove comment of namespace unsupported (186 
milliseconds)
[info] Run completed in 33 seconds, 456 milliseconds.
[info] Total number of tests run: 3
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 3, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

Closes #39029 from dongjoon-hyun/SPARK-41486.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala | 6 +++---
 .../scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  | 6 +++---
 .../scala/org/apache/spark/sql/jdbc/v2/MySQLNamespaceSuite.scala| 6 +++---
 .../scala/org/apache/spark/sql/jdbc/v2/V2JDBCNamespaceTest.scala| 3 ++-
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
index 2bde50d01eb..bc202b1b832 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
@@ -26,9 +26,9 @@ import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._
 import org.apache.spark.tags.DockerTest
 
 /**
- * To run this test suite for a specific version (e.g., 

[spark] branch branch-3.3 updated: [SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen

2022-12-11 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 1c6cb350051 [SPARK-41187][CORE] LiveExecutor MemoryLeak in 
AppStatusListener when ExecutorLost happen
1c6cb350051 is described below

commit 1c6cb350051e61675b96bc1de8b9fc542bc53ce7
Author: yuanyimeng 
AuthorDate: Sun Dec 11 22:49:07 2022 -0600

[SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when 
ExecutorLost happen

### What changes were proposed in this pull request?
Ignore `SparkListenerTaskEnd` events with reason "Resubmitted" in AppStatusListener to avoid a memory leak.

### Why are the changes needed?
For a long-running Spark Thrift Server, LiveExecutor entries accumulate in the deadExecutors HashMap and slow down processing of the message event queue.
For every task, a `SparkListenerTaskStart` event and a `SparkListenerTaskEnd` event are normally sent out as a pair. But when an executor is lost, events are sent out as in the following steps:

a) A pair of task start and task end events was fired for the task (let us call it Tr).
b) When the executor that ran Tr was lost while the stage was still running, a task end event with reason `Resubmitted` was fired for Tr.
c) Subsequently, a new task start and task end event were fired for the retry of Tr.

The processing of the `Resubmitted` task end events in AppStatusListener can lead to a negative `LiveStage.activeTasks`, since there is no corresponding `SparkListenerTaskStart` event for each of them. The negative activeTasks makes the stage always remain in the live stage list, as it can never meet the condition activeTasks == 0. This in turn causes the dead executor to never be cleaned up if that live stage's submissionTime is less than the dead executor's removeTime (see isExecutor [...]

Check [SPARK-41187](https://issues.apache.org/jira/browse/SPARK-41187) for evidence.
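
To illustrate the accounting problem, here is a simplified sketch in Python (not Spark's actual Scala code):
```python
# Each task normally contributes one start and one end event, so the active-task
# counter returns to zero and the stage can eventually be cleaned up. A "Resubmitted"
# end event has no matching start event, which drives the counter negative.
active_tasks = 0

def on_task_start():
    global active_tasks
    active_tasks += 1

def on_task_end(reason):
    global active_tasks
    active_tasks -= 1  # before the fix, every end event decremented, including "Resubmitted"

on_task_start()
on_task_end("Success")      # back to 0: the stage can eventually be cleaned up
on_task_end("Resubmitted")  # -1: activeTasks == 0 is never reached again
print(active_tasks)         # -1
```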

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT added.
Tested in a Thrift Server environment.

### The way to reproduce
I tried to reproduce it in spark-shell; it is a little bit involved:
1. Start spark-shell and set spark.dynamicAllocation.maxExecutors=2 for convenience:
   `bin/spark-shell --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8006"`
2. Run a job with a shuffle:
   `sc.parallelize(1 to 1000, 10).map { x => Thread.sleep(1000) ; (x % 3, x) }.reduceByKey((a, b) => a + b).collect()`
3. After some ShuffleMapTasks have finished, kill one or two executors to let tasks be resubmitted.
4. Check via a heap dump, the debugger, or the logs.

Closes #38702 from wineternity/SPARK-41187.

Authored-by: yuanyimeng 
Signed-off-by: Mridul Muralidharan gmail.com>
(cherry picked from commit 7e7bc940dcbbf918c7d571e1d27c7654ad387817)
Signed-off-by: Mridul Muralidharan 
---
 .../apache/spark/status/AppStatusListener.scala| 28 ++
 .../spark/status/AppStatusListenerSuite.scala  | 62 ++
 2 files changed, 80 insertions(+), 10 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala 
b/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
index 35c43b06c28..dfc36da83f3 100644
--- a/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
+++ b/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
@@ -74,7 +74,7 @@ private[spark] class AppStatusListener(
   private val liveStages = new ConcurrentHashMap[(Int, Int), LiveStage]()
   private val liveJobs = new HashMap[Int, LiveJob]()
   private[spark] val liveExecutors = new HashMap[String, LiveExecutor]()
-  private val deadExecutors = new HashMap[String, LiveExecutor]()
+  private[spark] val deadExecutors = new HashMap[String, LiveExecutor]()
   private val liveTasks = new HashMap[Long, LiveTask]()
   private val liveRDDs = new HashMap[Int, LiveRDD]()
   private val pools = new HashMap[String, SchedulerPool]()
@@ -673,22 +673,30 @@ private[spark] class AppStatusListener(
   delta
 }.orNull
 
-val (completedDelta, failedDelta, killedDelta) = event.reason match {
+// SPARK-41187: For `SparkListenerTaskEnd` with `Resubmitted` reason, 
which is raised by
+// executor lost, it can lead to negative `LiveStage.activeTasks` since 
there's no
+// corresponding `SparkListenerTaskStart` event for each of them. The 
negative activeTasks
+// will make the stage always remains in the live stage list as it can 
never meet the
+// condition activeTasks == 0. This in turn causes the dead executor to 
never be retained
+// if that live stage's submissionTime is less than the dead executor's 
removeTime.
+val (completedDelta, 

[spark] branch master updated: [SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen

2022-12-11 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7e7bc940dcb [SPARK-41187][CORE] LiveExecutor MemoryLeak in 
AppStatusListener when ExecutorLost happen
7e7bc940dcb is described below

commit 7e7bc940dcbbf918c7d571e1d27c7654ad387817
Author: yuanyimeng 
AuthorDate: Sun Dec 11 22:49:07 2022 -0600

[SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when 
ExecutorLost happen

### What changes were proposed in this pull request?
Ignore `SparkListenerTaskEnd` events with reason "Resubmitted" in AppStatusListener to avoid a memory leak.

### Why are the changes needed?
For a long-running Spark Thrift Server, LiveExecutor entries accumulate in the deadExecutors HashMap and slow down processing of the message event queue.
For every task, a `SparkListenerTaskStart` event and a `SparkListenerTaskEnd` event are normally sent out as a pair. But when an executor is lost, events are sent out as in the following steps:

a) A pair of task start and task end events was fired for the task (let us call it Tr).
b) When the executor that ran Tr was lost while the stage was still running, a task end event with reason `Resubmitted` was fired for Tr.
c) Subsequently, a new task start and task end event were fired for the retry of Tr.

The processing of the `Resubmitted` task end events in AppStatusListener can lead to a negative `LiveStage.activeTasks`, since there is no corresponding `SparkListenerTaskStart` event for each of them. The negative activeTasks makes the stage always remain in the live stage list, as it can never meet the condition activeTasks == 0. This in turn causes the dead executor to never be cleaned up if that live stage's submissionTime is less than the dead executor's removeTime (see isExecutor [...]

Check [SPARK-41187](https://issues.apache.org/jira/browse/SPARK-41187) for evidence.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT added.
Tested in a Thrift Server environment.

### The way to reproduce
I tried to reproduce it in spark-shell; it is a little bit involved:
1. Start spark-shell and set spark.dynamicAllocation.maxExecutors=2 for convenience:
   `bin/spark-shell --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8006"`
2. Run a job with a shuffle:
   `sc.parallelize(1 to 1000, 10).map { x => Thread.sleep(1000) ; (x % 3, x) }.reduceByKey((a, b) => a + b).collect()`
3. After some ShuffleMapTasks have finished, kill one or two executors to let tasks be resubmitted.
4. Check via a heap dump, the debugger, or the logs.

Closes #38702 from wineternity/SPARK-41187.

Authored-by: yuanyimeng 
Signed-off-by: Mridul Muralidharan gmail.com>
---
 .../apache/spark/status/AppStatusListener.scala| 28 ++
 .../spark/status/AppStatusListenerSuite.scala  | 62 ++
 2 files changed, 80 insertions(+), 10 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala 
b/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
index ea028dfd11d..287bf2165c9 100644
--- a/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
+++ b/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
@@ -74,7 +74,7 @@ private[spark] class AppStatusListener(
   private val liveStages = new ConcurrentHashMap[(Int, Int), LiveStage]()
   private val liveJobs = new HashMap[Int, LiveJob]()
   private[spark] val liveExecutors = new HashMap[String, LiveExecutor]()
-  private val deadExecutors = new HashMap[String, LiveExecutor]()
+  private[spark] val deadExecutors = new HashMap[String, LiveExecutor]()
   private val liveTasks = new HashMap[Long, LiveTask]()
   private val liveRDDs = new HashMap[Int, LiveRDD]()
   private val pools = new HashMap[String, SchedulerPool]()
@@ -674,22 +674,30 @@ private[spark] class AppStatusListener(
   delta
 }.orNull
 
-val (completedDelta, failedDelta, killedDelta) = event.reason match {
+// SPARK-41187: For `SparkListenerTaskEnd` with `Resubmitted` reason, 
which is raised by
+// executor lost, it can lead to negative `LiveStage.activeTasks` since 
there's no
+// corresponding `SparkListenerTaskStart` event for each of them. The 
negative activeTasks
+// will make the stage always remains in the live stage list as it can 
never meet the
+// condition activeTasks == 0. This in turn causes the dead executor to 
never be retained
+// if that live stage's submissionTime is less than the dead executor's 
removeTime.
+val (completedDelta, failedDelta, killedDelta, activeDelta) = event.reason 
match {
   case Success =>
-(1, 0, 0)
+(1, 0, 0, 1)

[spark] branch master updated: [SPARK-41439][CONNECT][PYTHON][FOLLOWUP] Make unpivot of `connect/dataframe.py` consistent with `pyspark/dataframe.py`

2022-12-11 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ef9c8e0d045 [SPARK-41439][CONNECT][PYTHON][FOLLOWUP] Make unpivot of 
`connect/dataframe.py` consistent with `pyspark/dataframe.py`
ef9c8e0d045 is described below

commit ef9c8e0d045576fb325ef337319fe6d59b7ce858
Author: Jiaan Geng 
AuthorDate: Mon Dec 12 08:34:21 2022 +0800

[SPARK-41439][CONNECT][PYTHON][FOLLOWUP] Make unpivot of 
`connect/dataframe.py` consistent with `pyspark/dataframe.py`

### What changes were proposed in this pull request?
This PR makes `unpivot` of `connect/dataframe.py` consistent with `pyspark/dataframe.py` and adds test cases for Connect's `unpivot`.

This PR follows up https://github.com/apache/spark/pull/38973

### Why are the changes needed?
1. Makes `unpivot` of `connect/dataframe.py` consistent with `pyspark/dataframe.py` (see the usage sketch below).
2. Adds test cases for Connect's `unpivot`.
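
A usage sketch of the now-consistent signature (illustrative data, assuming Spark 3.4+ where `DataFrame.unpivot` is available):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 11, 1.1), (2, 12, 1.2)],
    ["id", "int_val", "double_val"],
)

# A single column, a list/tuple of columns, or None are accepted for ids/values.
df.unpivot("id", ["int_val", "double_val"], "variable", "value").show()

# values=None unpivots every column that is not listed in ids.
df.unpivot("id", None, "variable", "value").show()
```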

### Does this PR introduce _any_ user-facing change?
'No'. New API

### How was this patch tested?
New test cases.

Closes #39019 from beliefer/SPARK-41439_followup.

Authored-by: Jiaan Geng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/dataframe.py| 22 ++---
 .../sql/tests/connect/test_connect_basic.py| 23 ++
 .../sql/tests/connect/test_connect_plan_only.py|  2 +-
 3 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 4c1956cc577..08d48bb11f2 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -826,8 +826,8 @@ class DataFrame(object):
 
 def unpivot(
 self,
-ids: List["ColumnOrName"],
-values: List["ColumnOrName"],
+ids: Optional[Union["ColumnOrName", List["ColumnOrName"], 
Tuple["ColumnOrName", ...]]],
+values: Optional[Union["ColumnOrName", List["ColumnOrName"], 
Tuple["ColumnOrName", ...]]],
 variableColumnName: str,
 valueColumnName: str,
 ) -> "DataFrame":
@@ -852,8 +852,24 @@ class DataFrame(object):
 ---
 :class:`DataFrame`
 """
+
+def to_jcols(
+cols: Optional[Union["ColumnOrName", List["ColumnOrName"], 
Tuple["ColumnOrName", ...]]]
+) -> List["ColumnOrName"]:
+if cols is None:
+lst = []
+elif isinstance(cols, tuple):
+lst = list(cols)
+elif isinstance(cols, list):
+lst = cols
+else:
+lst = [cols]
+return lst
+
 return DataFrame.withPlan(
-plan.Unpivot(self._plan, ids, values, variableColumnName, 
valueColumnName),
+plan.Unpivot(
+self._plan, to_jcols(ids), to_jcols(values), 
variableColumnName, valueColumnName
+),
 self._session,
 )
 
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 9d49cfd321c..6dabbaedffe 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -783,6 +783,29 @@ class SparkConnectTests(SparkConnectSQLTestCase):
 """Cannot resolve column name "x" among (a, b, c)""", 
str(context.exception)
 )
 
+def test_unpivot(self):
+self.assert_eq(
+self.connect.read.table(self.tbl_name)
+.filter("id > 3")
+.unpivot(["id"], ["name"], "variable", "value")
+.toPandas(),
+self.spark.read.table(self.tbl_name)
+.filter("id > 3")
+.unpivot(["id"], ["name"], "variable", "value")
+.toPandas(),
+)
+
+self.assert_eq(
+self.connect.read.table(self.tbl_name)
+.filter("id > 3")
+.unpivot("id", None, "variable", "value")
+.toPandas(),
+self.spark.read.table(self.tbl_name)
+.filter("id > 3")
+.unpivot("id", None, "variable", "value")
+.toPandas(),
+)
+
 def test_with_columns(self):
 # SPARK-41256: test withColumn(s).
 self.assert_eq(
diff --git a/python/pyspark/sql/tests/connect/test_connect_plan_only.py 
b/python/pyspark/sql/tests/connect/test_connect_plan_only.py
index 83e21e42bad..e0cd54195f3 100644
--- a/python/pyspark/sql/tests/connect/test_connect_plan_only.py
+++ b/python/pyspark/sql/tests/connect/test_connect_plan_only.py
@@ -189,7 +189,7 @@ class SparkConnectTestsPlanOnly(PlanOnlyTestFixture):
 
 plan = (
 df.filter(df.col_name > 3)
-.unpivot(["id"], [], "variable", "value")
+   

[spark] branch master updated (a4042619633 -> 7cf348c56cc)

2022-12-11 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from a4042619633 [SPARK-41479][K8S][DOCS] Add `IPv4 and IPv6` section to 
K8s document
 add 7cf348c56cc [SPARK-41477][CONNECT][PYTHON] Correctly infer the 
datatype of literal integers

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/column.py   | 13 +-
 .../sql/tests/connect/test_connect_column.py   | 51 ++
 .../connect/test_connect_column_expressions.py | 13 +++---
 3 files changed, 71 insertions(+), 6 deletions(-)
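
For context, a small sketch of what the inference affects; the mapping assumed here (post-fix) is that a literal integer fitting in 32 bits becomes an integer column and a larger one becomes a long column:
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1).select(
    F.lit(1).alias("small"),      # expected: integer
    F.lit(1 << 40).alias("big"),  # expected: long
)
df.printSchema()
```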





[spark] branch master updated: [SPARK-41479][K8S][DOCS] Add `IPv4 and IPv6` section to K8s document

2022-12-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a4042619633 [SPARK-41479][K8S][DOCS] Add `IPv4 and IPv6` section to 
K8s document
a4042619633 is described below

commit a40426196333e2dd03660b007cd714de8b85052a
Author: Dongjoon Hyun 
AuthorDate: Sun Dec 11 13:06:30 2022 -0800

[SPARK-41479][K8S][DOCS] Add `IPv4 and IPv6` section to K8s document

### What changes were proposed in this pull request?

This PR aims to add `IPv4 and IPv6` section to K8s document.

### Why are the changes needed?

To document the IPv6-only environment more explicitly.
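
As a complement to the doc snippet in the diff below, a hedged sketch of supplying the same settings programmatically (the master URL and container image are placeholders; the two ipFamily properties are the ones documented in this commit):
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")  # placeholder API server
    .config("spark.kubernetes.container.image", "apache/spark:latest")
    .config("spark.kubernetes.driver.service.ipFamilyPolicy", "SingleStack")
    .config("spark.kubernetes.driver.service.ipFamilies", "IPv6")
    .getOrCreate()
)
```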

### Does this PR introduce _any_ user-facing change?

No. This is a documentation PR.

### How was this patch tested?

Manually review.

Closes #39022 from dongjoon-hyun/SPARK-41479.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 docs/running-on-kubernetes.md | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md
index 08580a77eb2..21c81c508e1 100644
--- a/docs/running-on-kubernetes.md
+++ b/docs/running-on-kubernetes.md
@@ -204,6 +204,26 @@ When this property is set, it's highly recommended to make 
it unique across all
 
 Use the exact prefix `spark.kubernetes.authenticate` for Kubernetes 
authentication parameters in client mode.
 
+## IPv4 and IPv6
+
+Starting with 3.4.0, Spark supports additionally IPv6-only environment via
+[IPv4/IPv6 dual-stack 
network](https://kubernetes.io/docs/concepts/services-networking/dual-stack/)
+feature which enables the allocation of both IPv4 and IPv6 addresses to Pods 
and Services.
+According to the K8s cluster capability, 
`spark.kubernetes.driver.service.ipFamilyPolicy` and
+`spark.kubernetes.driver.service.ipFamilies` can be one of `SingleStack`, 
`PreferDualStack`,
+and `RequireDualStack` and one of `IPv4`, `IPv6`, `IPv4,IPv6`, and `IPv6,IPv4` 
respectively.
+By default, Spark uses 
`spark.kubernetes.driver.service.ipFamilyPolicy=SingleStack` and
+`spark.kubernetes.driver.service.ipFamilies=IPv4`.
+
+To use only `IPv6`, you can submit your jobs with the following.
+```bash
+...
+--conf spark.kubernetes.driver.service.ipFamilies=IPv6 \
+```
+
+In `DualStack` environment, you may need `java.net.preferIPv6Addresses=true` 
for JVM
+and `SPARK_PREFER_IPV6=true` for Python additionally to use `IPv6`.
+
 ## Dependency Management
 
 If your application's dependencies are all hosted in remote locations like 
HDFS or HTTP servers, they may be referred to
@@ -1418,7 +1438,8 @@ See the [configuration page](configuration.html) for 
information on Spark config
   spark.kubernetes.driver.service.ipFamilyPolicy
   SingleStack
   
-K8s IP Family Policy for Driver Service.
+K8s IP Family Policy for Driver Service. Valid values are
+SingleStack, PreferDualStack, and 
RequireDualStack.
   
   3.4.0
 
@@ -1426,7 +1447,8 @@ See the [configuration page](configuration.html) for 
information on Spark config
   spark.kubernetes.driver.service.ipFamilies
   IPv4
   
-A list of IP families for K8s Driver Service.
+A list of IP families for K8s Driver Service. Valid values are
+IPv4 and IPv6.
   
   3.4.0
 





[spark] branch master updated: [SPARK-41008][MLLIB] Follow-up isotonic regression features deduplica…

2022-12-11 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f92c827acab [SPARK-41008][MLLIB] Follow-up isotonic regression 
features deduplica…
f92c827acab is described below

commit f92c827acabccf547d5a1dff4f7ec371bc370230
Author: Ahmed Mahran 
AuthorDate: Sun Dec 11 15:01:15 2022 -0600

[SPARK-41008][MLLIB] Follow-up isotonic regression features deduplica…

### What changes were proposed in this pull request?

A follow-up on https://github.com/apache/spark/pull/38966 to update the relevant documentation and remove the redundant sort key.

### Why are the changes needed?

For isotonic regression, another method for breaking ties of repeated features was introduced in https://github.com/apache/spark/pull/38966. It aggregates points that have the same feature value by computing the weighted average of their labels (a small sketch of this rule follows below).
- This only requires points to be sorted by feature instead of by feature and label, so we should remove label as a secondary sorting key.
- The isotonic regression documentation needs to be updated to reflect the new behavior.
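
A tiny sketch of the aggregation rule (plain Python, not MLlib's implementation): points are (label, feature, weight) tuples, and ties on feature are merged into one point whose label is the weighted average and whose weight is the sum of weights.
```python
from collections import defaultdict

# Two points share feature 2.0 and get merged.
points = [(1.0, 2.0, 1.0), (3.0, 2.0, 3.0), (5.0, 4.0, 1.0)]

acc = defaultdict(lambda: [0.0, 0.0])  # feature -> [sum(label*weight), sum(weight)]
for label, feature, weight in points:
    acc[feature][0] += label * weight
    acc[feature][1] += weight

# Aggregated label = weighted average of labels, aggregated weight = sum of weights.
deduped = [(lw / w, f, w) for f, (lw, w) in sorted(acc.items())]
print(deduped)  # [(2.5, 2.0, 4.0), (5.0, 4.0, 1.0)]
```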

### Does this PR introduce _any_ user-facing change?

Isotonic regression documentation update. The documentation described the behavior of the algorithm when there are points in the input with repeated features. Since this behavior has changed, the documentation needs to describe the new behavior.

### How was this patch tested?

Existing tests passed. No need to add new tests since existing tests are 
already comprehensive.

srowen

Closes #38996 from ahmed-mahran/ml-isotonic-reg-dups-follow-up.

Authored-by: Ahmed Mahran 
Signed-off-by: Sean Owen 
---
 docs/mllib-isotonic-regression.md  | 18 ++---
 .../mllib/regression/IsotonicRegression.scala  | 82 ++
 .../mllib/regression/IsotonicRegressionSuite.scala | 29 +---
 3 files changed, 67 insertions(+), 62 deletions(-)

diff --git a/docs/mllib-isotonic-regression.md 
b/docs/mllib-isotonic-regression.md
index 95be32a819e..711e828bd80 100644
--- a/docs/mllib-isotonic-regression.md
+++ b/docs/mllib-isotonic-regression.md
@@ -43,7 +43,14 @@ best fitting the original data points.
 which uses an approach to
 [parallelizing isotonic 
regression](https://doi.org/10.1007/978-3-642-99789-1_10).
 The training input is an RDD of tuples of three double values that represent
-label, feature and weight in this order. Additionally, IsotonicRegression 
algorithm has one
+label, feature and weight in this order. In case there are multiple tuples with
+the same feature then these tuples are aggregated into a single tuple as 
follows:
+
+* Aggregated label is the weighted average of all labels.
+* Aggregated feature is the unique feature value.
+* Aggregated weight is the sum of all weights.
+
+Additionally, IsotonicRegression algorithm has one
 optional parameter called $isotonic$ defaulting to true.
 This argument specifies if the isotonic regression is
 isotonic (monotonically increasing) or antitonic (monotonically decreasing).
@@ -53,17 +60,12 @@ labels for both known and unknown features. The result of 
isotonic regression
 is treated as piecewise linear function. The rules for prediction therefore 
are:
 
 * If the prediction input exactly matches a training feature
-  then associated prediction is returned. In case there are multiple 
predictions with the same
-  feature then one of them is returned. Which one is undefined
-  (same as java.util.Arrays.binarySearch).
+  then associated prediction is returned.
 * If the prediction input is lower or higher than all training features
   then prediction with lowest or highest feature is returned respectively.
-  In case there are multiple predictions with the same feature
-  then the lowest or highest is returned respectively.
 * If the prediction input falls between two training features then prediction 
is treated
   as piecewise linear function and interpolated value is calculated from the
-  predictions of the two closest features. In case there are multiple values
-  with the same feature then the same rules as in previous point are used.
+  predictions of the two closest features.
 
 ### Examples
 
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala
index 0b2bf147501..fbf0dc9c357 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala
@@ -23,7 +23,6 @@ import java.util.Arrays.binarySearch
 import scala.collection.JavaConverters._
 import scala.collection.mutable.ArrayBuffer
 
-import