[spark] branch master updated: [MINOR][PYTHON][DOCS] Correct the type hint for `from_csv`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 39528336b86 [MINOR][PYTHON][DOCS] Correct the type hint for `from_csv`

39528336b86 is described below

commit 39528336b8672aa544c06a97d95d04d5067c4707
Author: Ruifeng Zheng
AuthorDate: Mon Dec 12 16:45:36 2022 +0900

[MINOR][PYTHON][DOCS] Correct the type hint for `from_csv`

### What changes were proposed in this pull request?
Correct the type hint for `from_csv`.

### Why are the changes needed?
`from_csv` does not currently support `StructType`:
https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L8084-L8090

### Does this PR introduce _any_ user-facing change?
Only the generated documentation.

### How was this patch tested?
Existing unit tests.

Closes #39030 from zhengruifeng/py_from_csv_schema.

Authored-by: Ruifeng Zheng
Signed-off-by: Hyukjin Kwon
---
 python/pyspark/sql/functions.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 4d39424762b..de540c62499 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -8036,7 +8036,7 @@ def sequence(
 def from_csv(
     col: "ColumnOrName",
-    schema: Union[StructType, Column, str],
+    schema: Union[Column, str],
     options: Optional[Dict[str, str]] = None,
 ) -> Column:
     """

---
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
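The narrowed hint reflects a runtime restriction: `from_csv` accepts the schema as a DDL-formatted string or as a `Column` (e.g. the result of `schema_of_csv`), but not as a `StructType`. A minimal pure-Python sketch of such a check — the `Column` stand-in and the helper name are hypothetical, not PySpark's actual code:

```python
# Illustrative sketch only: a stand-in Column class and a validator that
# mirrors the narrowed hint Union[Column, str]. Real from_csv lives in
# pyspark.sql.functions and performs an equivalent isinstance check.

class Column:
    """Stand-in for pyspark.sql.Column, just for this sketch."""
    def __init__(self, expr: str):
        self.expr = expr

def validate_from_csv_schema(schema):
    # Accept a DDL string or a Column; reject everything else (including
    # StructType-like objects, which the old hint wrongly advertised).
    if isinstance(schema, (str, Column)):
        return schema
    raise TypeError(
        f"schema argument should be a column or string, got {type(schema).__name__}"
    )

print(validate_from_csv_schema("a INT, b STRING"))  # a DDL string passes through
```

The same pattern explains why the docs change is purely cosmetic: the runtime behavior was already string-or-Column only.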
[spark] branch master updated (cd2f78657ce -> b8a91da0f87)
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from cd2f78657ce [SPARK-41463][SQL][TESTS] Ensure error class names contain only capital letters, numbers and underscores
 add b8a91da0f87 [SPARK-41484][CONNECT][PYTHON] Implement `collection` functions: E~M

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/functions.py            | 942 +
 .../sql/tests/connect/test_connect_function.py     | 274 ++
 2 files changed, 1216 insertions(+)
[spark] branch master updated (5d52bb36d3b -> cd2f78657ce)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 5d52bb36d3b [SPARK-41486][SQL][TESTS] Upgrade `MySQL` docker image to 8.0.31 to support `ARM64` test
 add cd2f78657ce [SPARK-41463][SQL][TESTS] Ensure error class names contain only capital letters, numbers and underscores

No new revisions were added by this update.

Summary of changes:
 .../test/scala/org/apache/spark/SparkThrowableSuite.scala | 12
 1 file changed, 12 insertions(+)
[spark] branch master updated: [SPARK-41486][SQL][TESTS] Upgrade `MySQL` docker image to 8.0.31 to support `ARM64` test
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5d52bb36d3b [SPARK-41486][SQL][TESTS] Upgrade `MySQL` docker image to 8.0.31 to support `ARM64` test

5d52bb36d3b is described below

commit 5d52bb36d3ba7e619169289f8a576db5ea10990f
Author: Dongjoon Hyun
AuthorDate: Sun Dec 11 22:23:17 2022 -0800

[SPARK-41486][SQL][TESTS] Upgrade `MySQL` docker image to 8.0.31 to support `ARM64` test

### What changes were proposed in this pull request?
This PR aims to upgrade the MySQL JDBC integration test docker image from 5.7.36 to 8.0.31 to support ARM64 testing.

### Why are the changes needed?
Since MySQL 8.0, MySQL provides multi-arch images. After this upgrade, we can run the MySQL JDBC integration tests in an ARM64 environment.

![Screenshot 2022-12-11 at 8 31 48 PM](https://user-images.githubusercontent.com/9700541/206961236-c958e6fc-1a2d-40cf-bf53-6f72a312c844.png)

### Does this PR introduce _any_ user-facing change?
No, this is a test code PR.

### How was this patch tested?
- Pass GitHub Actions CI for the AMD64 architecture.
- Manually run the MySQL integration suites on the ARM64 architecture. I manually ran this PR on Apple Silicon:
```
$ ./build/sbt -Pdocker-integration-tests "testOnly *MySQLIntegrationSuite"
[info] MySQLIntegrationSuite:
[info] - SPARK-33034: ALTER TABLE ... add new columns (4 seconds, 766 milliseconds)
[info] - SPARK-33034: ALTER TABLE ... drop column (389 milliseconds)
[info] - SPARK-33034: ALTER TABLE ... update column type (285 milliseconds)
[info] - SPARK-33034: ALTER TABLE ... rename column (206 milliseconds)
[info] - SPARK-33034: ALTER TABLE ... update column nullability (167 milliseconds)
[info] - CREATE TABLE with table comment (98 milliseconds)
[info] - CREATE TABLE with table property (113 milliseconds)
[info] - SPARK-36895: Test INDEX Using SQL (559 milliseconds)
[info] - SPARK-37038: Test TABLESAMPLE (0 milliseconds)
[info] - scan with aggregate push-down: VAR_POP without DISTINCT (3 seconds, 718 milliseconds)
[info] - scan with aggregate push-down: VAR_SAMP without DISTINCT (602 milliseconds)
[info] - scan with aggregate push-down: STDDEV_POP without DISTINCT (624 milliseconds)
[info] - scan with aggregate push-down: STDDEV_SAMP without DISTINCT (390 milliseconds)
[info] MySQLIntegrationSuite:
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for mllib / Test / testOnly
[info] - Basic test (432 milliseconds)
[info] - Numeric types (324 milliseconds)
[info] - Date types (871 milliseconds)
[info] - String types (302 milliseconds)
[info] - Basic write test (1 second, 341 milliseconds)
[info] - query JDBC option (334 milliseconds)
[info] Run completed in 1 minute, 15 seconds.
[info] Total number of tests run: 19
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

$ ./build/sbt -Pdocker-integration-tests "testOnly *MySQLNamespaceSuite"
[info] MySQLNamespaceSuite:
[info] - listNamespaces: basic behavior (1 second, 367 milliseconds)
[info] - Drop namespace (231 milliseconds)
[info] - Create or remove comment of namespace unsupported (186 milliseconds)
[info] Run completed in 33 seconds, 456 milliseconds.
[info] Total number of tests run: 3
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 3, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

Closes #39029 from dongjoon-hyun/SPARK-41486.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 .../scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala     | 6 +++---
 .../scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  | 6 +++---
 .../scala/org/apache/spark/sql/jdbc/v2/MySQLNamespaceSuite.scala    | 6 +++---
 .../scala/org/apache/spark/sql/jdbc/v2/V2JDBCNamespaceTest.scala    | 3 ++-
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
index 2bde50d01eb..bc202b1b832 100644
--- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
+++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
@@ -26,9 +26,9 @@ import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._
 import org.apache.spark.tags.DockerTest

 /**
- * To run this test suite for a specific version (e.g.,
[spark] branch branch-3.3 updated: [SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 1c6cb350051 [SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen

1c6cb350051 is described below

commit 1c6cb350051e61675b96bc1de8b9fc542bc53ce7
Author: yuanyimeng
AuthorDate: Sun Dec 11 22:49:07 2022 -0600

[SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen

### What changes were proposed in this pull request?
Ignore `SparkListenerTaskEnd` events with reason "Resubmitted" in AppStatusListener to avoid a memory leak.

### Why are the changes needed?
For a long-running Spark Thrift Server, `LiveExecutor` instances accumulate in the `deadExecutors` HashMap and slow down processing of the message event queue.

Normally, every task fires a `SparkListenerTaskStart` event and a `SparkListenerTaskEnd` event as a pair. But when an executor is lost, events are fired as follows:
a) A pair of task start and task end events was fired for the task (call it Tr).
b) When the executor that ran Tr was lost while the stage was still running, a task end event with reason `Resubmitted` was fired for Tr.
c) Subsequently, a new task start and task end were fired for the retry of Tr.

Processing the `Resubmitted` task end event in AppStatusListener can lead to negative `LiveStage.activeTasks`, since there is no corresponding `SparkListenerTaskStart` event for it. The negative activeTasks makes the stage remain in the live stage list forever, as it can never meet the condition activeTasks == 0. This in turn causes the dead executor to never be cleaned up if that live stage's submissionTime is less than the dead executor's removeTime (see isExecutor [...]

Check [SPARK-41187](https://issues.apache.org/jira/browse/SPARK-41187) for evidence.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New unit test added; also tested in a Thrift Server environment.

### The way to reproduce
Reproducing it in spark-shell is a little involved:
1. Start spark-shell and set spark.dynamicAllocation.maxExecutors=2 for convenience:
   `bin/spark-shell --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8006"`
2. Run a job with a shuffle:
   `sc.parallelize(1 to 1000, 10).map { x => Thread.sleep(1000) ; (x % 3, x) }.reduceByKey((a, b) => a + b).collect()`
3. After some ShuffleMapTasks have finished, kill one or two executors so that tasks are resubmitted.
4. Check via heap dump, debugger, or logs.

Closes #38702 from wineternity/SPARK-41187.

Authored-by: yuanyimeng
Signed-off-by: Mridul Muralidharan gmail.com>
(cherry picked from commit 7e7bc940dcbbf918c7d571e1d27c7654ad387817)
Signed-off-by: Mridul Muralidharan
---
 .../apache/spark/status/AppStatusListener.scala    | 28 ++
 .../spark/status/AppStatusListenerSuite.scala      | 62 ++
 2 files changed, 80 insertions(+), 10 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala b/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
index 35c43b06c28..dfc36da83f3 100644
--- a/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
+++ b/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
@@ -74,7 +74,7 @@ private[spark] class AppStatusListener(
   private val liveStages = new ConcurrentHashMap[(Int, Int), LiveStage]()
   private val liveJobs = new HashMap[Int, LiveJob]()
   private[spark] val liveExecutors = new HashMap[String, LiveExecutor]()
-  private val deadExecutors = new HashMap[String, LiveExecutor]()
+  private[spark] val deadExecutors = new HashMap[String, LiveExecutor]()
   private val liveTasks = new HashMap[Long, LiveTask]()
   private val liveRDDs = new HashMap[Int, LiveRDD]()
   private val pools = new HashMap[String, SchedulerPool]()
@@ -673,22 +673,30 @@ private[spark] class AppStatusListener(
       delta
     }.orNull

-    val (completedDelta, failedDelta, killedDelta) = event.reason match {
+    // SPARK-41187: For `SparkListenerTaskEnd` with `Resubmitted` reason, which is raised by
+    // executor lost, it can lead to negative `LiveStage.activeTasks` since there's no
+    // corresponding `SparkListenerTaskStart` event for each of them. The negative activeTasks
+    // will make the stage always remains in the live stage list as it can never meet the
+    // condition activeTasks == 0. This in turn causes the dead executor to never be retained
+    // if that live stage's submissionTime is less than the dead executor's removeTime.
+    val (completedDelta,
[spark] branch master updated: [SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 7e7bc940dcb [SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen

7e7bc940dcb is described below

commit 7e7bc940dcbbf918c7d571e1d27c7654ad387817
Author: yuanyimeng
AuthorDate: Sun Dec 11 22:49:07 2022 -0600

[SPARK-41187][CORE] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen

### What changes were proposed in this pull request?
Ignore `SparkListenerTaskEnd` events with reason "Resubmitted" in AppStatusListener to avoid a memory leak.

### Why are the changes needed?
For a long-running Spark Thrift Server, `LiveExecutor` instances accumulate in the `deadExecutors` HashMap and slow down processing of the message event queue.

Normally, every task fires a `SparkListenerTaskStart` event and a `SparkListenerTaskEnd` event as a pair. But when an executor is lost, events are fired as follows:
a) A pair of task start and task end events was fired for the task (call it Tr).
b) When the executor that ran Tr was lost while the stage was still running, a task end event with reason `Resubmitted` was fired for Tr.
c) Subsequently, a new task start and task end were fired for the retry of Tr.

Processing the `Resubmitted` task end event in AppStatusListener can lead to negative `LiveStage.activeTasks`, since there is no corresponding `SparkListenerTaskStart` event for it. The negative activeTasks makes the stage remain in the live stage list forever, as it can never meet the condition activeTasks == 0. This in turn causes the dead executor to never be cleaned up if that live stage's submissionTime is less than the dead executor's removeTime (see isExecutor [...]

Check [SPARK-41187](https://issues.apache.org/jira/browse/SPARK-41187) for evidence.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New unit test added; also tested in a Thrift Server environment.

### The way to reproduce
Reproducing it in spark-shell is a little involved:
1. Start spark-shell and set spark.dynamicAllocation.maxExecutors=2 for convenience:
   `bin/spark-shell --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8006"`
2. Run a job with a shuffle:
   `sc.parallelize(1 to 1000, 10).map { x => Thread.sleep(1000) ; (x % 3, x) }.reduceByKey((a, b) => a + b).collect()`
3. After some ShuffleMapTasks have finished, kill one or two executors so that tasks are resubmitted.
4. Check via heap dump, debugger, or logs.

Closes #38702 from wineternity/SPARK-41187.

Authored-by: yuanyimeng
Signed-off-by: Mridul Muralidharan gmail.com>
---
 .../apache/spark/status/AppStatusListener.scala    | 28 ++
 .../spark/status/AppStatusListenerSuite.scala      | 62 ++
 2 files changed, 80 insertions(+), 10 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala b/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
index ea028dfd11d..287bf2165c9 100644
--- a/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
+++ b/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
@@ -74,7 +74,7 @@ private[spark] class AppStatusListener(
   private val liveStages = new ConcurrentHashMap[(Int, Int), LiveStage]()
   private val liveJobs = new HashMap[Int, LiveJob]()
   private[spark] val liveExecutors = new HashMap[String, LiveExecutor]()
-  private val deadExecutors = new HashMap[String, LiveExecutor]()
+  private[spark] val deadExecutors = new HashMap[String, LiveExecutor]()
   private val liveTasks = new HashMap[Long, LiveTask]()
   private val liveRDDs = new HashMap[Int, LiveRDD]()
   private val pools = new HashMap[String, SchedulerPool]()
@@ -674,22 +674,30 @@ private[spark] class AppStatusListener(
       delta
     }.orNull

-    val (completedDelta, failedDelta, killedDelta) = event.reason match {
+    // SPARK-41187: For `SparkListenerTaskEnd` with `Resubmitted` reason, which is raised by
+    // executor lost, it can lead to negative `LiveStage.activeTasks` since there's no
+    // corresponding `SparkListenerTaskStart` event for each of them. The negative activeTasks
+    // will make the stage always remains in the live stage list as it can never meet the
+    // condition activeTasks == 0. This in turn causes the dead executor to never be retained
+    // if that live stage's submissionTime is less than the dead executor's removeTime.
+    val (completedDelta, failedDelta, killedDelta, activeDelta) = event.reason match {
       case Success =>
-        (1, 0, 0)
+        (1, 0, 0, 1)
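The bookkeeping change above can be sketched in a few lines of pure Python (the real code is Scala in AppStatusListener; the helper name and the generic-failure branch here are illustrative only). The key point is that a `Resubmitted` task end, which has no matching task-start event, must contribute nothing to the stage's active-task count:

```python
# Illustrative sketch of the fixed delta tuple:
# (completedDelta, failedDelta, killedDelta, activeDelta), mirroring the
# Scala match on event.reason. Only Success and Resubmitted are taken from
# the diff; the generic "failure" branch is a simplification.

def task_end_deltas(reason: str):
    if reason == "Success":
        return (1, 0, 0, 1)
    if reason == "Resubmitted":
        # SPARK-41187: no corresponding task start was counted, so this
        # event must not decrement activeTasks.
        return (0, 0, 0, 0)
    return (0, 1, 0, 1)  # simplified stand-in for other end reasons

# Scenario from the description: task Tr starts and succeeds, then its
# executor is lost and a "Resubmitted" task end is fired for Tr.
active_tasks = 0
active_tasks += 1                                  # SparkListenerTaskStart for Tr
active_tasks -= task_end_deltas("Success")[3]      # normal task end -> 0
active_tasks -= task_end_deltas("Resubmitted")[3]  # stays 0 instead of going to -1
print(active_tasks)  # 0
```

With the pre-fix behavior (decrementing unconditionally), the final count would be -1, which is exactly the state that kept the stage live and leaked dead executors.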
[spark] branch master updated: [SPARK-41439][CONNECT][PYTHON][FOLLOWUP] Make unpivot of `connect/dataframe.py` consistent with `pyspark/dataframe.py`
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new ef9c8e0d045 [SPARK-41439][CONNECT][PYTHON][FOLLOWUP] Make unpivot of `connect/dataframe.py` consistent with `pyspark/dataframe.py`

ef9c8e0d045 is described below

commit ef9c8e0d045576fb325ef337319fe6d59b7ce858
Author: Jiaan Geng
AuthorDate: Mon Dec 12 08:34:21 2022 +0800

[SPARK-41439][CONNECT][PYTHON][FOLLOWUP] Make unpivot of `connect/dataframe.py` consistent with `pyspark/dataframe.py`

### What changes were proposed in this pull request?
This PR makes `unpivot` of `connect/dataframe.py` consistent with `pyspark/dataframe.py` and adds test cases for Connect's `unpivot`. This PR follows up https://github.com/apache/spark/pull/38973

### Why are the changes needed?
1. Make `unpivot` of `connect/dataframe.py` consistent with `pyspark/dataframe.py`.
2. Add test cases for Connect's `unpivot`.

### Does this PR introduce _any_ user-facing change?
'No'. New API.

### How was this patch tested?
New test cases.

Closes #39019 from beliefer/SPARK-41439_followup.

Authored-by: Jiaan Geng
Signed-off-by: Ruifeng Zheng
---
 python/pyspark/sql/connect/dataframe.py            | 22 ++---
 .../sql/tests/connect/test_connect_basic.py        | 23 ++
 .../sql/tests/connect/test_connect_plan_only.py    |  2 +-
 3 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py
index 4c1956cc577..08d48bb11f2 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -826,8 +826,8 @@ class DataFrame(object):

     def unpivot(
         self,
-        ids: List["ColumnOrName"],
-        values: List["ColumnOrName"],
+        ids: Optional[Union["ColumnOrName", List["ColumnOrName"], Tuple["ColumnOrName", ...]]],
+        values: Optional[Union["ColumnOrName", List["ColumnOrName"], Tuple["ColumnOrName", ...]]],
         variableColumnName: str,
         valueColumnName: str,
     ) -> "DataFrame":
@@ -852,8 +852,24 @@ class DataFrame(object):
         -------
         :class:`DataFrame`
         """
+
+        def to_jcols(
+            cols: Optional[Union["ColumnOrName", List["ColumnOrName"], Tuple["ColumnOrName", ...]]]
+        ) -> List["ColumnOrName"]:
+            if cols is None:
+                lst = []
+            elif isinstance(cols, tuple):
+                lst = list(cols)
+            elif isinstance(cols, list):
+                lst = cols
+            else:
+                lst = [cols]
+            return lst
+
         return DataFrame.withPlan(
-            plan.Unpivot(self._plan, ids, values, variableColumnName, valueColumnName),
+            plan.Unpivot(
+                self._plan, to_jcols(ids), to_jcols(values), variableColumnName, valueColumnName
+            ),
             self._session,
         )

diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 9d49cfd321c..6dabbaedffe 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -783,6 +783,29 @@ class SparkConnectTests(SparkConnectSQLTestCase):
             """Cannot resolve column name "x" among (a, b, c)""", str(context.exception)
         )

+    def test_unpivot(self):
+        self.assert_eq(
+            self.connect.read.table(self.tbl_name)
+            .filter("id > 3")
+            .unpivot(["id"], ["name"], "variable", "value")
+            .toPandas(),
+            self.spark.read.table(self.tbl_name)
+            .filter("id > 3")
+            .unpivot(["id"], ["name"], "variable", "value")
+            .toPandas(),
+        )
+
+        self.assert_eq(
+            self.connect.read.table(self.tbl_name)
+            .filter("id > 3")
+            .unpivot("id", None, "variable", "value")
+            .toPandas(),
+            self.spark.read.table(self.tbl_name)
+            .filter("id > 3")
+            .unpivot("id", None, "variable", "value")
+            .toPandas(),
+        )
+
     def test_with_columns(self):
         # SPARK-41256: test withColumn(s).
         self.assert_eq(

diff --git a/python/pyspark/sql/tests/connect/test_connect_plan_only.py b/python/pyspark/sql/tests/connect/test_connect_plan_only.py
index 83e21e42bad..e0cd54195f3 100644
--- a/python/pyspark/sql/tests/connect/test_connect_plan_only.py
+++ b/python/pyspark/sql/tests/connect/test_connect_plan_only.py
@@ -189,7 +189,7 @@ class SparkConnectTestsPlanOnly(PlanOnlyTestFixture):
         plan = (
             df.filter(df.col_name > 3)
-            .unpivot(["id"], [], "variable", "value")
+
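The `to_jcols` helper in the diff above is plain Python, so its normalization behavior is easy to check standalone. Here it is reproduced outside of PySpark (the `ColumnOrName` stand-in is a simplification; in PySpark it also covers `Column` objects):

```python
# Standalone reproduction of the to_jcols normalization: a single column
# name, a list, a tuple, or None are all normalized to a list, which is
# what lets unpivot accept the same flexible argument forms as
# pyspark/dataframe.py.
from typing import List, Optional, Tuple, Union

ColumnOrName = str  # simplified stand-in for this sketch

def to_jcols(
    cols: Optional[Union[ColumnOrName, List[ColumnOrName], Tuple[ColumnOrName, ...]]]
) -> List[ColumnOrName]:
    if cols is None:
        return []
    if isinstance(cols, tuple):
        return list(cols)
    if isinstance(cols, list):
        return cols
    return [cols]

print(to_jcols(None), to_jcols("id"), to_jcols(("a", "b")))  # [] ['id'] ['a', 'b']
```

This is why the new tests can pass both `.unpivot(["id"], ...)` and `.unpivot("id", None, ...)` and expect identical plans.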
[spark] branch master updated (a4042619633 -> 7cf348c56cc)
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from a4042619633 [SPARK-41479][K8S][DOCS] Add `IPv4 and IPv6` section to K8s document
 add 7cf348c56cc [SPARK-41477][CONNECT][PYTHON] Correctly infer the datatype of literal integers

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/column.py               | 13 +-
 .../sql/tests/connect/test_connect_column.py       | 51 ++
 .../connect/test_connect_column_expressions.py     | 13 +++---
 3 files changed, 71 insertions(+), 6 deletions(-)
[spark] branch master updated: [SPARK-41479][K8S][DOCS] Add `IPv4 and IPv6` section to K8s document
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a4042619633 [SPARK-41479][K8S][DOCS] Add `IPv4 and IPv6` section to K8s document

a4042619633 is described below

commit a40426196333e2dd03660b007cd714de8b85052a
Author: Dongjoon Hyun
AuthorDate: Sun Dec 11 13:06:30 2022 -0800

[SPARK-41479][K8S][DOCS] Add `IPv4 and IPv6` section to K8s document

### What changes were proposed in this pull request?
This PR aims to add an `IPv4 and IPv6` section to the K8s document.

### Why are the changes needed?
To document the IPv6-only environment more explicitly.

### Does this PR introduce _any_ user-facing change?
No. This is a documentation PR.

### How was this patch tested?
Manual review.

Closes #39022 from dongjoon-hyun/SPARK-41479.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 docs/running-on-kubernetes.md | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md
index 08580a77eb2..21c81c508e1 100644
--- a/docs/running-on-kubernetes.md
+++ b/docs/running-on-kubernetes.md
@@ -204,6 +204,26 @@ When this property is set, it's highly recommended to make it unique across all
 Use the exact prefix `spark.kubernetes.authenticate` for Kubernetes authentication parameters in client mode.

+## IPv4 and IPv6
+
+Starting with 3.4.0, Spark supports additionally IPv6-only environment via
+[IPv4/IPv6 dual-stack network](https://kubernetes.io/docs/concepts/services-networking/dual-stack/)
+feature which enables the allocation of both IPv4 and IPv6 addresses to Pods and Services.
+According to the K8s cluster capability, `spark.kubernetes.driver.service.ipFamilyPolicy` and
+`spark.kubernetes.driver.service.ipFamilies` can be one of `SingleStack`, `PreferDualStack`,
+and `RequireDualStack` and one of `IPv4`, `IPv6`, `IPv4,IPv6`, and `IPv6,IPv4` respectively.
+By default, Spark uses `spark.kubernetes.driver.service.ipFamilyPolicy=SingleStack` and
+`spark.kubernetes.driver.service.ipFamilies=IPv4`.
+
+To use only `IPv6`, you can submit your jobs with the following.
+```bash
+...
+--conf spark.kubernetes.driver.service.ipFamilies=IPv6 \
+```
+
+In `DualStack` environment, you may need `java.net.preferIPv6Addresses=true` for JVM
+and `SPARK_PREFER_IPV6=true` for Python additionally to use `IPv6`.
+
 ## Dependency Management

 If your application's dependencies are all hosted in remote locations like HDFS or HTTP servers, they may be referred to
@@ -1418,7 +1438,8 @@ See the [configuration page](configuration.html) for information on Spark config
 spark.kubernetes.driver.service.ipFamilyPolicy
 SingleStack
-K8s IP Family Policy for Driver Service.
+K8s IP Family Policy for Driver Service. Valid values are
+SingleStack, PreferDualStack, and RequireDualStack.
 3.4.0
@@ -1426,7 +1447,8 @@ See the [configuration page](configuration.html) for information on Spark config
 spark.kubernetes.driver.service.ipFamilies
 IPv4
-A list of IP families for K8s Driver Service.
+A list of IP families for K8s Driver Service. Valid values are
+IPv4 and IPv6.
 3.4.0
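The documentation snippet above elides most of the submission command. A fuller, hypothetical invocation for an IPv6-only cluster might look like the following; this is a config sketch, not from the commit — the master URL, image name, and application jar are placeholders, and only the `ipFamilies` setting and the two preference flags come from the added docs:

```shell
# Hypothetical spark-submit against an IPv6-only K8s cluster.
# Placeholders: master URL, container image, example application.
./bin/spark-submit \
  --master k8s://https://[2001:db8::1]:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-spark:3.4.0 \
  --conf spark.kubernetes.driver.service.ipFamilies=IPv6 \
  --conf spark.driver.extraJavaOptions=-Djava.net.preferIPv6Addresses=true \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples.jar
```

Per the added documentation, `java.net.preferIPv6Addresses=true` (JVM) and `SPARK_PREFER_IPV6=true` (Python) are only stated to be needed in a `DualStack` environment, so whether they are required depends on the cluster's network setup.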
[spark] branch master updated: [SPARK-41008][MLLIB] Follow-up isotonic regression features deduplica…
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f92c827acab [SPARK-41008][MLLIB] Follow-up isotonic regression features deduplica…

f92c827acab is described below

commit f92c827acabccf547d5a1dff4f7ec371bc370230
Author: Ahmed Mahran
AuthorDate: Sun Dec 11 15:01:15 2022 -0600

[SPARK-41008][MLLIB] Follow-up isotonic regression features deduplica…

### What changes were proposed in this pull request?
A follow-up on https://github.com/apache/spark/pull/38966 to update the relevant documentation and remove a redundant sort key.

### Why are the changes needed?
For isotonic regression, a new method for breaking ties of repeated features was introduced in https://github.com/apache/spark/pull/38966. It aggregates points having the same feature value by computing the weighted average of the labels.
- This only requires points to be sorted by features instead of by features and labels. So, we should remove label as a secondary sorting key.
- The isotonic regression documentation needs to be updated to reflect the new behavior.

### Does this PR introduce _any_ user-facing change?
Isotonic regression documentation update. The documentation described the behavior of the algorithm when there are points in the input with repeated features. Since this behavior has changed, the documentation needs to describe the new behavior.

### How was this patch tested?
Existing tests passed. No need to add new tests since existing tests are already comprehensive.

srowen

Closes #38996 from ahmed-mahran/ml-isotonic-reg-dups-follow-up.

Authored-by: Ahmed Mahran
Signed-off-by: Sean Owen
---
 docs/mllib-isotonic-regression.md                  | 18 ++---
 .../mllib/regression/IsotonicRegression.scala      | 82 ++
 .../mllib/regression/IsotonicRegressionSuite.scala | 29 +---
 3 files changed, 67 insertions(+), 62 deletions(-)

diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md
index 95be32a819e..711e828bd80 100644
--- a/docs/mllib-isotonic-regression.md
+++ b/docs/mllib-isotonic-regression.md
@@ -43,7 +43,14 @@ best fitting the original data points.
 which uses an approach to
 [parallelizing isotonic regression](https://doi.org/10.1007/978-3-642-99789-1_10).

 The training input is an RDD of tuples of three double values that represent
-label, feature and weight in this order. Additionally, IsotonicRegression algorithm has one
+label, feature and weight in this order. In case there are multiple tuples with
+the same feature then these tuples are aggregated into a single tuple as follows:
+
+* Aggregated label is the weighted average of all labels.
+* Aggregated feature is the unique feature value.
+* Aggregated weight is the sum of all weights.
+
+Additionally, IsotonicRegression algorithm has one
 optional parameter called $isotonic$ defaulting to true.
 This argument specifies if the isotonic regression is isotonic (monotonically increasing) or antitonic (monotonically decreasing).
@@ -53,17 +60,12 @@ labels for both known and unknown features. The result of isotonic regression
 is treated as piecewise linear function. The rules for prediction therefore are:

 * If the prediction input exactly matches a training feature
-  then associated prediction is returned. In case there are multiple predictions with the same
-  feature then one of them is returned. Which one is undefined
-  (same as java.util.Arrays.binarySearch).
+  then associated prediction is returned.
 * If the prediction input is lower or higher than all training features
   then prediction with lowest or highest feature is returned respectively.
-  In case there are multiple predictions with the same feature
-  then the lowest or highest is returned respectively.
 * If the prediction input falls between two training features
   then prediction is treated as piecewise linear function and interpolated value is calculated from the
-  predictions of the two closest features. In case there are multiple values
-  with the same feature then the same rules as in previous point are used.
+  predictions of the two closest features.

 ### Examples

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala b/mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala
index 0b2bf147501..fbf0dc9c357 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala
@@ -23,7 +23,6 @@ import java.util.Arrays.binarySearch

 import scala.collection.JavaConverters._
 import scala.collection.mutable.ArrayBuffer
-import
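The aggregation rule the documentation now describes can be sketched in pure Python (illustrative only; the real implementation is Scala in mllib's IsotonicRegression). Tuples `(label, feature, weight)` that share a feature are merged into a single tuple whose label is the weighted average of the labels and whose weight is the sum of the weights:

```python
# Pure-Python sketch of the duplicate-feature aggregation described in the
# updated docs: identical features are merged, keeping the weighted average
# of labels and the summed weight. Helper name is hypothetical.
from collections import defaultdict

def aggregate_duplicates(points):
    # feature -> [sum(label * weight), sum(weight)]
    acc = defaultdict(lambda: [0.0, 0.0])
    for label, feature, weight in points:
        acc[feature][0] += label * weight
        acc[feature][1] += weight
    # Emit (label, feature, weight) tuples sorted by feature only --
    # after aggregation each feature is unique, so no secondary sort
    # key on the label is needed (the redundant key this PR removes).
    return sorted(
        ((lw / w, f, w) for f, (lw, w) in acc.items()),
        key=lambda t: t[1],
    )

points = [(1.0, 2.0, 1.0), (3.0, 2.0, 3.0), (5.0, 4.0, 1.0)]
print(aggregate_duplicates(points))  # [(2.5, 2.0, 4.0), (5.0, 4.0, 1.0)]
```

For feature 2.0 the merged label is (1.0·1.0 + 3.0·3.0) / (1.0 + 3.0) = 2.5 with weight 4.0, which also shows why prediction no longer needs tie-breaking rules at duplicate features.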