Re: [PR] [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` [spark]

2024-02-16 Thread via GitHub


HeartSaVioR commented on PR #45141:
URL: https://github.com/apache/spark/pull/45141#issuecomment-1949492132

   Late +1.





Re: [PR] [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45145:
URL: https://github.com/apache/spark/pull/45145#issuecomment-1949485755

   BTW, did you upload the screenshot? The link seems to be broken, 
@gengliangwang .
   
   ![Screenshot 2024-02-16 at 15 35 
57](https://github.com/apache/spark/assets/9700541/1183e3a8-9e37-4356-a191-0e14fd58ebd3)
   





Re: [PR] [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on code in PR #45145:
URL: https://github.com/apache/spark/pull/45145#discussion_r1493073011


##
docs/sql-data-sources-parquet.md:
##
@@ -616,14 +616,15 @@ Configuration of Parquet can be done via `spark.conf.set` 
or by running
   3.3.0
 
 
-  spark.sql.parquet.timestampNTZ.enabled

Review Comment:
   So, previously this was a no-op because Spark has no configuration with 
this name, @gengliangwang ?






Re: [PR] [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option [spark]

2024-02-16 Thread via GitHub


gengliangwang commented on code in PR #45145:
URL: https://github.com/apache/spark/pull/45145#discussion_r1493073339


##
docs/sql-data-sources-parquet.md:
##
@@ -616,14 +616,15 @@ Configuration of Parquet can be done via `spark.conf.set` 
or by running
   3.3.0
 
 
-  spark.sql.parquet.timestampNTZ.enabled

Review Comment:
   Yes






Re: [PR] [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on code in PR #45145:
URL: https://github.com/apache/spark/pull/45145#discussion_r1493073011


##
docs/sql-data-sources-parquet.md:
##
@@ -616,14 +616,15 @@ Configuration of Parquet can be done via `spark.conf.set` 
or by running
   3.3.0
 
 
-  spark.sql.parquet.timestampNTZ.enabled

Review Comment:
   So, previously this was a no-op, @gengliangwang ?






[PR] [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option [spark]

2024-02-16 Thread via GitHub


gengliangwang opened a new pull request, #45145:
URL: https://github.com/apache/spark/pull/45145

   
   
   ### What changes were proposed in this pull request?
   
   This is a follow-up of https://github.com/apache/spark/pull/39856. The 
configuration changes should be reflected in the Parquet data source doc.
   
   ### Why are the changes needed?
   
   
   To fix the documentation.
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Preview:
   ![Uploading image.png…]()
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   





Re: [PR] [SPARK-47077][BUILD][TEST] Fix sbt build Revert [SPARK-44445][BUILD][TESTS] [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45144:
URL: https://github.com/apache/spark/pull/45144#issuecomment-1949478931

   How can I help you, @holdenk ?





Re: [PR] [SPARK-47077][BUILD][TEST] Fix sbt build Revert [SPARK-44445][BUILD][TESTS] [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45144:
URL: https://github.com/apache/spark/pull/45144#issuecomment-1949475905

   FYI, all GitHub Actions jobs work fine without any compilation failure like 
that. The following is the result from my environment.
   ```
   $ java -version
   openjdk version "17.0.10" 2024-01-16 LTS
   OpenJDK Runtime Environment AppleJDK-17.0.10.7.1 (build 17.0.10+7-LTS)
   OpenJDK 64-Bit Server VM AppleJDK-17.0.10.7.1 (build 17.0.10+7-LTS, mixed 
mode, sharing)
   
   $ build/sbt
   sbt:spark-parent> clean ;compile;catalyst/testOnly 
org.apache.spark.sql.catalyst.optimizer.FilterPushdownSuite
   ...
   [info] FilterPushdownSuite:
   [info] - eliminate subqueries (21 milliseconds)
   [info] - simple push down (12 milliseconds)
   [info] - combine redundant filters (12 milliseconds)
   [info] - do not combine non-deterministic filters even if they are identical 
(10 milliseconds)
   [info] - SPARK-16164: Filter pushdown should keep the ordering in the 
logical plan (7 milliseconds)
   [info] - SPARK-16994: filter should not be pushed through limit (3 
milliseconds)
   [info] - can't push without rewrite (11 milliseconds)
   [info] - nondeterministic: can always push down filter through project with 
deterministic field (11 milliseconds)
   [info] - nondeterministic: can't push down filter through project with 
nondeterministic field (4 milliseconds)
   [info] - nondeterministic: can't push down filter through aggregate with 
nondeterministic field (6 milliseconds)
   [info] - nondeterministic: push down part of filter through aggregate with 
deterministic field (9 milliseconds)
   [info] - filters: combines filters (5 milliseconds)
   [info] - joins: push to either side (8 milliseconds)
   [info] - joins: push to one side (6 milliseconds)
   [info] - joins: do not push down non-deterministic filters into join 
condition (2 milliseconds)
   [info] - joins: push to one side after transformCondition (6 milliseconds)
   [info] - joins: rewrite filter to push to either side (4 milliseconds)
   [info] - joins: push down left semi join (6 milliseconds)
   [info] - joins: push down left outer join #1 (6 milliseconds)
   [info] - joins: push down right outer join #1 (6 milliseconds)
   [info] - joins: push down left outer join #2 (5 milliseconds)
   [info] - joins: push down right outer join #2 (6 milliseconds)
   [info] - joins: push down left outer join #3 (6 milliseconds)
   [info] - joins: push down right outer join #3 (13 milliseconds)
   [info] - joins: push down left outer join #4 (6 milliseconds)
   [info] - joins: push down right outer join #4 (10 milliseconds)
   [info] - joins: push down left outer join #5 (6 milliseconds)
   [info] - joins: push down right outer join #5 (9 milliseconds)
   [info] - joins: can't push down (4 milliseconds)
   [info] - joins: conjunctive predicates (8 milliseconds)
   [info] - joins: conjunctive predicates #2 (3 milliseconds)
   [info] - joins: conjunctive predicates #3 (7 milliseconds)
   [info] - joins: push down where clause into left anti join (5 milliseconds)
   [info] - joins: only push down join conditions to the right of a left anti 
join (4 milliseconds)
   [info] - joins: only push down join conditions to the right of an existence 
join (4 milliseconds)
   [info] - generate: predicate referenced no generated column (9 milliseconds)
   [info] - generate: non-deterministic predicate referenced no generated 
column (7 milliseconds)
   [info] - generate: part of conjuncts referenced generated column (6 
milliseconds)
   [info] - generate: all conjuncts referenced generated column (2 milliseconds)
   [info] - aggregate: push down filter when filter on group by expression (7 
milliseconds)
   [info] - aggregate: don't push down filter when filter not on group by 
expression (5 milliseconds)
   [info] - aggregate: push down filters partially which are subset of group by 
expressions (8 milliseconds)
   [info] - aggregate: push down filters with alias (5 milliseconds)
   [info] - aggregate: push down filters with literal (9 milliseconds)
   [info] - aggregate: don't push down filters that are nondeterministic (7 
milliseconds)
   [info] - SPARK-17712: aggregate: don't push down filters that are 
data-independent (6 milliseconds)
   [info] - aggregate: don't push filters if the aggregate has no grouping 
expressions (3 milliseconds)
   [info] - SPARK-32940: aggregate: push filters through first, last and 
collect (17 milliseconds)
   [info] - union (11 milliseconds)
   [info] - expand (19 milliseconds)
   [info] - predicate subquery: push down simple (15 milliseconds)
   [info] - predicate subquery: push down complex (8 milliseconds)
   [info] - SPARK-20094: don't push predicate with IN subquery into join 
condition (11 milliseconds)
   [info] - Window: predicate push down -- basic (24 milliseconds)
   [info] - Window: predicate push down -- predicates with compound predicate 
using only one column (5 milliseconds)
   [info] - Window: predicate push down -- multi window

Re: [PR] [SPARK-47077][BUILD][TEST] Fix sbt build Revert [SPARK-44445][BUILD][TESTS] [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45144:
URL: https://github.com/apache/spark/pull/45144#issuecomment-1949469508

   From the following PR description, Apache Spark 4.0 only supports Java 17+, 
doesn't it?
   
   > Building with sbt & JDK11 or 17 





[PR] [SPARK-47077][BUILD][TEST] Fix sbt build Revert [SPARK-44445][BUILD][TESTS] [spark]

2024-02-16 Thread via GitHub


holdenk opened a new pull request, #45144:
URL: https://github.com/apache/spark/pull/45144

   
   
   ### What changes were proposed in this pull request?
   
   This reverts commit f5b1e37c9e6a4ec2fd897f97cd4526415e6c0e49.
   
   
   
   ### Why are the changes needed?
   Building with sbt & JDK11 or 17 (executed after reload & clean 
;compile;catalyst/testOnly 
org.apache.spark.sql.catalyst.optimizer.FilterPushdownSuite) results in
   
   
   
   
    
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala:20:8:
 object WebDriver is not a member of package org.openqa.selenium
   [error] import org.openqa.selenium.WebDriver
   [error]        ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala:33:27:
 not found: type WebDriver
   [error]   override var webDriver: WebDriver = _
   [error]                           ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala:37:29:
 Class org.openqa.selenium.remote.AbstractDriverOptions not found - continuing 
with a stub.
   [error]     val chromeOptions = new ChromeOptions
   [error]                             ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/RealBrowserUIHistoryServerSuite.scala:24:8:
 object WebDriver is not a member of package org.openqa.selenium
   [error] import org.openqa.selenium.WebDriver
   [error]        ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/RealBrowserUIHistoryServerSuite.scala:43:27:
 not found: type WebDriver
   [error]   implicit var webDriver: WebDriver
   [error]                           ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala:39:21:
 Class org.openqa.selenium.remote.RemoteWebDriver not found - continuing with a 
stub.
   [error]     webDriver = new ChromeDriver(chromeOptions)
   [error]                     ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala:20:28:
 Unused import
   [error] Applicable -Wconf / @nowarn filters for this fatal warning: 
msg=, cat=unused-imports, 
site=org.apache.spark.deploy.history
   [error] import org.openqa.selenium.WebDriver
   [error]                            ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:36:8:
 object WebDriver is not a member of package org.openqa.selenium
   [error] import org.openqa.selenium.WebDriver
   [error]        ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:383:29:
 not found: type WebDriver
   [error]     implicit val webDriver: WebDriver = new HtmlUnitDriver
   [error]                             ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:37:8:
 Class org.openqa.selenium.WebDriver not found - continuing with a stub.
   [error] import org.openqa.selenium.htmlunit.HtmlUnitDriver
   [error]        ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:383:45:
 Class org.openqa.selenium.Capabilities not found - continuing with a stub.
   [error]     implicit val webDriver: WebDriver = new HtmlUnitDriver
   [error]                                             ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:470:9:
 Symbol 'type org.openqa.selenium.WebDriver' is missing from the classpath.
   [error] This symbol is required by 'value 
org.scalatestplus.selenium.WebBrowser.go.driver'.
   [error] Make sure that type WebDriver is in your classpath and check for 
conflicting dependencies with `-Ylog-classpath`.
   [error] A full rebuild may help if 'WebBrowser.class' was compiled against 
an incompatible version of org.openqa.selenium.
   [error]         go to target.toExternalForm
   [error]         ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:470:12:
 could not find implicit value for parameter driver: 
org.openqa.selenium.WebDriver
   [error]         go to target.toExternalForm
   [error]            ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:36:28:
 Unused import
   [error] Applicable -Wconf / @nowarn filters for this fatal warning: 
msg=, cat=unused-imports, 
site=org.apache.spark.deploy.history
   [error] import org.openqa.selenium.WebDriver
   [error]                            ^
   [error] 
/home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/RealBrowserUIHistoryServerSuite.scala:24:28:
 Unused import
   [error] A

Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]

2024-02-16 Thread via GitHub


xinrong-meng commented on code in PR #45129:
URL: https://github.com/apache/spark/pull/45129#discussion_r1493050589


##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
 with self._lock:
 value = self._accumulator.value
 return value if value is not None else {}
+
+
+class Profile:
+"""User-facing profile API. This instance can be accessed by

Review Comment:
   Good point! Would you mind if I file a follow-up PR for documentation? We 
might also want to edit the PySpark debugging guide together.






Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]

2024-02-16 Thread via GitHub


xinrong-meng commented on code in PR #45129:
URL: https://github.com/apache/spark/pull/45129#discussion_r1493049117


##
python/pyspark/sql/session.py:
##
@@ -906,6 +907,12 @@ def dataSource(self) -> "DataSourceRegistration":
 
 return DataSourceRegistration(self)
 
+@property
+def profile(self) -> "Profile":

Review Comment:
   I planned to make a follow-up PR for Connect. But now I think I'll just add 
Connect changes to this PR.






Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]

2024-02-16 Thread via GitHub


xinrong-meng commented on code in PR #45129:
URL: https://github.com/apache/spark/pull/45129#discussion_r1493048620


##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
 with self._lock:
 value = self._accumulator.value
 return value if value is not None else {}
+
+
+class Profile:
+"""User-facing profile API. This instance can be accessed by
+:attr:`spark.profile`.
+
+.. versionadded: 4.0.0
+"""
+
+def __init__(self, sparkSession: "SparkSession"):
+self.sparkSession = sparkSession

Review Comment:
   Yes after removing the old APIs.






Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]

2024-02-16 Thread via GitHub


xinrong-meng commented on code in PR #45129:
URL: https://github.com/apache/spark/pull/45129#discussion_r1493048037


##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
 with self._lock:
 value = self._accumulator.value
 return value if value is not None else {}
+
+
+class Profile:
+"""User-facing profile API. This instance can be accessed by
+:attr:`spark.profile`.
+
+.. versionadded: 4.0.0
+"""
+
+def __init__(self, sparkSession: "SparkSession"):
+self.sparkSession = sparkSession
+
+def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> 
None:
+"""
+Show the profile results.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+type : str, optional
+The profiler type, which can be either "perf" or "memory".
+id : int, optional
+A UDF ID to be shown. If not specified, all the results will be 
shown.
+"""
+if type == "memory":
+self.sparkSession.showMemoryProfiles(id)

Review Comment:
   Sounds good!






Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]

2024-02-16 Thread via GitHub


xinrong-meng commented on code in PR #45129:
URL: https://github.com/apache/spark/pull/45129#discussion_r1493048255


##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
 with self._lock:
 value = self._accumulator.value
 return value if value is not None else {}
+
+
+class Profile:
+"""User-facing profile API. This instance can be accessed by
+:attr:`spark.profile`.
+
+.. versionadded: 4.0.0
+"""
+
+def __init__(self, sparkSession: "SparkSession"):
+self.sparkSession = sparkSession
+
+def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> 
None:

Review Comment:
   Your suggestion makes more sense. Thanks!



##
python/pyspark/sql/profiler.py:
##
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
 with self._lock:
 value = self._accumulator.value
 return value if value is not None else {}
+
+
+class Profile:
+"""User-facing profile API. This instance can be accessed by
+:attr:`spark.profile`.
+
+.. versionadded: 4.0.0
+"""
+
+def __init__(self, sparkSession: "SparkSession"):
+self.sparkSession = sparkSession
+
+def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> 
None:
+"""
+Show the profile results.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+type : str, optional
+The profiler type, which can be either "perf" or "memory".
+id : int, optional
+A UDF ID to be shown. If not specified, all the results will be 
shown.
+"""
+if type == "memory":
+self.sparkSession.showMemoryProfiles(id)
+elif type == "perf" or type is None:
+self.sparkSession.showPerfProfiles(id)
+if type is None:  # Show both perf and memory profiles
+self.sparkSession.showMemoryProfiles(id)
+else:
+raise PySparkValueError(
+error_class="VALUE_NOT_ALLOWED",
+message_parameters={
+"arg_name": "type",
+"allowed_values": str(["perf", "memory"]),
+},
+)
+
+def dump(self, path: str, *, type: Optional[str] = None, id: Optional[int] 
= None) -> None:

Review Comment:
   Adjusted






Re: [PR] [WIP][SPARK-47032][Python] Prototype for adding pass-through columns to Python UDTF API [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on code in PR #45142:
URL: https://github.com/apache/spark/pull/45142#discussion_r1493041669


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala:
##
@@ -195,13 +198,13 @@ case class PythonUDTF(
 
   override protected def withNewChildrenInternal(newChildren: 
IndexedSeq[Expression]): PythonUDTF =
 copy(children = newChildren)
-}

Review Comment:
   This removal looks like a mistake. Could you make CIs happy?






Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun closed pull request #45143: [SPARK-47076][CORE][TESTS] Fix 
HistoryServerSuite.`incomplete apps get refreshed` test to start with empty 
storeDir
URL: https://github.com/apache/spark/pull/45143





Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45143:
URL: https://github.com/apache/spark/pull/45143#issuecomment-1949422724

   I addressed the comments. Let me merge this and keep monitoring this suite. 
Thank you again.
   
   Merged to master.





Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on code in PR #45143:
URL: https://github.com/apache/spark/pull/45143#discussion_r1493032810


##
core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:
##
@@ -390,6 +390,8 @@ abstract class HistoryServerSuite extends SparkFunSuite 
with BeforeAndAfter with
 // a new conf is used with the background thread set and running at its 
fastest
 // allowed refresh rate (1Hz)
 stop()
+Utils.deleteRecursively(storeDir)
+assert(storeDir.mkdir())

Review Comment:
   Sure. Thank you!






Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]

2024-02-16 Thread via GitHub


viirya commented on code in PR #45143:
URL: https://github.com/apache/spark/pull/45143#discussion_r1493032622


##
core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:
##
@@ -390,6 +390,8 @@ abstract class HistoryServerSuite extends SparkFunSuite 
with BeforeAndAfter with
 // a new conf is used with the background thread set and running at its 
fastest
 // allowed refresh rate (1Hz)
 stop()
+Utils.deleteRecursively(storeDir)
+assert(storeDir.mkdir())

Review Comment:
   Looks okay. Maybe add a comment on top of it explaining why it is needed.
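   
   For illustration, the suggested comment could look like the sketch below; 
the wording is hypothetical, not the text that actually landed in the PR:
   
   ```
   stop()
   // SPARK-47076: delete and recreate the store directory so that this
   // test starts from an empty storeDir instead of inheriting listing data
   // left behind by earlier tests in the suite.
   Utils.deleteRecursively(storeDir)
   assert(storeDir.mkdir())
   ```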






Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45143:
URL: https://github.com/apache/spark/pull/45143#issuecomment-1949403179

   BTW, I'm still looking at this test suite because the failure is not 
reproducible in the local env yet.





Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45143:
URL: https://github.com/apache/spark/pull/45143#issuecomment-1949399210

   Could you review this test case PR, @viirya ?





Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45143:
URL: https://github.com/apache/spark/pull/45143#issuecomment-1949398958

   `core` passed.
   
   ![Screenshot 2024-02-16 at 14 03 
10](https://github.com/apache/spark/assets/9700541/36e7a7de-20e4-4115-aea6-65bc8d100e5b)
   





Re: [PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]

2024-02-16 Thread via GitHub


xinrong-meng commented on PR #45138:
URL: https://github.com/apache/spark/pull/45138#issuecomment-1949325532

   Good to know, thanks @dongjoon-hyun !





[PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun opened a new pull request, #45143:
URL: https://github.com/apache/spark/pull/45143

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45138:
URL: https://github.com/apache/spark/pull/45138#issuecomment-1949306231

   Merged to master for Apache Spark 4.0.0.





Re: [PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun closed pull request #45138: [SPARK-47075][BUILD] Add 
`derby-provided` profile
URL: https://github.com/apache/spark/pull/45138





Re: [PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45138:
URL: https://github.com/apache/spark/pull/45138#issuecomment-1949304069

   Thank you, @xinrong-meng .
   
   We can skip this because we only document `hadoop-provided` so far.
   
   ```
   $ git grep '\-provided' assembly
   assembly/pom.xml:  hadoop-provided
   assembly/pom.xml:  hive-provided
   assembly/pom.xml:  orc-provided
   assembly/pom.xml:  parquet-provided
   ```
   
   ```
   $ git grep hadoop-provided docs/
   docs/building-spark.md:The `hadoop-provided` profile builds the assembly 
without including Hadoop-ecosystem projects,
   docs/index.md:[by augmenting Spark's classpath](hadoop-provided.html).
   
   $ git grep hive-provided docs/
   
   $ git grep orc-provided docs/
   
   $ git grep parquet-provided docs/
   ```





Re: [PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]

2024-02-16 Thread via GitHub


xinrong-meng commented on PR #45138:
URL: https://github.com/apache/spark/pull/45138#issuecomment-1949299783

   LGTM, thanks @dongjoon-hyun . Out of curiosity, do we have documentation on 
those build profiles?





Re: [PR] [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun closed pull request #45137: [SPARK-47074][INFRA] Fix outdated 
comments in GitHub Action scripts
URL: https://github.com/apache/spark/pull/45137





Re: [PR] [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45137:
URL: https://github.com/apache/spark/pull/45137#issuecomment-1949296923

   Thank you, @xinrong-meng ! Merged to master.





Re: [PR] [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts [spark]

2024-02-16 Thread via GitHub


xinrong-meng commented on PR #45137:
URL: https://github.com/apache/spark/pull/45137#issuecomment-1949296302

   LGTM, thank you!





Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]

2024-02-16 Thread via GitHub


sahnib commented on code in PR #44961:
URL: https://github.com/apache/spark/pull/44961#discussion_r1492925832


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala:
##
@@ -131,6 +141,15 @@ trait StateStore extends ReadStateStore {
   def remove(key: UnsafeRow,
 colFamilyName: String = StateStore.DEFAULT_COL_FAMILY_NAME): Unit
 
+  /**
+   * Merges the provided value with existing values of a non-null key.

Review Comment:
   Makes sense, done. 






Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]

2024-02-16 Thread via GitHub


sahnib commented on code in PR #44961:
URL: https://github.com/apache/spark/pull/44961#discussion_r1492924832


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala:
##
@@ -67,6 +67,15 @@ trait ReadStateStore {
   def get(key: UnsafeRow,
 colFamilyName: String = StateStore.DEFAULT_COL_FAMILY_NAME): UnsafeRow
 
+  /**
+   * Provides an iterator containing all values of a non-null key.

Review Comment:
   Sounds good. Done






Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]

2024-02-16 Thread via GitHub


sahnib commented on code in PR #44961:
URL: https://github.com/apache/spark/pull/44961#discussion_r1492922006


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala:
##
@@ -75,6 +76,61 @@ private[sql] class RocksDBStateStoreProvider
   value
 }
 
+/**
+ * Provides an iterator containing all values of a non-null key.
+ *
+ * Inside RocksDB, the values are merged together and stored as a byte 
Array.
+ * This operation relies on state store value encoder to be able to split 
the
+ * single array into multiple values.
+ *
+ * Also see [[MultiValuedStateEncoder]] which supports encoding/decoding 
multiple
+ * values per key.
+ */
+override def valuesIterator(key: UnsafeRow, colFamilyName: String): 
Iterator[UnsafeRow] = {
+  verify(key != null, "Key cannot be null")
+
+  val kvEncoder = keyValueEncoderMap.get(colFamilyName)
+  val valueEncoder = kvEncoder._2
+  val keyEncoder = kvEncoder._1
+
+  verify(valueEncoder.supportsMultipleValuesPerKey, "valuesIterator 
requires a encoder " +
+  "that supports multiple values for a single key.")
+  val encodedKey = rocksDB.get(keyEncoder.encodeKey(key), colFamilyName)
+  val valueIterator = valueEncoder.decodeValues(encodedKey)
+
+  if (valueIterator.nonEmpty) {
+new Iterator[UnsafeRow] {
+  override def hasNext: Boolean = {
+valueIterator.hasNext
+  }
+
+  override def next(): UnsafeRow = {
+val value = valueIterator.next()
+if (value != null) {
+  StateStoreProvider.validateStateRowFormat(

Review Comment:
   Removed the wrapper iterator; we don't need it anymore.



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala:
##
@@ -75,6 +76,61 @@ private[sql] class RocksDBStateStoreProvider
   value
 }
 
+/**
+ * Provides an iterator containing all values of a non-null key.
+ *
+ * Inside RocksDB, the values are merged together and stored as a byte 
Array.
+ * This operation relies on state store value encoder to be able to split 
the
+ * single array into multiple values.
+ *
+ * Also see [[MultiValuedStateEncoder]] which supports encoding/decoding 
multiple
+ * values per key.
+ */
+override def valuesIterator(key: UnsafeRow, colFamilyName: String): 
Iterator[UnsafeRow] = {
+  verify(key != null, "Key cannot be null")
+
+  val kvEncoder = keyValueEncoderMap.get(colFamilyName)
+  val valueEncoder = kvEncoder._2
+  val keyEncoder = kvEncoder._1
+
+  verify(valueEncoder.supportsMultipleValuesPerKey, "valuesIterator 
requires a encoder " +
+  "that supports multiple values for a single key.")
+  val encodedKey = rocksDB.get(keyEncoder.encodeKey(key), colFamilyName)
+  val valueIterator = valueEncoder.decodeValues(encodedKey)
+
+  if (valueIterator.nonEmpty) {
+new Iterator[UnsafeRow] {
+  override def hasNext: Boolean = {
+valueIterator.hasNext
+  }
+
+  override def next(): UnsafeRow = {
+val value = valueIterator.next()
+if (value != null) {
+  StateStoreProvider.validateStateRowFormat(

Review Comment:
   Removed the wrapper iterator; we don't need it anymore. Thanks for pointing 
this out. 
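   
   As background for the "split the single array into multiple values" note in 
the quoted doc comment, a minimal sketch of such a decoder follows. It assumes 
a simple length-prefixed layout (a 4-byte length before each value); the 
actual `MultiValuedStateEncoder` in the PR may encode values differently:
   
   ```
   import java.nio.ByteBuffer
   
   object MergedValueSplitSketch {
     // Splits one merged byte array into the individual value payloads by
     // reading a 4-byte length prefix before each value.
     def decodeValues(merged: Array[Byte]): Iterator[Array[Byte]] = {
       if (merged == null) {
         Iterator.empty
       } else {
         new Iterator[Array[Byte]] {
           private val buf = ByteBuffer.wrap(merged)
           override def hasNext: Boolean = buf.remaining() > 0
           override def next(): Array[Byte] = {
             val len = buf.getInt()            // length prefix
             val value = new Array[Byte](len)  // payload bytes of this value
             buf.get(value)
             value
           }
         }
       }
     }
   }
   ```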






Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]

2024-02-16 Thread via GitHub


sahnib commented on code in PR #44961:
URL: https://github.com/apache/spark/pull/44961#discussion_r1492920813


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ValueStateImpl.scala:
##
@@ -77,9 +50,7 @@ class ValueStateImpl[S](
   override def getOption(): Option[S] = {
 val retRow = getImpl()

Review Comment:
   Done.






Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]

2024-02-16 Thread via GitHub


sahnib commented on code in PR #44961:
URL: https://github.com/apache/spark/pull/44961#discussion_r1489800175


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StateTypesEncoderUtils.scala:
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming
+
+import org.apache.commons.lang3.SerializationUtils
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
+import org.apache.spark.sql.execution.streaming.state.StateStoreErrors
+import org.apache.spark.sql.types.{BinaryType, StructType}
+
+/**
+ * Helper object providing APIs to encodes the grouping key, and user provided 
values
+ * to Spark [[UnsafeRow]].
+ */
+object StateTypesEncoderUtils {
+
+  private val KEY_ROW_SCHEMA: StructType = new StructType().add("key", 
BinaryType)
+  private val VALUE_ROW_SCHEMA: StructType = new StructType().add("value", 
BinaryType)
+
+  // TODO: validate places that are trying to encode the key and check if we 
can eliminate/
+  // add caching for some of these calls.
+  def encodeGroupingKey(stateName: String, keyExprEnc: 
ExpressionEncoder[Any]): UnsafeRow = {
+val keyOption = ImplicitGroupingKeyTracker.getImplicitKeyOption
+if (keyOption.isEmpty) {
+  throw StateStoreErrors.implicitKeyNotFound(stateName)
+}
+
+val toRow = keyExprEnc.createSerializer()
+val keyByteArr = toRow
+  .apply(keyOption.get).asInstanceOf[UnsafeRow].getBytes()
+
+val keyEncoder = UnsafeProjection.create(KEY_ROW_SCHEMA)

Review Comment:
   We do not anticipate a different schema. Every key and value will be 
encoded to a byte array. While refactoring the key/value encoders to move them 
to the object level, I discovered that `UnsafeProjection` is not thread-safe. 
Hence, we cannot share it across state variables. 
   
   Also, the user might try to do something weird and access state data 
concurrently inside `transformWithState`. I think we should create a 
projection per encoding, as in the sketch below. WDYT? 
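   
   A minimal sketch of that per-call approach, reusing `KEY_ROW_SCHEMA` from 
the quoted diff; everything else here is illustrative, not the PR's final 
code:
   
   ```
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
   import org.apache.spark.sql.types.{BinaryType, StructType}
   
   object KeyEncodingSketch {
     private val KEY_ROW_SCHEMA: StructType = new StructType().add("key", BinaryType)
   
     // UnsafeProjection is mutable and not thread-safe, so a fresh projection
     // is created on every call instead of being cached in a shared object
     // that several state variables (or user threads) could hit concurrently.
     def encodeKeyBytes(keyByteArr: Array[Byte]): UnsafeRow = {
       val keyEncoder = UnsafeProjection.create(KEY_ROW_SCHEMA)
       keyEncoder(InternalRow(keyByteArr))
     }
   }
   ```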






[PR] [WIP][SPARK-47032][Python] Prototype for adding pass-through columns to Python UDTF API [spark]

2024-02-16 Thread via GitHub


dtenedor opened a new pull request, #45142:
URL: https://github.com/apache/spark/pull/45142

   ### What changes were proposed in this pull request?
   
   [WIP] This is a prototype for adding pass-through columns to Python UDTF 
API. We'll develop it more before sending out for formal review.
   
   ### Why are the changes needed?
   
   We can use this API to forward column values from the UDTF input table to 
the output table directly, bypassing JVM/Python interchange and improving 
performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, see above.
   
   ### How was this patch tested?
   
   This PR adds test coverage.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   NO.





Re: [PR] [SPARK-47057][PYTHON] Reeanble MyPy data test [spark]

2024-02-16 Thread via GitHub


ueshin commented on PR #45135:
URL: https://github.com/apache/spark/pull/45135#issuecomment-1949130300

   We should also upgrade the minimum version?
   
   
https://github.com/apache/spark/blob/3901c6e1c63f300066847ae9393a8b424d7bb0c0/dev/lint-python#L21





Re: [PR] [SPARK-47043][BUILD] add `jackson-core` and `jackson-annotations` dependencies to module `spark-common-utils` [spark]

2024-02-16 Thread via GitHub


William1104 commented on PR #45103:
URL: https://github.com/apache/spark/pull/45103#issuecomment-1949062510

   Hi @dongjoon-hyun , 
   
   Thanks for your explanation. Yes, Spark is a large and very dynamic 
community, and it could be challenging to enforce this, especially since some 
modules already have more dependencies than required. I would propose fixing 
dependency issues module by module, and then enforcing it in a systematic way 
(via the Maven dependency plugin). 
   
   Let me send an email to d...@spark.apache.org on this topic. Thanks a lot. 
   
   Best regards, 
   William





Re: [PR] [SPARK-47070] Fix invalid aggregation after in-subquery rewrite [spark]

2024-02-16 Thread via GitHub


agubichev commented on PR #45133:
URL: https://github.com/apache/spark/pull/45133#issuecomment-1949032191

   @jchen5 @cloud-fan 





Re: [PR] [MINOR][DOCS] Show sort order of NaN relative to infinity [spark]

2024-02-16 Thread via GitHub


nchammas commented on PR #45047:
URL: https://github.com/apache/spark/pull/45047#issuecomment-1948964067

   cc @cloud-fan and @peter-toth since we have been discussing NaNs over on 
#45036.





Re: [PR] [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-conect-server-script.sh [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45117:
URL: https://github.com/apache/spark/pull/45117#issuecomment-1948944084

   Thank you, @grundprinzip .
   Merged to master.





Re: [PR] [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-conect-server-script.sh [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun closed pull request #45117: [SPARK-47040][CONNECT][FOLLOWUP] 
Improve usability of the start-conect-server-script.sh
URL: https://github.com/apache/spark/pull/45117





Re: [PR] [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun closed pull request #45141: [SPARK-45357][CONNECT][TESTS][3.5] 
Normalize `dataframeId` when comparing `CollectMetrics` in 
`SparkConnectProtoSuite`
URL: https://github.com/apache/spark/pull/45141





Re: [PR] [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45141:
URL: https://github.com/apache/spark/pull/45141#issuecomment-1948939396

   Merged to branch-3.5.





Re: [PR] [SPARK-47042][BUILD] add missing explicit dependency 'commons-lang3' to the module 'spark-common-utils' [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45101:
URL: https://github.com/apache/spark/pull/45101#issuecomment-1948933031

   Here is a comment about your question.
   - https://github.com/apache/spark/pull/45103#issuecomment-1948928099





Re: [PR] [SPARK-47043][BUILD] add `jackson-core` and `jackson-annotations` dependencies to module `spark-common-utils` [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45103:
URL: https://github.com/apache/spark/pull/45103#issuecomment-1948928099

   Unlike some other communities such as Apache ORC, the Apache Spark community doesn't take advantage of `maven-dependency-plugin` to enforce the `Used undeclared`, `Unused declared`, and `Non-test scoped test only` dependency checks. The main reason is that it would be quite inconvenient in a community as large and dynamic as Apache Spark's.
   
   Apache ORC Example:
   
https://github.com/apache/orc/blob/0a02e4cde165b81fb93fc99456a130da4625ef30/java/core/pom.xml#L160-L172
   
   In addition, if you want to propose it, it should be applied systematically across all modules, as the Apache ORC community does, rather than through many small PRs like this PR and #45101 .
   
   However, if you have a concern about this area and want to contribute to it, I'd recommend sending an email to d...@spark.apache.org . We can discuss it further and build a consensus on the necessity of your proposal, @William1104 . You can include your PRs' links in your email.





Re: [PR] [SPARK-47073][BUILD] Upgrade several Maven plugins to the latest versions [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45136:
URL: https://github.com/apache/spark/pull/45136#issuecomment-1948876938

   Thank you, @LuciferYang !





Re: [PR] [SPARK-47043][BUILD] add `jackson-core` and `jackson-annotations` dependencies to module `spark-common-utils` [spark]

2024-02-16 Thread via GitHub


William1104 commented on PR #45103:
URL: https://github.com/apache/spark/pull/45103#issuecomment-1948854042

   Hi @dongjoon-hyun,
   
   Thank you for providing the script. If I understand correctly, the script is 
designed to check if any dependencies have been updated in an unexpected manner 
for every profile. This functionality is extremely valuable as it helps us 
avoid making careless mistakes and provides the reviewer with a clearer 
understanding of how dependencies will be updated.
   
   It appears that several Spark modules are using libraries without proper 
declaration, or they have declared certain libraries but are not actually using 
them. To generate a report highlighting these issues, I ran the following 
command:
   
   ```
   ./build/mvn dependency:analyze | sed -n '/<<< dependency:.*:analyze/,/>>> dependency:.*:analyze/p' > dependency-analyze
   ```
   
   In the "dependency-analyze" report, let's take the module 
"spark-common-utils" as an example. It currently has a compile scope dependency 
on "commons-io," which could be changed to the test scope to avoid unnecessary 
transitive dependencies. Here are the relevant excerpts from the report:
   
   ```
   [INFO] --- dependency:3.6.1:analyze (default-cli) @ spark-common-utils_2.13 ---
   [WARNING] Used undeclared dependencies found:
   [WARNING]    com.fasterxml.jackson.core:jackson-annotations:jar:2.16.1:compile
   [WARNING]    org.apache.commons:commons-lang3:jar:3.14.0:compile
   [WARNING]    com.fasterxml.jackson.core:jackson-core:jar:2.16.1:compile
   [WARNING]    org.scala-lang:scala-library:jar:2.13.12:compile
   [WARNING]    org.scalatest:scalatest-funsuite_2.13:jar:3.2.17:test
   [WARNING]    org.scalactic:scalactic_2.13:jar:3.2.17:test
   [WARNING]    org.scalatest:scalatest-compatible:jar:3.2.17:test
   [WARNING]    org.scalatest:scalatest-core_2.13:jar:3.2.17:test
   [WARNING] Unused declared dependencies found:
   [WARNING]    com.fasterxml.jackson.module:jackson-module-scala_2.13:jar:2.16.1:compile
   [WARNING]    oro:oro:jar:2.0.8:compile
   [WARNING]    org.slf4j:jul-to-slf4j:jar:2.0.11:compile
   [WARNING]    org.slf4j:jcl-over-slf4j:jar:2.0.11:compile
   [WARNING]    org.apache.logging.log4j:log4j-slf4j2-impl:jar:2.22.1:compile
   [WARNING]    org.apache.logging.log4j:log4j-1.2-api:jar:2.22.1:compile
   [WARNING]    org.spark-project.spark:unused:jar:1.0.0:compile
   [WARNING]    org.scalatest:scalatest_2.13:jar:3.2.17:test
   [WARNING]    org.scalatestplus:scalacheck-1-17_2.13:jar:3.2.17.0:test
   [WARNING]    org.scalatestplus:mockito-4-11_2.13:jar:3.2.17.0:test
   [WARNING]    org.scalatestplus:selenium-4-12_2.13:jar:3.2.17.0:test
   [WARNING]    org.junit.jupiter:junit-jupiter:jar:5.9.3:test
   [WARNING]    net.aichler:jupiter-interface:jar:0.11.1:test
   [WARNING] Non-test scoped test only dependencies found:
   [WARNING]    commons-io:commons-io:jar:2.15.1:compile
   [INFO]
   ```
   This information suggests that there are both used undeclared dependencies 
and unused declared dependencies. Additionally, there are non-test scoped 
test-only dependencies, such as "commons-io:commons-io:jar:2.15.1:compile." 
These findings can help us identify areas where we can optimize and refine the 
dependency management within the "spark-common-utils" module.
   
   I would like to create a PR to fix the dependency scopes.
   
   Thanks and regards, 
   William





Re: [PR] [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` [spark]

2024-02-16 Thread via GitHub


LuciferYang commented on PR #45141:
URL: https://github.com/apache/spark/pull/45141#issuecomment-1948673492

   cc @HeartSaVioR @srowen @amaliujia 
   
   Actually, I couldn't reproduce this issue locally: when I use Maven to test the branch-3.5 `connect` module, the test suites execute in the order `SparkConnectStreamingQueryCacheSuite`, `ExecuteEventsManagerSuite`, `SparkConnectProtoSuite`..., and there are no `DataFrame` instances in `SparkConnectStreamingQueryCacheSuite` or `ExecuteEventsManagerSuite`. Therefore, `sparkTestRelation` in `SparkConnectProtoSuite` is still the first `DataFrame` to be initialized, so its `id` is 0, which bypasses this issue.
   
   However, we can use a small trick to reproduce the failure. For example, change
   
   
https://github.com/apache/spark/blob/1c1c5faa29dc649faf143fe2eea39ccf15862f85/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala#L90-L94
   
   to
   
   ```scala
   test("Basic select") {
     val connectPlan = connectTestRelation.select("id".protoAttr)
     val sparkPlan = sparkTestRelation2.select("id")
     comparePlans(connectPlan, sparkPlan)
   }
   ```
   
   In this way, `sparkTestRelation` will definitely not be the first 
`DataFrame` to be initialized, and the test failure can be reproduced, but 
`Basic select` will still pass the test.
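   
   For context on why the id drifts, here is a simplified sketch of the mechanism (illustrative only, not the actual Spark source): every `Dataset` draws its `id` from a JVM-global counter, so the value recorded in `CollectMetrics` depends on how many `DataFrame`s earlier suites already created.
   
   ```scala
   import java.util.concurrent.atomic.AtomicLong
   
   // Simplified: ids are handed out JVM-wide in creation order.
   object Counter { val curId = new AtomicLong() }
   class Dataset { val id: Long = Counter.curId.getAndIncrement() }
   // If another suite builds a Dataset first, sparkTestRelation no longer gets id 0.
   ```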





[PR] [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` [spark]

2024-02-16 Thread via GitHub


LuciferYang opened a new pull request, #45141:
URL: https://github.com/apache/spark/pull/45141

   ### What changes were proposed in this pull request?
   This PR adds a new function `normalizeDataframeId` that sets the `dataframeId` of `CollectMetrics` to the constant 0 before comparing `LogicalPlan`s in the test cases of `SparkConnectProtoSuite`.
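   
   A minimal sketch of the idea (illustrative, not the exact code in this PR):
   
   ```scala
   import org.apache.spark.sql.catalyst.plans.logical.{CollectMetrics, LogicalPlan}
   
   // Zero out the per-DataFrame id so that plan comparison ignores it.
   def normalizeDataframeId(plan: LogicalPlan): LogicalPlan =
     plan transform { case cm: CollectMetrics => cm.copy(dataframeId = 0) }
   ```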
   
   ### Why are the changes needed?
   The test scenarios in `SparkConnectProtoSuite` do not need to compare the `dataframeId` in `CollectMetrics`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   - Manually checked
   
   run 
   
   ```
   build/mvn clean install -pl connector/connect/server -am -DskipTests
   build/mvn test -pl connector/connect/server 
   ```
   
   **Before**
   
   ```
   - Test observe *** FAILED ***
     == FAIL: Plans do not match ===
     !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 53
      +- LocalRelation , [id#0, name#0]    +- LocalRelation , [id#0, name#0]
     (PlanTest.scala:179)
   ```
   
   **After**
   
   ```
   Run completed in 41 seconds, 631 milliseconds.
   Total number of tests run: 882
   Suites: completed 24, aborted 0
   Tests: succeeded 882, failed 0, canceled 0, ignored 0, pending 0
   All tests passed.
   ```
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No





Re: [PR] [SPARK-47072][SQL] Fix supported interval formats in error messages [spark]

2024-02-16 Thread via GitHub


MaxGekk commented on PR #45127:
URL: https://github.com/apache/spark/pull/45127#issuecomment-1948471185

   Here are the backports:
   - 3.4 https://github.com/apache/spark/pull/45140
   - 3.5 https://github.com/apache/spark/pull/45139





[PR] [SPARK-47072][SQL][3.4] Fix supported interval formats in error messages [spark]

2024-02-16 Thread via GitHub


MaxGekk opened a new pull request, #45140:
URL: https://github.com/apache/spark/pull/45140

   ### What changes were proposed in this pull request?
   In the PR, I propose to add one more field to the keys of `supportedFormat` in `IntervalUtils` because the current implementation has duplicate keys that overwrite each other. For instance, the following keys are the same:
   ```
   (YM.YEAR, YM.MONTH)
   ...
   (DT.DAY, DT.HOUR)
   ```
   because `YM.YEAR = DT.DAY = 0` and `YM.MONTH = DT.HOUR = 1`.
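   
   A self-contained Scala sketch of the collision (with illustrative field codes and format strings):
   
   ```scala
   // Both tuple keys evaluate to (0, 1), so the second Map entry silently
   // overwrites the first: year-month lookups then return day-time formats.
   val YEAR: Byte = 0; val MONTH: Byte = 1  // year-month field codes
   val DAY: Byte = 0; val HOUR: Byte = 1    // day-time field codes
   
   val supportedFormat = Map(
     (YEAR, MONTH) -> Seq("[+|-]y-m", "INTERVAL [+|-]'[+|-]y-m' YEAR TO MONTH"),
     (DAY, HOUR) -> Seq("[+|-]d h", "INTERVAL [+|-]'[+|-]d h' DAY TO HOUR")
   )
   assert(supportedFormat.size == 1)  // the year-month entry was overwritten
   ```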
   
   This is a backport of https://github.com/apache/spark/pull/45127.
   
   ### Why are the changes needed?
   To fix the incorrect error message when Spark cannot parse an ANSI interval string. For example, the expected format should be a year-month format, but Spark outputs a day-time one:
   ```sql
   spark-sql (default)> select interval '-\t2-2\t' year to month;
   
   Interval string does not match year-month format of `[+|-]d h`, `INTERVAL [+|-]'[+|-]d h' DAY TO HOUR` when cast to interval year to month: -2-2
   . (line 1, pos 16)
   
   == SQL ==
   select interval '-\t2-2\t' year to month
   ^^^
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   
   ### How was this patch tested?
   By running the existing test suite:
   ```
   $ build/sbt "test:testOnly *IntervalUtilsSuite"
   ```
   and regenerating the golden files:
   ```
   $ SPARK_GENERATE_GOLDEN_FILES=1 PYSPARK_PYTHON=python3 build/sbt 
"sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   
   Authored-by: Max Gekk 
   (cherry picked from commit 074fcf2807000d342831379de0fafc1e49a6bf19)





Re: [PR] [SPARK-47073][BUILD] Upgrade several Maven plugins to the latest versions [spark]

2024-02-16 Thread via GitHub


LuciferYang commented on PR #45136:
URL: https://github.com/apache/spark/pull/45136#issuecomment-1948455390

   Merged into master for Spark 4.0. Thanks @dongjoon-hyun ~





Re: [PR] [SPARK-47073][BUILD] Upgrade several Maven plugins to the latest versions [spark]

2024-02-16 Thread via GitHub


LuciferYang closed pull request #45136: [SPARK-47073][BUILD] Upgrade several 
Maven plugins to the latest versions
URL: https://github.com/apache/spark/pull/45136





Re: [PR] [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-connect-server.sh script [spark]

2024-02-16 Thread via GitHub


grundprinzip commented on code in PR #45117:
URL: https://github.com/apache/spark/pull/45117#discussion_r1492430978


##
sbin/start-connect-server.sh:
##
@@ -40,8 +40,9 @@ fi
 
 if [ "$1" == "--wait" ]; then
   shift
-  exec "${SPARK_HOME}"/bin/spark-submit --class $CLASS 1 --name "Spark Connect Server" "$@"
-else
-  exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@"
+  export SPARK_NO_DAEMONIZE=1
 fi
+  exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@"

Review Comment:
   Done.






[PR] [SPARK-47072][SQL][3.5] Fix supported interval formats in error messages [spark]

2024-02-16 Thread via GitHub


MaxGekk opened a new pull request, #45139:
URL: https://github.com/apache/spark/pull/45139

   ### What changes were proposed in this pull request?
   In the PR, I propose to add one more field to the keys of `supportedFormat` in `IntervalUtils` because the current implementation has duplicate keys that overwrite each other. For instance, the following keys are the same:
   ```
   (YM.YEAR, YM.MONTH)
   ...
   (DT.DAY, DT.HOUR)
   ```
   because `YM.YEAR = DT.DAY = 0` and `YM.MONTH = DT.HOUR = 1`.
   
   This is a backport of https://github.com/apache/spark/pull/45127.
   
   ### Why are the changes needed?
   To fix the incorrect error message when Spark cannot parse an ANSI interval string. For example, the expected format should be a year-month format, but Spark outputs a day-time one:
   ```sql
   spark-sql (default)> select interval '-\t2-2\t' year to month;
   
   Interval string does not match year-month format of `[+|-]d h`, `INTERVAL [+|-]'[+|-]d h' DAY TO HOUR` when cast to interval year to month: -2-2
   . (line 1, pos 16)
   
   == SQL ==
   select interval '-\t2-2\t' year to month
   ^^^
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   
   ### How was this patch tested?
   By running the existing test suite:
   ```
   $ build/sbt "test:testOnly *IntervalUtilsSuite"
   ```
   and regenerating the golden files:
   ```
   $ SPARK_GENERATE_GOLDEN_FILES=1 PYSPARK_PYTHON=python3 build/sbt 
"sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   
   Authored-by: Max Gekk 
   (cherry picked from commit 074fcf2807000d342831379de0fafc1e49a6bf19)





Re: [PR] [SPARK-47072][SQL] Fix supported interval formats in error messages [spark]

2024-02-16 Thread via GitHub


MaxGekk closed pull request #45127: [SPARK-47072][SQL] Fix supported interval 
formats in error messages
URL: https://github.com/apache/spark/pull/45127





Re: [PR] [SPARK-47072][SQL] Fix supported interval formats in error messages [spark]

2024-02-16 Thread via GitHub


MaxGekk commented on PR #45127:
URL: https://github.com/apache/spark/pull/45127#issuecomment-1948209316

   Merging to master. Thank you, @cloud-fan for review.





[PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun opened a new pull request, #45138:
URL: https://github.com/apache/spark/pull/45138

   ### What changes were proposed in this pull request?
   
   This PR aims to add `derby-provided` profile.
   
   ### Why are the changes needed?
   
   To help users add their own Derby binary easily.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, this is a new additional profile.
   
   ### How was this patch tested?
   
   Manual.
   
   ```
   $ dev/make-distribution.sh -Phive,derby-provided
   $ ls dist/jars/derby*
   zsh: no matches found: dist/jars/derby*
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   





Re: [PR] [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45137:
URL: https://github.com/apache/spark/pull/45137#issuecomment-1948169473

   Could you review this comment-update PR, @HyukjinKwon ?





[PR] [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun opened a new pull request, #45137:
URL: https://github.com/apache/spark/pull/45137

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47057][PYTHON] Re-enable MyPy data test [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45135:
URL: https://github.com/apache/spark/pull/45135#issuecomment-1948102134

   It seems that there are more leftovers~ Could you take a look at them?





Re: [PR] [SPARK-47073][BUILD] Upgrade several Maven plugins to the latest versions [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45136:
URL: https://github.com/apache/spark/pull/45136#issuecomment-1947992760

   Could you review this when you have some time, @LuciferYang ?





Re: [PR] [SPARK-47060][SQL][TESTS] Check `SparkIllegalArgumentException` instead of `IllegalArgumentException` in `catalyst` [spark]

2024-02-16 Thread via GitHub


MaxGekk closed pull request #45118: [SPARK-47060][SQL][TESTS] Check 
`SparkIllegalArgumentException` instead of `IllegalArgumentException` in 
`catalyst`
URL: https://github.com/apache/spark/pull/45118





Re: [PR] [SPARK-47060][SQL][TESTS] Check `SparkIllegalArgumentException` instead of `IllegalArgumentException` in `catalyst` [spark]

2024-02-16 Thread via GitHub


MaxGekk commented on PR #45118:
URL: https://github.com/apache/spark/pull/45118#issuecomment-1947972568

   Merging to master. Thank you, @cloud-fan for review.





[PR] [SPARK-47073][BUILD] Upgrade `versions-maven-plugin` to 2.16.2 [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun opened a new pull request, #45136:
URL: https://github.com/apache/spark/pull/45136

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47071][SQL] Inline With expression if it contains special expression [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun closed pull request #45134: [SPARK-47071][SQL] Inline With 
expression if it contains special expression
URL: https://github.com/apache/spark/pull/45134





Re: [PR] [SPARK-47071][SQL] Inline With expression if it contains special expression [spark]

2024-02-16 Thread via GitHub


dongjoon-hyun commented on PR #45134:
URL: https://github.com/apache/spark/pull/45134#issuecomment-1947950945

   The `org.apache.spark.sql.SparkSessionE2ESuite` failure is a known flaky one.
   
   ![Screenshot 2024-02-16 at 00 26 56](https://github.com/apache/spark/assets/9700541/63207d05-59e9-496c-9150-240208ed2010)
   
   Merged to master.





Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-16 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1492108932


##
python/pyspark/pandas/plot/matplotlib.py:
##
@@ -363,10 +364,23 @@ def _args_adjust(self):
 if is_list_like(self.bottom):
 self.bottom = np.array(self.bottom)
 
+def _ensure_frame(self, data):
+return data
+
+def _calculate_bins(self, data, bins):
+return bins

Review Comment:
   Pandas recently pushed a couple of commits refactoring the internal plotting structure, such as https://github.com/pandas-dev/pandas/pull/55850 and https://github.com/pandas-dev/pandas/pull/55872, so we also need to inherit a couple of internal methods to follow the latest Pandas behavior.






Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]

2024-02-16 Thread via GitHub


allisonwang-db commented on code in PR #45023:
URL: https://github.com/apache/spark/pull/45023#discussion_r1492090151


##
python/pyspark/sql/streaming/python_streaming_source_runner.py:
##
@@ -0,0 +1,160 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import sys
+import json
+from typing import IO
+
+from pyspark.accumulators import _accumulatorRegistry
+from pyspark.errors import PySparkAssertionError, PySparkRuntimeError
+from pyspark.java_gateway import local_connect_and_auth
+from pyspark.serializers import (
+    read_int,
+    write_int,
+    write_with_length,
+    SpecialLengths,
+)
+from pyspark.sql.datasource import DataSource, DataSourceStreamReader
+from pyspark.sql.types import (
+    _parse_datatype_json_string,
+    StructType,
+)
+from pyspark.util import handle_worker_exception
+from pyspark.worker_util import (
+    check_python_version,
+    read_command,
+    pickleSer,
+    send_accumulator_updates,
+    setup_memory_limits,
+    setup_spark_files,
+    utf8_deserializer,
+)
+
+initial_offset_func_id = 884
+latest_offset_func_id = 885
+partitions_func_id = 886
+commit_func_id = 887
+
+
+def initial_offset_func(reader: DataSourceStreamReader, outfile: IO) -> None:
+    offset = reader.initialOffset()
+    write_with_length(json.dumps(offset).encode("utf-8"), outfile)
+
+
+def latest_offset_func(reader: DataSourceStreamReader, outfile: IO) -> None:
+    offset = reader.latestOffset()
+    write_with_length(json.dumps(offset).encode("utf-8"), outfile)
+
+
+def partitions_func(reader: DataSourceStreamReader, infile: IO, outfile: IO) -> None:
+    start_offset = json.loads(utf8_deserializer.loads(infile))
+    end_offset = json.loads(utf8_deserializer.loads(infile))
+    partitions = reader.partitions(start_offset, end_offset)
+    # Return the serialized partition values.
+    write_int(len(partitions), outfile)
+    for partition in partitions:
+        pickleSer._write_with_length(partition, outfile)
+
+
+def commit_func(reader: DataSourceStreamReader, infile: IO, outfile: IO) -> None:
+    end_offset = json.loads(utf8_deserializer.loads(infile))
+    reader.commit(end_offset)
+    write_int(0, outfile)
+
+
+def main(infile: IO, outfile: IO) -> None:
+    try:
+        check_python_version(infile)
+        setup_spark_files(infile)
+
+        memory_limit_mb = int(os.environ.get("PYSPARK_PLANNER_MEMORY_MB", "-1"))
+        setup_memory_limits(memory_limit_mb)
+
+        _accumulatorRegistry.clear()
+
+        # Receive the data source instance.
+        data_source = read_command(pickleSer, infile)
+
+        if not isinstance(data_source, DataSource):
+            raise PySparkAssertionError(
+                error_class="PYTHON_DATA_SOURCE_TYPE_MISMATCH",
+                message_parameters={
+                    "expected": "a Python data source instance of type 'DataSource'",
+                    "actual": f"'{type(data_source).__name__}'",
+                },
+            )
+
+        # Receive the data source output schema.
+        schema_json = utf8_deserializer.loads(infile)
+        schema = _parse_datatype_json_string(schema_json)
+        if not isinstance(schema, StructType):
+            raise PySparkAssertionError(
+                error_class="PYTHON_DATA_SOURCE_TYPE_MISMATCH",
+                message_parameters={
+                    "expected": "an output schema of type 'StructType'",
+                    "actual": f"'{type(schema).__name__}'",
+                },
+            )
+
+        # Instantiate the data source reader.
+        try:
+            reader = data_source.streamReader(schema=schema)
+            # Initialization succeeded.
+            write_int(0, outfile)
+            outfile.flush()
+
+            # Handle method calls from the socket.
+            while True:
+                func_id = read_int(infile)
+                if func_id == initial_offset_func_id:
+                    initial_offset_func(reader, outfile)
+                elif func_id == latest_offset_func_id:
+                    latest_offset_func(reader, outfile)
+                elif func_id == partitions_func_id:
+                    partitions_func(reader, infile, outfile)
+                elif func_id == commit_func_