Re: [PR] [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` [spark]
HeartSaVioR commented on PR #45141: URL: https://github.com/apache/spark/pull/45141#issuecomment-1949492132

Late +1.
Re: [PR] [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option [spark]
dongjoon-hyun commented on PR #45145: URL: https://github.com/apache/spark/pull/45145#issuecomment-1949485755

BTW, did you upload the screenshot? The link seems to be broken, @gengliangwang .

![Screenshot 2024-02-16 at 15 35 57](https://github.com/apache/spark/assets/9700541/1183e3a8-9e37-4356-a191-0e14fd58ebd3)
Re: [PR] [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option [spark]
dongjoon-hyun commented on code in PR #45145: URL: https://github.com/apache/spark/pull/45145#discussion_r1493073011

## docs/sql-data-sources-parquet.md:
```
@@ -616,14 +616,15 @@ Configuration of Parquet can be done via `spark.conf.set` or by running
       3.3.0
 
-  spark.sql.parquet.timestampNTZ.enabled
```

Review Comment: So, previously this was a no-op because Spark has no configuration with this name, @gengliangwang ?
Re: [PR] [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option [spark]
gengliangwang commented on code in PR #45145: URL: https://github.com/apache/spark/pull/45145#discussion_r1493073339

## docs/sql-data-sources-parquet.md:
```
@@ -616,14 +616,15 @@ Configuration of Parquet can be done via `spark.conf.set` or by running
       3.3.0
 
-  spark.sql.parquet.timestampNTZ.enabled
```

Review Comment: Yes
[PR] [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option [spark]
gengliangwang opened a new pull request, #45145: URL: https://github.com/apache/spark/pull/45145

### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/39856. The configuration changes should be reflected in the Parquet data source doc.

### Why are the changes needed?
To fix the doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview: ![Uploading image.png…]()

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] [SPARK-47077][BUILD][TEST] Fix sbt build Revert [SPARK-44445][BUILD][TESTS] [spark]
dongjoon-hyun commented on PR #45144: URL: https://github.com/apache/spark/pull/45144#issuecomment-1949478931

How can I help you, @holdenk ?
Re: [PR] [SPARK-47077][BUILD][TEST] Fix sbt build Revert [SPARK-44445][BUILD][TESTS] [spark]
dongjoon-hyun commented on PR #45144: URL: https://github.com/apache/spark/pull/45144#issuecomment-1949475905

FYI, all GitHub Action jobs work fine without any compilation failure like that. This is the result from my environment.
```
$ java -version
openjdk version "17.0.10" 2024-01-16 LTS
OpenJDK Runtime Environment AppleJDK-17.0.10.7.1 (build 17.0.10+7-LTS)
OpenJDK 64-Bit Server VM AppleJDK-17.0.10.7.1 (build 17.0.10+7-LTS, mixed mode, sharing)

$ build/sbt
sbt:spark-parent> clean ;compile;catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.FilterPushdownSuite
...
[info] FilterPushdownSuite:
[info] - eliminate subqueries (21 milliseconds)
[info] - simple push down (12 milliseconds)
[info] - combine redundant filters (12 milliseconds)
[info] - do not combine non-deterministic filters even if they are identical (10 milliseconds)
[info] - SPARK-16164: Filter pushdown should keep the ordering in the logical plan (7 milliseconds)
[info] - SPARK-16994: filter should not be pushed through limit (3 milliseconds)
[info] - can't push without rewrite (11 milliseconds)
[info] - nondeterministic: can always push down filter through project with deterministic field (11 milliseconds)
[info] - nondeterministic: can't push down filter through project with nondeterministic field (4 milliseconds)
[info] - nondeterministic: can't push down filter through aggregate with nondeterministic field (6 milliseconds)
[info] - nondeterministic: push down part of filter through aggregate with deterministic field (9 milliseconds)
[info] - filters: combines filters (5 milliseconds)
[info] - joins: push to either side (8 milliseconds)
[info] - joins: push to one side (6 milliseconds)
[info] - joins: do not push down non-deterministic filters into join condition (2 milliseconds)
[info] - joins: push to one side after transformCondition (6 milliseconds)
[info] - joins: rewrite filter to push to either side (4 milliseconds)
[info] - joins: push down left semi join (6 milliseconds)
[info] - joins: push down left outer join #1 (6 milliseconds)
[info] - joins: push down right outer join #1 (6 milliseconds)
[info] - joins: push down left outer join #2 (5 milliseconds)
[info] - joins: push down right outer join #2 (6 milliseconds)
[info] - joins: push down left outer join #3 (6 milliseconds)
[info] - joins: push down right outer join #3 (13 milliseconds)
[info] - joins: push down left outer join #4 (6 milliseconds)
[info] - joins: push down right outer join #4 (10 milliseconds)
[info] - joins: push down left outer join #5 (6 milliseconds)
[info] - joins: push down right outer join #5 (9 milliseconds)
[info] - joins: can't push down (4 milliseconds)
[info] - joins: conjunctive predicates (8 milliseconds)
[info] - joins: conjunctive predicates #2 (3 milliseconds)
[info] - joins: conjunctive predicates #3 (7 milliseconds)
[info] - joins: push down where clause into left anti join (5 milliseconds)
[info] - joins: only push down join conditions to the right of a left anti join (4 milliseconds)
[info] - joins: only push down join conditions to the right of an existence join (4 milliseconds)
[info] - generate: predicate referenced no generated column (9 milliseconds)
[info] - generate: non-deterministic predicate referenced no generated column (7 milliseconds)
[info] - generate: part of conjuncts referenced generated column (6 milliseconds)
[info] - generate: all conjuncts referenced generated column (2 milliseconds)
[info] - aggregate: push down filter when filter on group by expression (7 milliseconds)
[info] - aggregate: don't push down filter when filter not on group by expression (5 milliseconds)
[info] - aggregate: push down filters partially which are subset of group by expressions (8 milliseconds)
[info] - aggregate: push down filters with alias (5 milliseconds)
[info] - aggregate: push down filters with literal (9 milliseconds)
[info] - aggregate: don't push down filters that are nondeterministic (7 milliseconds)
[info] - SPARK-17712: aggregate: don't push down filters that are data-independent (6 milliseconds)
[info] - aggregate: don't push filters if the aggregate has no grouping expressions (3 milliseconds)
[info] - SPARK-32940: aggregate: push filters through first, last and collect (17 milliseconds)
[info] - union (11 milliseconds)
[info] - expand (19 milliseconds)
[info] - predicate subquery: push down simple (15 milliseconds)
[info] - predicate subquery: push down complex (8 milliseconds)
[info] - SPARK-20094: don't push predicate with IN subquery into join condition (11 milliseconds)
[info] - Window: predicate push down -- basic (24 milliseconds)
[info] - Window: predicate push down -- predicates with compound predicate using only one column (5 milliseconds)
[info] - Window: predicate push down -- multi window
```
Re: [PR] [SPARK-47077][BUILD][TEST] Fix sbt build Revert [SPARK-44445][BUILD][TESTS] [spark]
dongjoon-hyun commented on PR #45144: URL: https://github.com/apache/spark/pull/45144#issuecomment-1949469508

From the following PR description, Apache Spark 4.0 only supports Java 17+, doesn't it?

> Building with sbt & JDK11 or 17
[PR] [SPARK-47077][BUILD][TEST] Fix sbt build Revert [SPARK-44445][BUILD][TESTS] [spark]
holdenk opened a new pull request, #45144: URL: https://github.com/apache/spark/pull/45144

### What changes were proposed in this pull request?
This reverts commit f5b1e37c9e6a4ec2fd897f97cd4526415e6c0e49.

### Why are the changes needed?
Building with sbt & JDK11 or 17 (executed after `reload & clean ;compile;catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.FilterPushdownSuite`) results in
```
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala:20:8: object WebDriver is not a member of package org.openqa.selenium
[error] import org.openqa.selenium.WebDriver
[error]        ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala:33:27: not found: type WebDriver
[error]   override var webDriver: WebDriver = _
[error]                           ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala:37:29: Class org.openqa.selenium.remote.AbstractDriverOptions not found - continuing with a stub.
[error]     val chromeOptions = new ChromeOptions
[error]                             ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/RealBrowserUIHistoryServerSuite.scala:24:8: object WebDriver is not a member of package org.openqa.selenium
[error] import org.openqa.selenium.WebDriver
[error]        ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/RealBrowserUIHistoryServerSuite.scala:43:27: not found: type WebDriver
[error]   implicit var webDriver: WebDriver
[error]                           ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala:39:21: Class org.openqa.selenium.remote.RemoteWebDriver not found - continuing with a stub.
[error]     webDriver = new ChromeDriver(chromeOptions)
[error]                     ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala:20:28: Unused import
[error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=, cat=unused-imports, site=org.apache.spark.deploy.history
[error] import org.openqa.selenium.WebDriver
[error]                            ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:36:8: object WebDriver is not a member of package org.openqa.selenium
[error] import org.openqa.selenium.WebDriver
[error]        ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:383:29: not found: type WebDriver
[error]   implicit val webDriver: WebDriver = new HtmlUnitDriver
[error]                           ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:37:8: Class org.openqa.selenium.WebDriver not found - continuing with a stub.
[error] import org.openqa.selenium.htmlunit.HtmlUnitDriver
[error]        ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:383:45: Class org.openqa.selenium.Capabilities not found - continuing with a stub.
[error]   implicit val webDriver: WebDriver = new HtmlUnitDriver
[error]                                           ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:470:9: Symbol 'type org.openqa.selenium.WebDriver' is missing from the classpath.
[error] This symbol is required by 'value org.scalatestplus.selenium.WebBrowser.go.driver'.
[error] Make sure that type WebDriver is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
[error] A full rebuild may help if 'WebBrowser.class' was compiled against an incompatible version of org.openqa.selenium.
[error]     go to target.toExternalForm
[error]         ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:470:12: could not find implicit value for parameter driver: org.openqa.selenium.WebDriver
[error]     go to target.toExternalForm
[error]            ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:36:28: Unused import
[error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=, cat=unused-imports, site=org.apache.spark.deploy.history
[error] import org.openqa.selenium.WebDriver
[error]                            ^
[error] /home/holden/repos/spark/core/src/test/scala/org/apache/spark/deploy/history/RealBrowserUIHistoryServerSuite.scala:24:28: Unused import
[error] A
```
Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]
xinrong-meng commented on code in PR #45129: URL: https://github.com/apache/spark/pull/45129#discussion_r1493050589

## python/pyspark/sql/profiler.py:
```
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
         with self._lock:
             value = self._accumulator.value
             return value if value is not None else {}
+
+
+class Profile:
+    """User-facing profile API. This instance can be accessed by
```

Review Comment: Good point! Would you mind if I file a follow-up PR for documentation? We might also want to edit the PySpark debugging guide together.
Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]
xinrong-meng commented on code in PR #45129: URL: https://github.com/apache/spark/pull/45129#discussion_r1493049117

## python/pyspark/sql/session.py:
```
@@ -906,6 +907,12 @@ def dataSource(self) -> "DataSourceRegistration":
         return DataSourceRegistration(self)
 
+    @property
+    def profile(self) -> "Profile":
```

Review Comment: I planned to make a follow-up PR for Connect. But now I think I'll just add Connect changes to this PR.
Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]
xinrong-meng commented on code in PR #45129: URL: https://github.com/apache/spark/pull/45129#discussion_r1493048620

## python/pyspark/sql/profiler.py:
```
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
         with self._lock:
             value = self._accumulator.value
             return value if value is not None else {}
+
+
+class Profile:
+    """User-facing profile API. This instance can be accessed by
+    :attr:`spark.profile`.
+
+    .. versionadded: 4.0.0
+    """
+
+    def __init__(self, sparkSession: "SparkSession"):
+        self.sparkSession = sparkSession
```

Review Comment: Yes, after removing the old APIs.
Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]
xinrong-meng commented on code in PR #45129: URL: https://github.com/apache/spark/pull/45129#discussion_r1493048037

## python/pyspark/sql/profiler.py:
```
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
         with self._lock:
             value = self._accumulator.value
             return value if value is not None else {}
+
+
+class Profile:
+    """User-facing profile API. This instance can be accessed by
+    :attr:`spark.profile`.
+
+    .. versionadded: 4.0.0
+    """
+
+    def __init__(self, sparkSession: "SparkSession"):
+        self.sparkSession = sparkSession
+
+    def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None:
+        """
+        Show the profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        type : str, optional
+            The profiler type, which can be either "perf" or "memory".
+        id : int, optional
+            A UDF ID to be shown. If not specified, all the results will be shown.
+        """
+        if type == "memory":
+            self.sparkSession.showMemoryProfiles(id)
```

Review Comment: Sounds good!
Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]
xinrong-meng commented on code in PR #45129: URL: https://github.com/apache/spark/pull/45129#discussion_r1493048255

## python/pyspark/sql/profiler.py:
```
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
         with self._lock:
             value = self._accumulator.value
             return value if value is not None else {}
+
+
+class Profile:
+    """User-facing profile API. This instance can be accessed by
+    :attr:`spark.profile`.
+
+    .. versionadded: 4.0.0
+    """
+
+    def __init__(self, sparkSession: "SparkSession"):
+        self.sparkSession = sparkSession
+
+    def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None:
```

Review Comment: Your suggestion makes more sense. Thanks!

## python/pyspark/sql/profiler.py:
```
@@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults":
+    def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None:
+        """
+        Show the profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        type : str, optional
+            The profiler type, which can be either "perf" or "memory".
+        id : int, optional
+            A UDF ID to be shown. If not specified, all the results will be shown.
+        """
+        if type == "memory":
+            self.sparkSession.showMemoryProfiles(id)
+        elif type == "perf" or type is None:
+            self.sparkSession.showPerfProfiles(id)
+            if type is None:  # Show both perf and memory profiles
+                self.sparkSession.showMemoryProfiles(id)
+        else:
+            raise PySparkValueError(
+                error_class="VALUE_NOT_ALLOWED",
+                message_parameters={
+                    "arg_name": "type",
+                    "allowed_values": str(["perf", "memory"]),
+                },
+            )
+
+    def dump(self, path: str, *, type: Optional[str] = None, id: Optional[int] = None) -> None:
```

Review Comment: Adjusted
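[Editor's note] For readers following this thread, here is a rough usage sketch of the `spark.profile` API the diffs above converge on. The signatures of `show` and `dump` come from the quoted diff; the profiler config key and the exact output behavior are assumptions, not confirmed in this thread:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
# Assumption: SparkSession-based profiling is switched on via this conf.
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

# Run a Python UDF so there is something to profile.
df = spark.range(10).select(udf(lambda x: x + 1, "long")("id"))
df.collect()

spark.profile.show(type="perf")          # perf results for all profiled UDFs
spark.profile.show(type="memory", id=2)  # hypothetical UDF ID, memory results only
spark.profile.dump("/tmp/spark_profiles", type="perf")  # persist results to a path
```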
Re: [PR] [WIP][SPARK-47032][Python] Prototype for adding pass-through columns to Python UDTF API [spark]
dongjoon-hyun commented on code in PR #45142: URL: https://github.com/apache/spark/pull/45142#discussion_r1493041669

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala:
```
@@ -195,13 +198,13 @@ case class PythonUDTF(
   override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): PythonUDTF =
     copy(children = newChildren)
-}
```

Review Comment: This removal looks like a mistake. Could you make CIs happy?
Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]
dongjoon-hyun closed pull request #45143: [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir
URL: https://github.com/apache/spark/pull/45143
Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]
dongjoon-hyun commented on PR #45143: URL: https://github.com/apache/spark/pull/45143#issuecomment-1949422724

I addressed the comments. Let me merge this and keep monitoring this suite. Thank you again.

Merged to master.
Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]
dongjoon-hyun commented on code in PR #45143: URL: https://github.com/apache/spark/pull/45143#discussion_r1493032810

## core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:
```
@@ -390,6 +390,8 @@ abstract class HistoryServerSuite extends SparkFunSuite with BeforeAndAfter with
     // a new conf is used with the background thread set and running at its fastest
     // allowed refresh rate (1Hz)
     stop()
+    Utils.deleteRecursively(storeDir)
+    assert(storeDir.mkdir())
```

Review Comment: Sure. Thank you!
Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]
viirya commented on code in PR #45143: URL: https://github.com/apache/spark/pull/45143#discussion_r1493032622

## core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:
```
@@ -390,6 +390,8 @@ abstract class HistoryServerSuite extends SparkFunSuite with BeforeAndAfter with
     // a new conf is used with the background thread set and running at its fastest
     // allowed refresh rate (1Hz)
     stop()
+    Utils.deleteRecursively(storeDir)
+    assert(storeDir.mkdir())
```

Review Comment: Looks okay. Maybe add a comment on top of it explaining why it is needed.
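[Editor's note] A self-contained sketch of the reset idiom under review, with the kind of explanatory comment viirya asks for. The comment wording is hypothetical, and the plain-Java `deleteRecursively` below stands in for Spark's `Utils.deleteRecursively`, so treat this as an illustration rather than the merged code:

```scala
import java.io.File
import java.nio.file.Files

object StoreDirResetSketch {
  // Stand-in for org.apache.spark.util.Utils.deleteRecursively.
  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }

  def main(args: Array[String]): Unit = {
    val storeDir = Files.createTempDirectory("history-store").toFile
    new File(storeDir, "stale-app-entry").createNewFile() // leftovers from an earlier test

    // The store directory may still hold state written by an earlier test in the
    // suite; wipe and recreate it so this test starts from an empty store.
    deleteRecursively(storeDir)
    assert(storeDir.mkdir())
    assert(storeDir.listFiles().isEmpty)
  }
}
```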
Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]
dongjoon-hyun commented on PR #45143: URL: https://github.com/apache/spark/pull/45143#issuecomment-1949403179

BTW, I'm still looking at this test suite because the failure is not reproducible in local env yet.
Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]
dongjoon-hyun commented on PR #45143: URL: https://github.com/apache/spark/pull/45143#issuecomment-1949399210

Could you review this test case PR, @viirya ?
Re: [PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]
dongjoon-hyun commented on PR #45143: URL: https://github.com/apache/spark/pull/45143#issuecomment-1949398958

`core` passed.

![Screenshot 2024-02-16 at 14 03 10](https://github.com/apache/spark/assets/9700541/36e7a7de-20e4-4115-aea6-65bc8d100e5b)
Re: [PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]
xinrong-meng commented on PR #45138: URL: https://github.com/apache/spark/pull/45138#issuecomment-1949325532

Good to know, thanks @dongjoon-hyun !
[PR] [SPARK-47076][CORE][TESTS] Fix HistoryServerSuite.`incomplete apps get refreshed` test to start with empty storeDir [spark]
dongjoon-hyun opened a new pull request, #45143: URL: https://github.com/apache/spark/pull/45143

…

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]
dongjoon-hyun commented on PR #45138: URL: https://github.com/apache/spark/pull/45138#issuecomment-1949306231

Merged to master for Apache Spark 4.0.0.
Re: [PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]
dongjoon-hyun closed pull request #45138: [SPARK-47075][BUILD] Add `derby-provided` profile
URL: https://github.com/apache/spark/pull/45138
Re: [PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]
dongjoon-hyun commented on PR #45138: URL: https://github.com/apache/spark/pull/45138#issuecomment-1949304069

Thank you, @xinrong-meng . We can skip this because we only document `hadoop-provided` so far.
```
$ git grep '\-provided' assembly
assembly/pom.xml:      hadoop-provided
assembly/pom.xml:      hive-provided
assembly/pom.xml:      orc-provided
assembly/pom.xml:      parquet-provided
```
```
$ git grep hadoop-provided docs/
docs/building-spark.md:The `hadoop-provided` profile builds the assembly without including Hadoop-ecosystem projects,
docs/index.md:[by augmenting Spark's classpath](hadoop-provided.html).

$ git grep hive-provided docs/
$ git grep orc-provided docs/
$ git grep parquet-provided docs/
```
Re: [PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]
xinrong-meng commented on PR #45138: URL: https://github.com/apache/spark/pull/45138#issuecomment-1949299783

LGTM, thanks @dongjoon-hyun . Out of curiosity, do we have documentation on those build profiles?
Re: [PR] [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts [spark]
dongjoon-hyun closed pull request #45137: [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts
URL: https://github.com/apache/spark/pull/45137
Re: [PR] [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts [spark]
dongjoon-hyun commented on PR #45137: URL: https://github.com/apache/spark/pull/45137#issuecomment-1949296923

Thank you, @xinrong-meng ! Merged to master.
Re: [PR] [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts [spark]
xinrong-meng commented on PR #45137: URL: https://github.com/apache/spark/pull/45137#issuecomment-1949296302

LGTM, thank you!
Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]
sahnib commented on code in PR #44961: URL: https://github.com/apache/spark/pull/44961#discussion_r1492925832

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala:
```
@@ -131,6 +141,15 @@
   def remove(key: UnsafeRow, colFamilyName: String = StateStore.DEFAULT_COL_FAMILY_NAME): Unit
 
+  /**
+   * Merges the provided value with existing values of a non-null key.
```

Review Comment: Makes sense, done.
Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]
sahnib commented on code in PR #44961: URL: https://github.com/apache/spark/pull/44961#discussion_r1492924832

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala:
```
@@ -67,6 +67,15 @@ trait ReadStateStore {
   def get(key: UnsafeRow, colFamilyName: String = StateStore.DEFAULT_COL_FAMILY_NAME): UnsafeRow
 
+  /**
+   * Provides an iterator containing all values of a non-null key.
```

Review Comment: Sounds good. Done
Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]
sahnib commented on code in PR #44961: URL: https://github.com/apache/spark/pull/44961#discussion_r1492922006

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala:
```
@@ -75,6 +76,61 @@ private[sql] class RocksDBStateStoreProvider
       value
     }
 
+    /**
+     * Provides an iterator containing all values of a non-null key.
+     *
+     * Inside RocksDB, the values are merged together and stored as a byte Array.
+     * This operation relies on state store value encoder to be able to split the
+     * single array into multiple values.
+     *
+     * Also see [[MultiValuedStateEncoder]] which supports encoding/decoding multiple
+     * values per key.
+     */
+    override def valuesIterator(key: UnsafeRow, colFamilyName: String): Iterator[UnsafeRow] = {
+      verify(key != null, "Key cannot be null")
+
+      val kvEncoder = keyValueEncoderMap.get(colFamilyName)
+      val valueEncoder = kvEncoder._2
+      val keyEncoder = kvEncoder._1
+
+      verify(valueEncoder.supportsMultipleValuesPerKey, "valuesIterator requires a encoder " +
+        "that supports multiple values for a single key.")
+      val encodedKey = rocksDB.get(keyEncoder.encodeKey(key), colFamilyName)
+      val valueIterator = valueEncoder.decodeValues(encodedKey)
+
+      if (valueIterator.nonEmpty) {
+        new Iterator[UnsafeRow] {
+          override def hasNext: Boolean = {
+            valueIterator.hasNext
+          }
+
+          override def next(): UnsafeRow = {
+            val value = valueIterator.next()
+            if (value != null) {
+              StateStoreProvider.validateStateRowFormat(
```

Review Comment: Removed the wrapper iterator, we don't need it anymore. Thanks for pointing this out.
Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]
sahnib commented on code in PR #44961: URL: https://github.com/apache/spark/pull/44961#discussion_r1492920813

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ValueStateImpl.scala:
```
@@ -77,9 +50,7 @@ class ValueStateImpl[S](
   override def getOption(): Option[S] = {
     val retRow = getImpl()
```

Review Comment: Done.
Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]
sahnib commented on code in PR #44961: URL: https://github.com/apache/spark/pull/44961#discussion_r1489800175

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StateTypesEncoderUtils.scala:
```
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.streaming

import org.apache.commons.lang3.SerializationUtils

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.execution.streaming.state.StateStoreErrors
import org.apache.spark.sql.types.{BinaryType, StructType}

/**
 * Helper object providing APIs to encodes the grouping key, and user provided values
 * to Spark [[UnsafeRow]].
 */
object StateTypesEncoderUtils {

  private val KEY_ROW_SCHEMA: StructType = new StructType().add("key", BinaryType)
  private val VALUE_ROW_SCHEMA: StructType = new StructType().add("value", BinaryType)

  // TODO: validate places that are trying to encode the key and check if we can eliminate/
  // add caching for some of these calls.
  def encodeGroupingKey(stateName: String, keyExprEnc: ExpressionEncoder[Any]): UnsafeRow = {
    val keyOption = ImplicitGroupingKeyTracker.getImplicitKeyOption
    if (keyOption.isEmpty) {
      throw StateStoreErrors.implicitKeyNotFound(stateName)
    }

    val toRow = keyExprEnc.createSerializer()
    val keyByteArr = toRow
      .apply(keyOption.get).asInstanceOf[UnsafeRow].getBytes()

    val keyEncoder = UnsafeProjection.create(KEY_ROW_SCHEMA)
```

Review Comment: We do not anticipate a different schema. Every key and value will be encoded to a byte Array. While working on refactoring the key/value encoders to move them to object level, I discovered that `UnsafeProjection` is not thread safe. Hence, we cannot use it across state variables. Also, the user might try to do something weird and access state data concurrently inside `transformWithState`. I think we should create a projection per encoding. WDYT?
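[Editor's note] Since the thread ends on the thread-safety question, here is a minimal self-contained sketch of the "projection per encoding" option sahnib floats. This is an assumed pattern, not the code that was merged; it requires spark-catalyst on the classpath:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{BinaryType, StructType}

object PerCallProjectionSketch {
  private val KEY_ROW_SCHEMA: StructType = new StructType().add("key", BinaryType)

  // Build a fresh UnsafeProjection for every encode call. The generated
  // projection reuses an internal result buffer, so a single instance must
  // not be shared across threads or across state variables that may be
  // touched concurrently inside transformWithState.
  def encodeKeyBytes(keyByteArr: Array[Byte]): UnsafeRow = {
    val keyEncoder = UnsafeProjection.create(KEY_ROW_SCHEMA)
    keyEncoder(InternalRow(keyByteArr)).copy() // copy() detaches from the shared buffer
  }

  def main(args: Array[String]): Unit = {
    val row = encodeKeyBytes("group-1".getBytes("UTF-8"))
    println(row.getBinary(0).length) // prints 7
  }
}
```

The trade-off is codegen cost on every call versus correctness under concurrency; caching one projection per state variable (rather than one global one) would be a middle ground.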
[PR] [WIP][SPARK-47032][Python] Prototype for adding pass-through columns to Python UDTF API [spark]
dtenedor opened a new pull request, #45142: URL: https://github.com/apache/spark/pull/45142

### What changes were proposed in this pull request?
[WIP] This is a prototype for adding pass-through columns to the Python UDTF API. We'll develop it more before sending it out for formal review.

### Why are the changes needed?
We can use this API to forward column values from the UDTF input table to the output table directly, bypassing JVM/Python interchange and improving performance.

### Does this PR introduce _any_ user-facing change?
Yes, see above.

### How was this patch tested?
This PR adds test coverage.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-47057][PYTHON] Reeanble MyPy data test [spark]
ueshin commented on PR #45135: URL: https://github.com/apache/spark/pull/45135#issuecomment-1949130300

We should also upgrade the minimum version?

https://github.com/apache/spark/blob/3901c6e1c63f300066847ae9393a8b424d7bb0c0/dev/lint-python#L21
Re: [PR] [SPARK-47043][BUILD] add `jackson-core` and `jackson-annotations` dependencies to module `spark-common-utils` [spark]
William1104 commented on PR #45103: URL: https://github.com/apache/spark/pull/45103#issuecomment-1949062510

Hi @dongjoon-hyun ,

Thanks for your explanation. Yes, Spark is a large and very dynamic community. It could be challenging to enforce this, especially since some modules already have more dependencies than required. I would propose fixing the dependency issues module by module, and then enforcing it in a systematic way (via the Maven dependency plugin). Let me send an email to d...@spark.apache.org on this topic.

Thanks a lot.

Best regards,
William
Re: [PR] [SPARK-47070] Fix invalid aggregation after in-subquery rewrite [spark]
agubichev commented on PR #45133: URL: https://github.com/apache/spark/pull/45133#issuecomment-1949032191

@jchen5 @cloud-fan
Re: [PR] [MINOR][DOCS] Show sort order of NaN relative to infinity [spark]
nchammas commented on PR #45047: URL: https://github.com/apache/spark/pull/45047#issuecomment-1948964067

cc @cloud-fan and @peter-toth since we have been discussing NaNs over on #45036.
Re: [PR] [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-conect-server-script.sh [spark]
dongjoon-hyun commented on PR #45117: URL: https://github.com/apache/spark/pull/45117#issuecomment-1948944084

Thank you, @grundprinzip . Merged to master.
Re: [PR] [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-conect-server-script.sh [spark]
dongjoon-hyun closed pull request #45117: [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-conect-server-script.sh
URL: https://github.com/apache/spark/pull/45117
Re: [PR] [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` [spark]
dongjoon-hyun closed pull request #45141: [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite`
URL: https://github.com/apache/spark/pull/45141
Re: [PR] [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` [spark]
dongjoon-hyun commented on PR #45141: URL: https://github.com/apache/spark/pull/45141#issuecomment-1948939396

Merged to branch-3.5.
Re: [PR] [SPARK-47042][BUILD] add missing explicit dependency 'commons-lang3' to the module 'spark-common-utils' [spark]
dongjoon-hyun commented on PR #45101: URL: https://github.com/apache/spark/pull/45101#issuecomment-1948933031

Here is a comment about your question.
- https://github.com/apache/spark/pull/45103#issuecomment-1948928099
Re: [PR] [SPARK-47043][BUILD] add `jackson-core` and `jackson-annotations` dependencies to module `spark-common-utils` [spark]
dongjoon-hyun commented on PR #45103: URL: https://github.com/apache/spark/pull/45103#issuecomment-1948928099

Unlike other communities such as Apache ORC, the Apache Spark community doesn't take advantage of `maven-dependency-plugin` to enforce the `Used undeclared`, `Unused declared`, and `Non-test scoped test only` dependency checks. The main reason is that it would be quite inconvenient in a large and dynamic community like the Apache Spark community.

Apache ORC example: https://github.com/apache/orc/blob/0a02e4cde165b81fb93fc99456a130da4625ef30/java/core/pom.xml#L160-L172

In addition, if you want to propose it, it should be applied systematically to all modules, as in the Apache ORC community, not via many small PRs like this one and #45101. However, if you have a concern about this area and want to contribute to it, I'd like to recommend that you send an email to d...@spark.apache.org . We can discuss more and build a consensus on the necessity of your proposal, @William1104 . You can use your PRs' links in your email.
Re: [PR] [SPARK-47073][BUILD] Upgrade several Maven plugins to the latest versions [spark]
dongjoon-hyun commented on PR #45136: URL: https://github.com/apache/spark/pull/45136#issuecomment-1948876938 Thank you, @LuciferYang !
Re: [PR] [SPARK-47043][BUILD] Add `jackson-core` and `jackson-annotations` dependencies to module `spark-common-utils` [spark]
William1104 commented on PR #45103: URL: https://github.com/apache/spark/pull/45103#issuecomment-1948854042 Hi @dongjoon-hyun,

Thank you for providing the script. If I understand correctly, the script is designed to check whether any dependencies have been updated in an unexpected manner for every profile. This functionality is extremely valuable: it helps us avoid careless mistakes and gives the reviewer a clearer understanding of how dependencies will change.

It appears that several Spark modules use libraries without declaring them, or declare libraries they don't actually use. To generate a report highlighting these issues, I ran the following command:

```
./build/mvn dependency:analyze | sed -n '/<<< dependency:.*:analyze/,/>>> dependency:.*:analyze/p' > dependency-analyze
```

In the "dependency-analyze" report, let's take the module "spark-common-utils" as an example. It currently has a compile-scope dependency on "commons-io", which could be changed to the test scope to avoid unnecessary transitive dependencies. Here are the relevant excerpts from the report:

```
[INFO] --- dependency:3.6.1:analyze (default-cli) @ spark-common-utils_2.13 ---
[WARNING] Used undeclared dependencies found:
[WARNING]    com.fasterxml.jackson.core:jackson-annotations:jar:2.16.1:compile
[WARNING]    org.apache.commons:commons-lang3:jar:3.14.0:compile
[WARNING]    com.fasterxml.jackson.core:jackson-core:jar:2.16.1:compile
[WARNING]    org.scala-lang:scala-library:jar:2.13.12:compile
[WARNING]    org.scalatest:scalatest-funsuite_2.13:jar:3.2.17:test
[WARNING]    org.scalactic:scalactic_2.13:jar:3.2.17:test
[WARNING]    org.scalatest:scalatest-compatible:jar:3.2.17:test
[WARNING]    org.scalatest:scalatest-core_2.13:jar:3.2.17:test
[WARNING] Unused declared dependencies found:
[WARNING]    com.fasterxml.jackson.module:jackson-module-scala_2.13:jar:2.16.1:compile
[WARNING]    oro:oro:jar:2.0.8:compile
[WARNING]    org.slf4j:jul-to-slf4j:jar:2.0.11:compile
[WARNING]    org.slf4j:jcl-over-slf4j:jar:2.0.11:compile
[WARNING]    org.apache.logging.log4j:log4j-slf4j2-impl:jar:2.22.1:compile
[WARNING]    org.apache.logging.log4j:log4j-1.2-api:jar:2.22.1:compile
[WARNING]    org.spark-project.spark:unused:jar:1.0.0:compile
[WARNING]    org.scalatest:scalatest_2.13:jar:3.2.17:test
[WARNING]    org.scalatestplus:scalacheck-1-17_2.13:jar:3.2.17.0:test
[WARNING]    org.scalatestplus:mockito-4-11_2.13:jar:3.2.17.0:test
[WARNING]    org.scalatestplus:selenium-4-12_2.13:jar:3.2.17.0:test
[WARNING]    org.junit.jupiter:junit-jupiter:jar:5.9.3:test
[WARNING]    net.aichler:jupiter-interface:jar:0.11.1:test
[WARNING] Non-test scoped test only dependencies found:
[WARNING]    commons-io:commons-io:jar:2.15.1:compile
[INFO]
```

This shows that there are both used undeclared dependencies and unused declared dependencies. Additionally, there are non-test-scoped test-only dependencies, such as "commons-io:commons-io:jar:2.15.1:compile". These findings can help us identify areas where we can optimize and refine dependency management within the "spark-common-utils" module.

I would like to create a PR to fix the dependency scope.

Thanks and regards, William
Re: [PR] [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` [spark]
LuciferYang commented on PR #45141: URL: https://github.com/apache/spark/pull/45141#issuecomment-1948673492 cc @HeartSaVioR @srowen @amaliujia

Actually, I couldn't reproduce this issue locally: when I use Maven to test the branch-3.5 `connect` module, the test suites execute in the order `SparkConnectStreamingQueryCacheSuite`, `ExecuteEventsManagerSuite`, `SparkConnectProtoSuite`, ..., and there are no `DataFrame` instances in `SparkConnectStreamingQueryCacheSuite` or `ExecuteEventsManagerSuite`. Therefore, `sparkTestRelation` in `SparkConnectProtoSuite` is still the first `DataFrame` to be initialized, its `id` is 0, and the issue is bypassed.

However, we can use a trick to reproduce the failure. For example, change https://github.com/apache/spark/blob/1c1c5faa29dc649faf143fe2eea39ccf15862f85/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala#L90-L94 to

```scala
test("Basic select") {
  val connectPlan = connectTestRelation.select("id".protoAttr)
  val sparkPlan = sparkTestRelation2.select("id")
  comparePlans(connectPlan, sparkPlan)
}
```

This way, `sparkTestRelation` will definitely not be the first `DataFrame` to be initialized, and the test failure can be reproduced, while `Basic select` itself will still pass.
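For context on why suite ordering matters here: `DataFrame` ids come from a single JVM-wide counter (cf. `Dataset.curId`), so the id of `sparkTestRelation` depends on how many DataFrames earlier suites have already created. A minimal sketch of that allocation scheme, using illustrative stand-in names rather than Spark's actual classes:

```scala
import java.util.concurrent.atomic.AtomicLong

// Illustrative stand-in for Dataset's id allocation: a process-wide counter.
object FakeDataset {
  val curId = new AtomicLong()
}

class FakeDataset {
  // The first instance created in the JVM gets id 0, the next gets 1, and so
  // on, so suite execution order shifts the ids of everything created later.
  val id: Long = FakeDataset.curId.getAndIncrement()
}
```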
[PR] [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` [spark]
LuciferYang opened a new pull request, #45141: URL: https://github.com/apache/spark/pull/45141

### What changes were proposed in this pull request?
This PR adds a new function, `normalizeDataframeId`, that sets the `dataframeId` of `CollectMetrics` to the constant 0 before comparing `LogicalPlan`s in the test cases of `SparkConnectProtoSuite`.

### Why are the changes needed?
The test scenarios in `SparkConnectProtoSuite` do not need to compare the `dataframeId` in `CollectMetrics`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Manually checked by running

```
build/mvn clean install -pl connector/connect/server -am -DskipTests
build/mvn test -pl connector/connect/server
```

**Before**

```
- Test observe *** FAILED ***
  == FAIL: Plans do not match ===
  !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 53
  +- LocalRelation , [id#0, name#0]                                                                         +- LocalRelation , [id#0, name#0]
  (PlanTest.scala:179)
```

**After**

```
Run completed in 41 seconds, 631 milliseconds.
Total number of tests run: 882
Suites: completed 24, aborted 0
Tests: succeeded 882, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?
No
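A minimal sketch of what such a normalization can look like; this illustrates the approach described above (rewriting every `CollectMetrics` node so its `dataframeId` is a constant before comparison), not necessarily the PR's exact code:

```scala
import org.apache.spark.sql.catalyst.plans.logical.{CollectMetrics, LogicalPlan}

// Replace the allocation-order-dependent dataframeId with 0 in every
// CollectMetrics node, so plan comparison ignores DataFrame creation order.
def normalizeDataframeId(plan: LogicalPlan): LogicalPlan =
  plan transform {
    case cm: CollectMetrics => cm.copy(dataframeId = 0)
  }
```

Both the Connect-built plan and the expected Spark plan would then be normalized this way before being compared for equality.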
Re: [PR] [SPARK-47072][SQL] Fix supported interval formats in error messages [spark]
MaxGekk commented on PR #45127: URL: https://github.com/apache/spark/pull/45127#issuecomment-1948471185 Here are the backports:
- 3.4 https://github.com/apache/spark/pull/45140
- 3.5 https://github.com/apache/spark/pull/45139
[PR] [SPARK-47072][SQL][3.4] Fix supported interval formats in error messages [spark]
MaxGekk opened a new pull request, #45140: URL: https://github.com/apache/spark/pull/45140

### What changes were proposed in this pull request?
In the PR, I propose to add one more field to the keys of `supportedFormat` in `IntervalUtils`, because the current implementation has duplicate keys that overwrite each other. For instance, the following keys are the same:

```
(YM.YEAR, YM.MONTH)
...
(DT.DAY, DT.HOUR)
```

because `YM.YEAR = DT.DAY = 0` and `YM.MONTH = DT.HOUR = 1`.

This is a backport of https://github.com/apache/spark/pull/45127.

### Why are the changes needed?
To fix the incorrect error message when Spark cannot parse an ANSI interval string. For example, the expected format should be a year-month format, but Spark outputs a day-time one:

```sql
spark-sql (default)> select interval '-\t2-2\t' year to month;

Interval string does not match year-month format of `[+|-]d h`, `INTERVAL [+|-]'[+|-]d h' DAY TO HOUR` when cast to interval year to month: -2-2 . (line 1, pos 16)

== SQL ==
select interval '-\t2-2\t' year to month
----------------^^^
```

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running the existing test suite:

```
$ build/sbt "test:testOnly *IntervalUtilsSuite"
```

and regenerating the golden files:

```
$ SPARK_GENERATE_GOLDEN_FILES=1 PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Authored-by: Max Gekk
(cherry picked from commit 074fcf2807000d342831379de0fafc1e49a6bf19)
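The collision is easy to demonstrate: the field constants are plain `Byte` indices, so the year-month pair and the day-time pair are literally the same tuple when used as a `Map` key. A self-contained sketch with hypothetical constants mirroring the byte values described above (the key shape in the actual fix may differ):

```scala
// Hypothetical constants mirroring YearMonthIntervalType.YEAR/MONTH and
// DayTimeIntervalType.DAY/HOUR, which are all plain Byte indices.
val YM_YEAR: Byte = 0; val YM_MONTH: Byte = 1
val DT_DAY: Byte = 0; val DT_HOUR: Byte = 1

val supportedFormat = Map(
  (YM_YEAR, YM_MONTH) -> Seq("[+|-]y-m", "INTERVAL [+|-]'[+|-]y-m' YEAR TO MONTH"),
  (DT_DAY, DT_HOUR) -> Seq("[+|-]d h", "INTERVAL [+|-]'[+|-]d h' DAY TO HOUR")
)
// (0, 1) == (0, 1), so the second entry overwrites the first.
assert(supportedFormat.size == 1)

// Adding one more discriminating field to the key keeps both entries distinct.
val fixedFormat = Map(
  ("year-month", YM_YEAR, YM_MONTH) -> Seq("[+|-]y-m"),
  ("day-time", DT_DAY, DT_HOUR) -> Seq("[+|-]d h")
)
assert(fixedFormat.size == 2)
```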
Re: [PR] [SPARK-47073][BUILD] Upgrade several Maven plugins to the latest versions [spark]
LuciferYang commented on PR #45136: URL: https://github.com/apache/spark/pull/45136#issuecomment-1948455390 Merged into master for Spark 4.0. Thanks @dongjoon-hyun ~
Re: [PR] [SPARK-47073][BUILD] Upgrade several Maven plugins to the latest versions [spark]
LuciferYang closed pull request #45136: [SPARK-47073][BUILD] Upgrade several Maven plugins to the latest versions URL: https://github.com/apache/spark/pull/45136
Re: [PR] [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-connect-server.sh script [spark]
grundprinzip commented on code in PR #45117: URL: https://github.com/apache/spark/pull/45117#discussion_r1492430978

## sbin/start-connect-server.sh:

```diff
@@ -40,8 +40,9 @@
 fi

 if [ "$1" == "--wait" ]; then
   shift
-  exec "${SPARK_HOME}"/bin/spark-submit --class $CLASS 1 --name "Spark Connect Server" "$@"
-else
-  exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@"
+  export SPARK_NO_DAEMONIZE=1
 fi
+exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@"
```

Review Comment: Done.
[PR] [SPARK-47072][SQL][3.5] Fix supported interval formats in error messages [spark]
MaxGekk opened a new pull request, #45139: URL: https://github.com/apache/spark/pull/45139

### What changes were proposed in this pull request?
In the PR, I propose to add one more field to the keys of `supportedFormat` in `IntervalUtils`, because the current implementation has duplicate keys that overwrite each other. For instance, the following keys are the same:

```
(YM.YEAR, YM.MONTH)
...
(DT.DAY, DT.HOUR)
```

because `YM.YEAR = DT.DAY = 0` and `YM.MONTH = DT.HOUR = 1`.

This is a backport of https://github.com/apache/spark/pull/45127.

### Why are the changes needed?
To fix the incorrect error message when Spark cannot parse an ANSI interval string. For example, the expected format should be a year-month format, but Spark outputs a day-time one:

```sql
spark-sql (default)> select interval '-\t2-2\t' year to month;

Interval string does not match year-month format of `[+|-]d h`, `INTERVAL [+|-]'[+|-]d h' DAY TO HOUR` when cast to interval year to month: -2-2 . (line 1, pos 16)

== SQL ==
select interval '-\t2-2\t' year to month
----------------^^^
```

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running the existing test suite:

```
$ build/sbt "test:testOnly *IntervalUtilsSuite"
```

and regenerating the golden files:

```
$ SPARK_GENERATE_GOLDEN_FILES=1 PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Authored-by: Max Gekk
(cherry picked from commit 074fcf2807000d342831379de0fafc1e49a6bf19)
Re: [PR] [SPARK-47072][SQL] Fix supported interval formats in error messages [spark]
MaxGekk closed pull request #45127: [SPARK-47072][SQL] Fix supported interval formats in error messages URL: https://github.com/apache/spark/pull/45127
Re: [PR] [SPARK-47072][SQL] Fix supported interval formats in error messages [spark]
MaxGekk commented on PR #45127: URL: https://github.com/apache/spark/pull/45127#issuecomment-1948209316 Merging to master. Thank you, @cloud-fan for review.
[PR] [SPARK-47075][BUILD] Add `derby-provided` profile [spark]
dongjoon-hyun opened a new pull request, #45138: URL: https://github.com/apache/spark/pull/45138

### What changes were proposed in this pull request?
This PR aims to add a `derby-provided` profile.

### Why are the changes needed?
To help users supply their own Derby binary easily.

### Does this PR introduce _any_ user-facing change?
No, this is a new additional profile.

### How was this patch tested?
Manual.

```
$ dev/make-distribution.sh -Phive,derby-provided
$ ls dist/jars/derby*
zsh: no matches found: dist/jars/derby*
```

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts [spark]
dongjoon-hyun commented on PR #45137: URL: https://github.com/apache/spark/pull/45137#issuecomment-1948169473 Could you review this comment-update PR, @HyukjinKwon ?
[PR] [SPARK-47074][INFRA] Fix outdated comments in GitHub Action scripts [spark]
dongjoon-hyun opened a new pull request, #45137: URL: https://github.com/apache/spark/pull/45137

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-47057][PYTHON] Reenable MyPy data test [spark]
dongjoon-hyun commented on PR #45135: URL: https://github.com/apache/spark/pull/45135#issuecomment-1948102134 It seems that there are more leftovers~ Could you take a look at them?
Re: [PR] [SPARK-47073][BUILD] Upgrade several Maven plugins to the latest versions [spark]
dongjoon-hyun commented on PR #45136: URL: https://github.com/apache/spark/pull/45136#issuecomment-1947992760 Could you review this when you have some time, @LuciferYang ?
Re: [PR] [SPARK-47060][SQL][TESTS] Check `SparkIllegalArgumentException` instead of `IllegalArgumentException` in `catalyst` [spark]
MaxGekk closed pull request #45118: [SPARK-47060][SQL][TESTS] Check `SparkIllegalArgumentException` instead of `IllegalArgumentException` in `catalyst` URL: https://github.com/apache/spark/pull/45118
Re: [PR] [SPARK-47060][SQL][TESTS] Check `SparkIllegalArgumentException` instead of `IllegalArgumentException` in `catalyst` [spark]
MaxGekk commented on PR #45118: URL: https://github.com/apache/spark/pull/45118#issuecomment-1947972568 Merging to master. Thank you, @cloud-fan for review.
[PR] [SPARK-47073][BUILD] Upgrade `versions-maven-plugin` to 2.16.2 [spark]
dongjoon-hyun opened a new pull request, #45136: URL: https://github.com/apache/spark/pull/45136

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-47071][SQL] Inline With expression if it contains special expression [spark]
dongjoon-hyun closed pull request #45134: [SPARK-47071][SQL] Inline With expression if it contains special expression URL: https://github.com/apache/spark/pull/45134
Re: [PR] [SPARK-47071][SQL] Inline With expression if it contains special expression [spark]
dongjoon-hyun commented on PR #45134: URL: https://github.com/apache/spark/pull/45134#issuecomment-1947950945 `org.apache.spark.sql.SparkSessionE2ESuite` failure is a known flaky one.

![Screenshot 2024-02-16 at 00 26 56](https://github.com/apache/spark/assets/9700541/63207d05-59e9-496c-9150-240208ed2010)

Merged to master.
Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]
itholic commented on code in PR #44881: URL: https://github.com/apache/spark/pull/44881#discussion_r1492108932

## python/pyspark/pandas/plot/matplotlib.py:

```diff
@@ -363,10 +364,23 @@
         if is_list_like(self.bottom):
             self.bottom = np.array(self.bottom)

+    def _ensure_frame(self, data):
+        return data
+
+    def _calculate_bins(self, data, bins):
+        return bins
```

Review Comment: Pandas recently pushed a couple of commits refactoring the internal plotting structure, such as https://github.com/pandas-dev/pandas/pull/55850 or https://github.com/pandas-dev/pandas/pull/55872, so we also need to override a couple of internal methods to follow the latest Pandas behavior.
Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]
allisonwang-db commented on code in PR #45023: URL: https://github.com/apache/spark/pull/45023#discussion_r1492090151

## python/pyspark/sql/streaming/python_streaming_source_runner.py: @@ -0,0 +1,160 @@

```python
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import os
import sys
import json
from typing import IO

from pyspark.accumulators import _accumulatorRegistry
from pyspark.errors import PySparkAssertionError, PySparkRuntimeError
from pyspark.java_gateway import local_connect_and_auth
from pyspark.serializers import (
    read_int,
    write_int,
    write_with_length,
    SpecialLengths,
)
from pyspark.sql.datasource import DataSource, DataSourceStreamReader
from pyspark.sql.types import (
    _parse_datatype_json_string,
    StructType,
)
from pyspark.util import handle_worker_exception
from pyspark.worker_util import (
    check_python_version,
    read_command,
    pickleSer,
    send_accumulator_updates,
    setup_memory_limits,
    setup_spark_files,
    utf8_deserializer,
)

initial_offset_func_id = 884
latest_offset_func_id = 885
partitions_func_id = 886
commit_func_id = 887


def initial_offset_func(reader: DataSourceStreamReader, outfile: IO) -> None:
    offset = reader.initialOffset()
    write_with_length(json.dumps(offset).encode("utf-8"), outfile)


def latest_offset_func(reader: DataSourceStreamReader, outfile: IO) -> None:
    offset = reader.latestOffset()
    write_with_length(json.dumps(offset).encode("utf-8"), outfile)


def partitions_func(reader: DataSourceStreamReader, infile: IO, outfile: IO) -> None:
    start_offset = json.loads(utf8_deserializer.loads(infile))
    end_offset = json.loads(utf8_deserializer.loads(infile))
    partitions = reader.partitions(start_offset, end_offset)
    # Return the serialized partition values.
    write_int(len(partitions), outfile)
    for partition in partitions:
        pickleSer._write_with_length(partition, outfile)


def commit_func(reader: DataSourceStreamReader, infile: IO, outfile: IO) -> None:
    end_offset = json.loads(utf8_deserializer.loads(infile))
    reader.commit(end_offset)
    write_int(0, outfile)


def main(infile: IO, outfile: IO) -> None:
    try:
        check_python_version(infile)
        setup_spark_files(infile)

        memory_limit_mb = int(os.environ.get("PYSPARK_PLANNER_MEMORY_MB", "-1"))
        setup_memory_limits(memory_limit_mb)

        _accumulatorRegistry.clear()

        # Receive the data source instance.
        data_source = read_command(pickleSer, infile)

        if not isinstance(data_source, DataSource):
            raise PySparkAssertionError(
                error_class="PYTHON_DATA_SOURCE_TYPE_MISMATCH",
                message_parameters={
                    "expected": "a Python data source instance of type 'DataSource'",
                    "actual": f"'{type(data_source).__name__}'",
                },
            )

        # Receive the data source output schema.
        schema_json = utf8_deserializer.loads(infile)
        schema = _parse_datatype_json_string(schema_json)
        if not isinstance(schema, StructType):
            raise PySparkAssertionError(
                error_class="PYTHON_DATA_SOURCE_TYPE_MISMATCH",
                message_parameters={
                    "expected": "an output schema of type 'StructType'",
                    "actual": f"'{type(schema).__name__}'",
                },
            )

        # Instantiate data source reader.
        try:
            reader = data_source.streamReader(schema=schema)
            # Initialization succeeded.
            write_int(0, outfile)
            outfile.flush()

            # Handle method calls from the socket.
            while True:
                func_id = read_int(infile)
                if func_id == initial_offset_func_id:
                    initial_offset_func(reader, outfile)
                elif func_id == latest_offset_func_id:
                    latest_offset_func(reader, outfile)
                elif func_id == partitions_func_id:
                    partitions_func(reader, infile, outfile)
                elif func_id == commit_func_
```