Re: [PR] [SPARK-46832][SQL] Introducing Collate and Collation expressions [spark]

2024-02-13 Thread via GitHub
MaxGekk commented on code in PR #45064: URL: https://github.com/apache/spark/pull/45064#discussion_r1489031135 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]

2024-02-13 Thread via GitHub
HeartSaVioR commented on PR #45023: URL: https://github.com/apache/spark/pull/45023#issuecomment-1943113395 Could you please check the GA build result and fix accordingly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

Re: [PR] [SPARK-46906][SS] Add a check for stateful operator change for streaming [spark]

2024-02-13 Thread via GitHub
HeartSaVioR commented on code in PR #44927: URL: https://github.com/apache/spark/pull/44927#discussion_r1488847231 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala: ## @@ -82,6 +84,39 @@ class IncrementalExecution(

Re: [PR] [SPARK-47036][SS] Cleanup RocksDB file tracking for previously uploaded files if files were deleted from local directory [spark]

2024-02-13 Thread via GitHub
HeartSaVioR commented on code in PR #45092: URL: https://github.com/apache/spark/pull/45092#discussion_r1488831105 ## sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala: ## @@ -1863,6 +1864,91 @@ class RocksDBSuite extends

Re: [PR] [SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect` [spark]

2024-02-13 Thread via GitHub
HeartSaVioR commented on PR #43210: URL: https://github.com/apache/spark/pull/43210#issuecomment-1942967890 The error message I've seen was the following: ``` [autosummary] failed to import 'pyspark.ml.connect.classification.LogisticRegression': no module named

Re: [PR] [SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect` [spark]

2024-02-13 Thread via GitHub
HeartSaVioR commented on PR #43210: URL: https://github.com/apache/spark/pull/43210#issuecomment-1942961137 It seems the PySpark docs build is failing due to this while running the release script against branch-3.5. I can see the docs build pass after reverting this commit. It's really

Re: [PR] [SPARK-46820][PYTHON] Fix error message regression by restoring `new_msg` [spark]

2024-02-13 Thread via GitHub
itholic commented on code in PR #44859: URL: https://github.com/apache/spark/pull/44859#discussion_r1488780632 ## python/pyspark/sql/types.py: ## @@ -2214,12 +2211,9 @@ def verify_acceptable_types(obj: Any) -> None: # subclass of them can not be fromInternal in JVM

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-13 Thread via GitHub
itholic commented on PR #44881: URL: https://github.com/apache/spark/pull/44881#issuecomment-1942942082 Yeah, Pandas fixes many bugs in Pandas 2.2.0, which brings a couple of behavior changes. Let me fix them. Thanks for confirming!

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942919284 You're welcome. Feel free to ping me again on this PR. I'll be here today for support.

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
jingz-db commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942918594 >BTW, in the community, we trust CIs as the ground truth. This makes sense, I am double checking. Thanks for the quick response!

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942911238 BTW, in the community, we trust CIs as the ground truth. Does your GitHub Action also fail like you mentioned?

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942910281 Could you clear up your Maven or Ivy cache?

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
jingz-db commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942909715 I just tried `build/sbt clean package` and then `build/sbt "sql/testOnly org.apache.spark.sql.execution.python.PythonDataSourceSuite"`; it still gives the same error as above. And I

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942909458 Also, to @cloud-fan and @HyukjinKwon , could you double-check with @jingz-db and @chaoqin-li1123 ? I can help you if there is a reproducible example in Apache Spark master branch.

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942908662 To @jingz-db and @chaoqin-li1123 , are you sure that you are using Apache Spark `master` instead of `Databricks` master?

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942907749 I also tried the following, and it succeeded too. ``` $ build/sbt ... sbt:spark-parent> testOnly

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942903356 For the record, the following is the result from Apache Spark master branch. ``` $ git log --oneline -n1 63b97c6ad82 (HEAD -> master, apache/master, apache/HEAD)

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942900495 So, something like this? ``` $ build/sbt "sql/testOnly org.apache.spark.sql.execution.python.PythonDataSourceSuite" ```

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
chaoqin-li1123 commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942899902 `build/sbt` to enter the sbt shell, and `testOnly org.apache.spark.sql.execution.python.PythonDataSourceSuite` to run the test within that shell. @dongjoon-hyun

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942897094 Ur, a full command please, @chaoqin-li1123 .

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
chaoqin-li1123 commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942896374 Thanks, @dongjoon-hyun. My command is: > build/sbt >> testOnly org.apache.spark.sql.execution.python.PythonDataSourceSuite

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942895237 I can help you when you provide a reproducible procedure.

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942894624 Please give me a reproducible command line, @jingz-db . :)

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
jingz-db commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942893943 Hi @dongjoon-hyun , a similar error also happens in my local env, with the errors below: ```scala [error]

Re: [PR] [SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession [spark]

2024-02-13 Thread via GitHub
ueshin commented on code in PR #45073: URL: https://github.com/apache/spark/pull/45073#discussion_r1488746220 ## python/pyspark/sql/profiler.py: ## @@ -158,6 +159,70 @@ def _profile_results(self) -> "ProfileResults": """ ... +def dump_perf_profiles(self,

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942893118 What is your command, @chaoqin-li1123 ?

Re: [PR] [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` [spark]

2024-02-13 Thread via GitHub
chaoqin-li1123 commented on PR #45079: URL: https://github.com/apache/spark/pull/45079#issuecomment-1942892329 It seems that this commit breaks my sbt build on the latest master branch. The error message is

Re: [PR] [Don't merge & review] verify sbt on master [spark]

2024-02-13 Thread via GitHub
github-actions[bot] commented on PR #43079: URL: https://github.com/apache/spark/pull/43079#issuecomment-1942891730 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

Re: [PR] [SPARK-45782][CORE][PYTHON] Add Dataframe API df.explainString() [spark]

2024-02-13 Thread via GitHub
github-actions[bot] closed pull request #43651: [SPARK-45782][CORE][PYTHON] Add Dataframe API df.explainString() URL: https://github.com/apache/spark/pull/43651

[PR] [SS] Add MapState implementation for State API v2. [spark]

2024-02-13 Thread via GitHub
jingz-db opened a new pull request, #45094: URL: https://github.com/apache/spark/pull/45094 ### What changes were proposed in this pull request? This PR adds changes for MapState implementation in State API v2. This implementation adds a new encoder/decoder to encode grouping

Re: [PR] [SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession [spark]

2024-02-13 Thread via GitHub
xinrong-meng commented on code in PR #45073: URL: https://github.com/apache/spark/pull/45073#discussion_r1488693924 ## python/pyspark/sql/profiler.py: ## @@ -158,6 +159,70 @@ def _profile_results(self) -> "ProfileResults": """ ... +def

Re: [PR] [SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession [spark]

2024-02-13 Thread via GitHub
xinrong-meng commented on code in PR #45073: URL: https://github.com/apache/spark/pull/45073#discussion_r1488687912 ## python/pyspark/sql/profiler.py: ## @@ -158,6 +159,70 @@ def _profile_results(self) -> "ProfileResults": """ ... +def

[PR] [SPARK-47037] Fix AliasAwareOutputExpression outputPartitioning [spark]

2024-02-13 Thread via GitHub
liorregev opened a new pull request, #45093: URL: https://github.com/apache/spark/pull/45093 AliasAwareOutputExpression does not detect that `select(F.struct($"my_field"))` retains partitioning when the dataset was partitioned by `$"my_field"` before the select. This causes an

Re: [PR] [SPARK-46979][SS] Add support for specifying key and value encoder separately and also for each col family in RocksDB state store provider [spark]

2024-02-13 Thread via GitHub
HeartSaVioR closed pull request #45038: [SPARK-46979][SS] Add support for specifying key and value encoder separately and also for each col family in RocksDB state store provider URL: https://github.com/apache/spark/pull/45038

Re: [PR] [SPARK-46979][SS] Add support for specifying key and value encoder separately and also for each col family in RocksDB state store provider [spark]

2024-02-13 Thread via GitHub
HeartSaVioR commented on PR #45038: URL: https://github.com/apache/spark/pull/45038#issuecomment-1942570003 Thanks! Merging to master.

Re: [PR] [SS][SPARK-47036] Cleanup RocksDB file tracking for previously uploaded files if files were deleted from local directory [spark]

2024-02-13 Thread via GitHub
sahnib commented on PR #45092: URL: https://github.com/apache/spark/pull/45092#issuecomment-1942560146 cc: @HeartSaVioR PTAL, thanks!

Re: [PR] [SPARK-46832][SQL] Introducing Collate and Collation expressions [spark]

2024-02-13 Thread via GitHub
MaxGekk commented on code in PR #45064: URL: https://github.com/apache/spark/pull/45064#discussion_r1488560783 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -1410,6 +1422,13 @@ public boolean equals(final Object other) { } } +

Re: [PR] [SPARK-46832][SQL] Introducing Collate and Collation expressions [spark]

2024-02-13 Thread via GitHub
dbatomic commented on code in PR #45064: URL: https://github.com/apache/spark/pull/45064#discussion_r1488536286 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -1410,6 +1422,13 @@ public boolean equals(final Object other) { } } +

[PR] [SS] Cleanup RocksDB file tracking for previously uploaded files if files were deleted from local directory [spark]

2024-02-13 Thread via GitHub
sahnib opened a new pull request, #45092: URL: https://github.com/apache/spark/pull/45092 … ### What changes were proposed in this pull request? This change cleans up any dangling files tracked as being previously uploaded if they were cleaned up from the

Re: [PR] [SPARK-47035][SS][CONNECT] Protocol for Client-Side Listener [spark]

2024-02-13 Thread via GitHub
WweiL commented on PR #45091: URL: https://github.com/apache/spark/pull/45091#issuecomment-1942363588 @grundprinzip

[PR] [SPARK-47035][SS][CONNECT] Protocol for Client-Side Listener [spark]

2024-02-13 Thread via GitHub
WweiL opened a new pull request, #45091: URL: https://github.com/apache/spark/pull/45091 ### What changes were proposed in this pull request? Currently, the StreamingQueryListener for Connect runs on the server side. From a customer point of view, the purpose of a

Re: [PR] [SPARK-46832][SQL] Introducing Collate and Collation expressions [spark]

2024-02-13 Thread via GitHub
dbatomic commented on code in PR #45064: URL: https://github.com/apache/spark/pull/45064#discussion_r1488515575 ## sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala: ## @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] [SPARK-46832][SQL] Introducing Collate and Collation expressions [spark]

2024-02-13 Thread via GitHub
dbatomic commented on code in PR #45064: URL: https://github.com/apache/spark/pull/45064#discussion_r1488515363 ## sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala: ## @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]

2024-02-13 Thread via GitHub
sahnib commented on code in PR #44961: URL: https://github.com/apache/spark/pull/44961#discussion_r1488459138 ## sql/api/src/main/scala/org/apache/spark/sql/streaming/ValueState.scala: ## @@ -46,5 +46,5 @@ private[sql] trait ValueState[S] extends Serializable { def

Re: [PR] [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 [spark]

2024-02-13 Thread via GitHub
dongjoon-hyun commented on PR #45084: URL: https://github.com/apache/spark/pull/45084#issuecomment-1942113291 Since the RC1 vote failed, I backported this to branch-3.5.

Re: [PR] [SPARK-46906][SS] Add a check for stateful operator change for streaming [spark]

2024-02-13 Thread via GitHub
jingz-db commented on code in PR #44927: URL: https://github.com/apache/spark/pull/44927#discussion_r1488336417 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala: ## @@ -184,6 +185,41 @@ class IncrementalExecution( } } +

Re: [PR] [SPARK-47028][SQL][TESTS] Check `SparkUnsupportedOperationException` instead of `UnsupportedOperationException` [spark]

2024-02-13 Thread via GitHub
MaxGekk closed pull request #45082: [SPARK-47028][SQL][TESTS] Check `SparkUnsupportedOperationException` instead of `UnsupportedOperationException` URL: https://github.com/apache/spark/pull/45082

Re: [PR] [SPARK-47028][SQL][TESTS] Check `SparkUnsupportedOperationException` instead of `UnsupportedOperationException` [spark]

2024-02-13 Thread via GitHub
MaxGekk commented on PR #45082: URL: https://github.com/apache/spark/pull/45082#issuecomment-1941167022 Merging to master. Thank you, @LuciferYang, for the review.

Re: [PR] [WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-13 Thread via GitHub
itholic commented on code in PR #44881: URL: https://github.com/apache/spark/pull/44881#discussion_r1487597019 ## python/pyspark/pandas/frame.py: ## @@ -10607,7 +10607,9 @@ def melt( name_like_string(name) if name is not None else "variable_{}".format(i)

Re: [PR] [SPARK-46832][SQL] Introducing Collate and Collation expressions [spark]

2024-02-13 Thread via GitHub
MaxGekk commented on code in PR #45064: URL: https://github.com/apache/spark/pull/45064#discussion_r1487518940 ## sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala: ## @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more +

Re: [PR] [SPARK-46832][SQL] Introducing Collate and Collation expressions [spark]

2024-02-13 Thread via GitHub
MaxGekk commented on code in PR #45064: URL: https://github.com/apache/spark/pull/45064#discussion_r1487489877 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -1410,6 +1422,13 @@ public boolean equals(final Object other) { } } +

Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]

2024-02-13 Thread via GitHub
HeartSaVioR commented on code in PR #44961: URL: https://github.com/apache/spark/pull/44961#discussion_r1487446994 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala: ## @@ -67,6 +67,16 @@ trait ReadStateStore { def get(key: UnsafeRow,

Re: [PR] [SS][SPARK-46928] Add support for ListState in Arbitrary State API v2. [spark]

2024-02-13 Thread via GitHub
HeartSaVioR commented on code in PR #44961: URL: https://github.com/apache/spark/pull/44961#discussion_r1487255434 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala: ## @@ -78,7 +78,7 @@ class StatePartitionReader(

Re: [PR] [SPARK-47028][SQL][TESTS] Check `SparkUnsupportedOperationException` instead of `UnsupportedOperationException` [spark]

2024-02-13 Thread via GitHub
MaxGekk commented on PR #45082: URL: https://github.com/apache/spark/pull/45082#issuecomment-1940685375 @panbingkun @srielau @LuciferYang @beliefer @cloud-fan Could you review this PR, please?