Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-05 Thread via GitHub
wbo4958 commented on code in PR #45232: URL: https://github.com/apache/spark/pull/45232#discussion_r1513987011 ## python/pyspark/resource/profile.py: ## @@ -114,14 +122,26 @@ def id(self) -> int: int A unique id of this :class:`ResourceProfile`

Re: [PR] [SPARK-47208][CORE] Allow overriding base overhead memory [spark]

2024-03-05 Thread via GitHub
mridulm commented on code in PR #45240: URL: https://github.com/apache/spark/pull/45240#discussion_r1513970260 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -117,6 +117,14 @@ package object config { .bytesConf(ByteUnit.MiB)

[PR] [SPARK-47299][PYTHON][DOCS] Use the same `versions.json` in the dropdown of different versions of PySpark documents [spark]

2024-03-05 Thread via GitHub
panbingkun opened a new pull request, #45400: URL: https://github.com/apache/spark/pull/45400 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was

Re: [PR] [SPARK-47208][CORE] Allow overriding base overhead memory [spark]

2024-03-05 Thread via GitHub
mridulm commented on PR #45240: URL: https://github.com/apache/spark/pull/45240#issuecomment-1980260288 I would like to understand the usecase better here - It is still unclear to me what characteristics you are shooting for by this PR. Reduction in OOM is mentioned [as a

Re: [PR] [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [spark]

2024-03-05 Thread via GitHub
mihailom-db commented on code in PR #45383: URL: https://github.com/apache/spark/pull/45383#discussion_r1513956640 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala: ## @@ -764,6 +782,91 @@ abstract class TypeCoercionBase { } } +

Re: [PR] [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [spark]

2024-03-05 Thread via GitHub
mihailom-db commented on code in PR #45383: URL: https://github.com/apache/spark/pull/45383#discussion_r1513955577 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala: ## @@ -764,6 +782,91 @@ abstract class TypeCoercionBase { } } +

[PR] [WIP][SPARK-47298][BUILD] Upgrade `mysql-connector-j` to `8.3.0` and `mariadb-java-client` to `2.7.12` [spark]

2024-03-05 Thread via GitHub
panbingkun opened a new pull request, #45399: URL: https://github.com/apache/spark/pull/45399 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was

Re: [PR] [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [spark]

2024-03-05 Thread via GitHub
mihailom-db commented on code in PR #45383: URL: https://github.com/apache/spark/pull/45383#discussion_r1513941920 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala: ## @@ -958,14 +1062,16 @@ object TypeCoercion extends TypeCoercionBase {

Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]

2024-03-05 Thread via GitHub
panbingkun commented on code in PR #45368: URL: https://github.com/apache/spark/pull/45368#discussion_r1513941395 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala: ## @@ -156,15 +156,6 @@ class V2SessionCatalog(catalog:

Re: [PR] [SPARK-47293][CORE] Build batchSchema with sparkSchema instead of append one by one [spark]

2024-03-05 Thread via GitHub
yaooqinn commented on PR #45396: URL: https://github.com/apache/spark/pull/45396#issuecomment-1980207462 Thank you all. Merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [PR] [SPARK-47293][CORE] Build batchSchema with sparkSchema instead of append one by one [spark]

2024-03-05 Thread via GitHub
yaooqinn closed pull request #45396: [SPARK-47293][CORE] Build batchSchema with sparkSchema instead of append one by one URL: https://github.com/apache/spark/pull/45396 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

Re: [PR] [SPARK-47248][SQL][COLLATION] Extended string function support: contains [spark]

2024-03-05 Thread via GitHub
uros-db commented on code in PR #45382: URL: https://github.com/apache/spark/pull/45382#discussion_r1513909856 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -343,19 +346,33 @@ public boolean contains(final UTF8String substring) {

Re: [PR] [SPARK-46835][SQL][Collations] Join support for non-binary collations [spark]

2024-03-05 Thread via GitHub
cloud-fan closed pull request #45389: [SPARK-46835][SQL][Collations] Join support for non-binary collations URL: https://github.com/apache/spark/pull/45389 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] [SPARK-46835][SQL][Collations] Join support for non-binary collations [spark]

2024-03-05 Thread via GitHub
cloud-fan commented on PR #45389: URL: https://github.com/apache/spark/pull/45389#issuecomment-1980164987 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] [DO-NOT-MERGE] Restructuring MasterSuite [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on PR #45366: URL: https://github.com/apache/spark/pull/45366#issuecomment-1980160010 https://github.com/HyukjinKwon/spark/actions/runs/8167770830 It should work now, I believe, but let me wait for the test result. -- This is an automated message from the Apache

Re: [PR] [SPARK-47294][SQL] OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec) [spark]

2024-03-05 Thread via GitHub
AngersZh commented on PR #45398: URL: https://github.com/apache/spark/pull/45398#issuecomment-1980144655 > @AngersZh I guess you are changing an outdated codebase... This feature has already been supported since #34542 (Spark 3.3) Yea...didn't see the change -- This is an automated

Re: [PR] [SPARK-47294][SQL] OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec) [spark]

2024-03-05 Thread via GitHub
AngersZh closed pull request #45398: [SPARK-47294][SQL] OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec) URL: https://github.com/apache/spark/pull/45398 -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]

2024-03-05 Thread via GitHub
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1513862684 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala: ## @@ -148,6 +148,18 @@ abstract class QueryStageExec extends

Re: [PR] [SPARK-47146][CORE] Possible thread leak when doing sort merge join [spark]

2024-03-05 Thread via GitHub
pan3793 commented on code in PR #45327: URL: https://github.com/apache/spark/pull/45327#discussion_r1513846126 ## core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java: ## @@ -36,6 +38,7 @@ * of the file format). */ public final class

Re: [PR] [SPARK-47280][SQL] Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE [spark]

2024-03-05 Thread via GitHub
yaooqinn commented on PR #45384: URL: https://github.com/apache/spark/pull/45384#issuecomment-1980118542 Thank you @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] [SPARK-44259][CONNECT][TESTS] Make `connect-client-jvm` pass on Java 21 except `RemoteSparkSession`-based tests [spark]

2024-03-05 Thread via GitHub
dongjoon-hyun commented on PR #41805: URL: https://github.com/apache/spark/pull/41805#issuecomment-1980114512 Ya, @LuciferYang is right. To @Midhunpottammal , you need SPARK-43831 for Java 21 support. -- This is an automated message from the Apache Git Service. To respond to the

Re: [PR] [SPARK-47294][SQL] OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec) [spark]

2024-03-05 Thread via GitHub
ulysses-you commented on PR #45398: URL: https://github.com/apache/spark/pull/45398#issuecomment-1980095300 @AngersZh I guess you are changing an outdated codebase... This feature has already been supported since https://github.com/apache/spark/pull/34542 (Spark 3.3) -- This is an automated

Re: [PR] [SPARK-46992]Fix "Inconsistent results with 'sort', 'cache', and AQE." [spark]

2024-03-05 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1980094365 > I don't think it fixes the issue completely and there are some problems with the solution. I believe a proper solution is in the following comment: [#45181

Re: [PR] [Work in Progress] Experimenting to move TransportCipher to GCM based on Google Tink [spark]

2024-03-05 Thread via GitHub
sweisdb commented on PR #45394: URL: https://github.com/apache/spark/pull/45394#issuecomment-1980063354 @mridulm At its core, using AES-CTR mode without authentication is insecure because someone can change RPC contents by simply XORing the ciphertext. This can be demonstrated by modifying
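A minimal, self-contained sketch of that malleability (illustrative only, not code from the PR; the key, IV, and payload below are made up):

```scala
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Hypothetical demo: in CTR mode the keystream is XORed with the plaintext,
// so flipping bits in the ciphertext flips the same bits in the decrypted
// plaintext -- no key is needed, and nothing detects the tampering.
object CtrMalleabilityDemo {
  def main(args: Array[String]): Unit = {
    val key = new SecretKeySpec(Array.fill[Byte](16)(0x11.toByte), "AES")
    val iv  = new IvParameterSpec(Array.fill[Byte](16)(0x22.toByte))

    val enc = Cipher.getInstance("AES/CTR/NoPadding")
    enc.init(Cipher.ENCRYPT_MODE, key, iv)
    val original   = "amount=0010".getBytes("UTF-8")
    val ciphertext = enc.doFinal(original)

    // Attacker knows (or guesses) the plaintext layout and XORs in a delta.
    val target = "amount=9910".getBytes("UTF-8")
    val tampered = ciphertext.indices.map { i =>
      (ciphertext(i) ^ original(i) ^ target(i)).toByte
    }.toArray

    val dec = Cipher.getInstance("AES/CTR/NoPadding")
    dec.init(Cipher.DECRYPT_MODE, key, iv)
    println(new String(dec.doFinal(tampered), "UTF-8")) // prints "amount=9910"
  }
}
```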

Re: [PR] [SPARK-47294][SQL] OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec) [spark]

2024-03-05 Thread via GitHub
AngersZh commented on PR #45398: URL: https://github.com/apache/spark/pull/45398#issuecomment-1980056417 ping @ulysses-you @yaooqinn -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[PR] [SPARK-47294][SQL] OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec) [spark]

2024-03-05 Thread via GitHub
AngersZh opened a new pull request, #45398: URL: https://github.com/apache/spark/pull/45398 ### What changes were proposed in this pull request? Currently, OptimizeSkewInRebalanceRepartitions only supports the match case ShuffleQueryStageExec ``` plan transformUp { case

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-05 Thread via GitHub
wForget commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1513786488 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -4483,6 +4478,17 @@ class Dataset[T] private[sql]( } } + /** Returns a optimized plan

[PR] [WIP] Add ConvertCommandResultToLocalRelation rule [spark]

2024-03-05 Thread via GitHub
wForget opened a new pull request, #45397: URL: https://github.com/apache/spark/pull/45397 ### What changes were proposed in this pull request? Add ConvertCommandResultToLocalRelation rule. ### Why are the changes needed? ### Does this PR introduce

[PR] [SPARK-47293][CORE] Build batchSchema with sparkSchema instead of append one by one [spark]

2024-03-05 Thread via GitHub
zwangsheng opened a new pull request, #45396: URL: https://github.com/apache/spark/pull/45396 ### What changes were proposed in this pull request? Simplify the building process of `batchSchema` by passing `sparkSchema.fields` instead of adding fields one by one.
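As a rough sketch of the idea using generic StructType APIs (not the PR's actual reader code; `sparkSchema` below is a made-up schema):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Illustrative schema standing in for the reader's sparkSchema.
val sparkSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))

// Append-one-by-one style: builds an intermediate StructType per field.
val appended = sparkSchema.fields.foldLeft(new StructType())((acc, f) => acc.add(f))

// Bulk construction: hand the existing field array over in one call.
val bulk = StructType(sparkSchema.fields)

assert(appended == bulk)
```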

Re: [PR] [SPARK-42040][SQL] SPJ: Introduce a new API for V2 input partition to report partition statistics [spark]

2024-03-05 Thread via GitHub
zhuqi-lucas commented on code in PR #45314: URL: https://github.com/apache/spark/pull/45314#discussion_r1513776816 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/HasPartitionSize.java: ## @@ -0,0 +1,36 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] [SPARK-47277][3.5] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

2024-03-05 Thread via GitHub
HyukjinKwon closed pull request #45395: [SPARK-47277][3.5] PySpark util function assertDataFrameEqual should not support streaming DF URL: https://github.com/apache/spark/pull/45395 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

Re: [PR] [SPARK-47277][3.5] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on PR #45395: URL: https://github.com/apache/spark/pull/45395#issuecomment-1980009033 Merged to branch-3.5. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] [SPARK-47280][SQL] Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE [spark]

2024-03-05 Thread via GitHub
yaooqinn closed pull request #45384: [SPARK-47280][SQL] Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE URL: https://github.com/apache/spark/pull/45384 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

Re: [PR] [SPARK-47280][SQL] Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE [spark]

2024-03-05 Thread via GitHub
yaooqinn commented on PR #45384: URL: https://github.com/apache/spark/pull/45384#issuecomment-1980006732 Thanks for the review @cloud-fan. Merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-05 Thread via GitHub
ueshin commented on code in PR #45378: URL: https://github.com/apache/spark/pull/45378#discussion_r1513759920 ## python/pyspark/sql/profiler.py: ## @@ -224,6 +224,54 @@ def dump(id: int) -> None: for id in sorted(code_map.keys()): dump(id) +

Re: [PR] [SPARK-47247][SQL] Use smaller target size when coalescing partitions with exploding joins [spark]

2024-03-05 Thread via GitHub
cloud-fan closed pull request #45357: [SPARK-47247][SQL] Use smaller target size when coalescing partitions with exploding joins URL: https://github.com/apache/spark/pull/45357 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

Re: [PR] [SPARK-40763][K8S] Should expose driver service name to config for user features [spark]

2024-03-05 Thread via GitHub
melin commented on PR #38202: URL: https://github.com/apache/spark/pull/38202#issuecomment-1979982318 cc @zwangsheng -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

Re: [PR] [SPARK-47247][SQL] Use smaller target size when coalescing partitions with exploding joins [spark]

2024-03-05 Thread via GitHub
cloud-fan commented on PR #45357: URL: https://github.com/apache/spark/pull/45357#issuecomment-1979976679 thanks for the review, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

Re: [PR] [SPARK-46992]Fix "Inconsistent results with 'sort', 'cache', and AQE." [spark]

2024-03-05 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1979974166 > I don't think it fixes the issue completely and there are some problems with the solution. I believe a proper solution is in the following comment: [#45181

Re: [PR] [SPARK-47146][CORE][3.5] Possible thread leak when doing sort merge join [spark]

2024-03-05 Thread via GitHub
mridulm closed pull request #45390: [SPARK-47146][CORE][3.5] Possible thread leak when doing sort merge join URL: https://github.com/apache/spark/pull/45390 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

Re: [PR] [SPARK-47146][CORE][3.5] Possible thread leak when doing sort merge join [spark]

2024-03-05 Thread via GitHub
mridulm commented on PR #45390: URL: https://github.com/apache/spark/pull/45390#issuecomment-1979973804 Merged to branch-3.5 and branch-3.4. Thanks for fixing this, @JacobZheng0927! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-05 Thread via GitHub
wForget commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1513732088 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -4483,6 +4478,17 @@ class Dataset[T] private[sql]( } } + /** Returns a optimized plan

Re: [PR] [SPARK-46992]Fix "Inconsistent results with 'sort', 'cache', and AQE." [spark]

2024-03-05 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1979968619 Maybe [this](https://github.com/apache/spark/pull/45181#issuecomment-1969241145) is the proper solution. But we need to find all the children of the logicalPlan that are cached: ```scala

Re: [PR] [SPARK-42040][SQL] SPJ: Introduce a new API for V2 input partition to report partition size [spark]

2024-03-05 Thread via GitHub
zhuqi-lucas commented on code in PR #45314: URL: https://github.com/apache/spark/pull/45314#discussion_r1513735868 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/HasPartitionSize.java: ## @@ -0,0 +1,36 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] [SPARK-47283][PYSPARK][DOCS] Remove Spark version drop down to the PySpark doc site [spark]

2024-03-05 Thread via GitHub
panbingkun closed pull request #45387: [SPARK-47283][PYSPARK][DOCS] Remove Spark version drop down to the PySpark doc site URL: https://github.com/apache/spark/pull/45387 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [PR] [SPARK-44259][CONNECT][TESTS] Make `connect-client-jvm` pass on Java 21 except `RemoteSparkSession`-based tests [spark]

2024-03-05 Thread via GitHub
LuciferYang commented on PR #41805: URL: https://github.com/apache/spark/pull/41805#issuecomment-1979957804 @Midhunpottammal Spark 3.5 has not announced support for Java 21; this feature is likely to be released in Spark 4.0 :) -- This is an automated message from the Apache Git

Re: [PR] [DO-NOT-MERGE] Restructuring MasterSuite [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on PR #45366: URL: https://github.com/apache/spark/pull/45366#issuecomment-1979952811 https://github.com/HyukjinKwon/spark/actions/runs/8165935950/job/22323872479 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

Re: [PR] [SPARK-47280][SQL] Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE [spark]

2024-03-05 Thread via GitHub
yaooqinn commented on PR #45384: URL: https://github.com/apache/spark/pull/45384#issuecomment-1979951059 cc @cloud-fan @dongjoon-hyun, thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [PR] [SPARK-46992]Fix "Inconsistent results with 'sort', 'cache', and AQE." [spark]

2024-03-05 Thread via GitHub
doki23 commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1513722604 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -3878,6 +3880,8 @@ class Dataset[T] private[sql]( */ def persist(newLevel: StorageLevel):

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-05 Thread via GitHub
wForget commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1513716970 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -4483,6 +4478,17 @@ class Dataset[T] private[sql]( } } + /** Returns a optimized plan

Re: [PR] [Work in Progress] Experimenting to move TransportCipher to GCM based on Google Tink [spark]

2024-03-05 Thread via GitHub
mridulm commented on PR #45394: URL: https://github.com/apache/spark/pull/45394#issuecomment-1979944365 It is not clear to me why we should be making this change, what the benefits are, and what the current limitations are. Note that Spark 4.0 supports TLS - so if this is still required in

Re: [PR] [SPARK-47285][SQL] AdaptiveSparkPlanExec should always use the context.session [spark]

2024-03-05 Thread via GitHub
cloud-fan commented on PR #45388: URL: https://github.com/apache/spark/pull/45388#issuecomment-1979942297 late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-05 Thread via GitHub
xinrong-meng commented on code in PR #45378: URL: https://github.com/apache/spark/pull/45378#discussion_r1513714435 ## python/pyspark/sql/profiler.py: ## @@ -224,6 +224,54 @@ def dump(id: int) -> None: for id in sorted(code_map.keys()): dump(id)

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-05 Thread via GitHub
cloud-fan commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1513713801 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -4483,6 +4478,17 @@ class Dataset[T] private[sql]( } } + /** Returns a optimized plan

Re: [PR] [SPARK-47285][SQL] AdaptiveSparkPlanExec should always use the context.session [spark]

2024-03-05 Thread via GitHub
yaooqinn commented on PR #45388: URL: https://github.com/apache/spark/pull/45388#issuecomment-1979937660 merged to master. thanks @ulysses-you @HyukjinKwon. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] [SPARK-47285][SQL] AdaptiveSparkPlanExec should always use the context.session [spark]

2024-03-05 Thread via GitHub
yaooqinn closed pull request #45388: [SPARK-47285][SQL] AdaptiveSparkPlanExec should always use the context.session URL: https://github.com/apache/spark/pull/45388 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] [SPARK-46992]Fix "Inconsistent results with 'sort', 'cache', and AQE." [spark]

2024-03-05 Thread via GitHub
doki23 commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1513704138 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -193,10 +193,12 @@ private[sql] object Dataset { */ @Stable class Dataset[T] private[sql]( -

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-05 Thread via GitHub
wForget commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1513704269 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -4483,6 +4478,17 @@ class Dataset[T] private[sql]( } } + /** Returns a optimized plan

Re: [PR] [SPARK-46992]Fix "Inconsistent results with 'sort', 'cache', and AQE." [spark]

2024-03-05 Thread via GitHub
doki23 commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1513702086 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -193,10 +193,12 @@ private[sql] object Dataset { */ @Stable class Dataset[T] private[sql]( -

Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-05 Thread via GitHub
cloud-fan commented on code in PR #45125: URL: https://github.com/apache/spark/pull/45125#discussion_r1513701849 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala: ## @@ -34,7 +34,7 @@ import

Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-05 Thread via GitHub
cloud-fan commented on code in PR #45125: URL: https://github.com/apache/spark/pull/45125#discussion_r1513701628 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala: ## @@ -34,7 +34,7 @@ import

Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-05 Thread via GitHub
cloud-fan commented on code in PR #45125: URL: https://github.com/apache/spark/pull/45125#discussion_r1513701474 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -328,6 +328,34 @@ abstract class Optimizer(catalogManager:

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on code in PR #45378: URL: https://github.com/apache/spark/pull/45378#discussion_r1513693941 ## python/pyspark/sql/profiler.py: ## @@ -224,6 +224,54 @@ def dump(id: int) -> None: for id in sorted(code_map.keys()): dump(id) +

Re: [PR] [SPARK-46350][SS] Fix state removal for stream-stream join with one watermark and one time-interval condition [spark]

2024-03-05 Thread via GitHub
rangadi commented on code in PR #44323: URL: https://github.com/apache/spark/pull/44323#discussion_r1513647366 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ## @@ -198,11 +198,15 @@ object

Re: [PR] [WIP][BUILD] Upgrade RocksDB version to 8.11.3 [spark]

2024-03-05 Thread via GitHub
dongjoon-hyun commented on PR #45391: URL: https://github.com/apache/spark/pull/45391#issuecomment-1979890992 Thank you, but we already have ongoing work on this. Let me close this, @neilramaswamy. - #45365 -- This is an automated message from the Apache Git Service. To respond to the

Re: [PR] [WIP][BUILD] Upgrade RocksDB version to 8.11.3 [spark]

2024-03-05 Thread via GitHub
dongjoon-hyun closed pull request #45391: [WIP][BUILD] Upgrade RocksDB version to 8.11.3 URL: https://github.com/apache/spark/pull/45391 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] [WIP][BUILD] Upgrade RocksDB version to 8.11.3 [spark]

2024-03-05 Thread via GitHub
neilramaswamy commented on PR #45391: URL: https://github.com/apache/spark/pull/45391#issuecomment-1979889470 JDK 17 run: https://github.com/neilramaswamy/nr-spark/actions/runs/8164820755 -- This is an automated message from the Apache Git Service. To respond to the message, please log on

[PR] [SPARK-47277][3.5] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

2024-03-05 Thread via GitHub
WweiL opened a new pull request, #45395: URL: https://github.com/apache/spark/pull/45395 ### What changes were proposed in this pull request? Backport https://github.com/apache/spark/pull/45380 to branch-3.5. The handy util function should not support streaming

[PR] [Work in Progress] Experimenting to move TransportCipher to GCM based on Google Tink [spark]

2024-03-05 Thread via GitHub
sweisdb opened a new pull request, #45394: URL: https://github.com/apache/spark/pull/45394 ### What changes were proposed in this pull request? The high-level issue is that Apache Spark's RPC encryption uses unauthenticated CTR. We want to switch to GCM. The complication is
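For context, a minimal sketch of authenticated encryption via Tink's AEAD primitive (AES-GCM). This is only an assumed usage pattern, not the PR's actual TransportCipher integration, and names such as `rpcPayload` are invented:

```scala
import com.google.crypto.tink.{Aead, KeysetHandle}
import com.google.crypto.tink.aead.{AeadConfig, AeadKeyTemplates}

object TinkGcmSketch {
  def main(args: Array[String]): Unit = {
    AeadConfig.register() // registers the AEAD key managers with Tink

    // A fresh AES-GCM keyset; a real deployment would load or derive the key instead.
    val handle = KeysetHandle.generateNew(AeadKeyTemplates.AES128_GCM)
    val aead = handle.getPrimitive(classOf[Aead])

    val rpcPayload = "example rpc frame".getBytes("UTF-8")
    val aad = "channel-id".getBytes("UTF-8") // associated data: authenticated but not encrypted

    val ciphertext = aead.encrypt(rpcPayload, aad)
    // decrypt() verifies the GCM tag; any bit-flip in ciphertext or aad makes it throw.
    val roundTrip = aead.decrypt(ciphertext, aad)
    assert(new String(roundTrip, "UTF-8") == "example rpc frame")
  }
}
```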

Re: [PR] [SPARK-45954][SQL] Remove redundant shuffles [spark]

2024-03-05 Thread via GitHub
github-actions[bot] closed pull request #43841: [SPARK-45954][SQL] Remove redundant shuffles URL: https://github.com/apache/spark/pull/43841 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [PR] [SPARK-47277] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

2024-03-05 Thread via GitHub
WweiL commented on PR #45380: URL: https://github.com/apache/spark/pull/45380#issuecomment-1979850219 @HyukjinKwon Sure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

Re: [PR] [SPARK-44746][PYTHON] Add more Python UDTF documentation for functions that accept input tables [spark]

2024-03-05 Thread via GitHub
HyukjinKwon closed pull request #45375: [SPARK-44746][PYTHON] Add more Python UDTF documentation for functions that accept input tables URL: https://github.com/apache/spark/pull/45375 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

Re: [PR] [SPARK-44746][PYTHON] Add more Python UDTF documentation for functions that accept input tables [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on PR #45375: URL: https://github.com/apache/spark/pull/45375#issuecomment-1979844986 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1513644737 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -4483,6 +4478,17 @@ class Dataset[T] private[sql]( } } + /** Returns a optimized

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1513644462 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -4483,6 +4478,17 @@ class Dataset[T] private[sql]( } } + /** Returns a optimized

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1513638446 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -655,7 +649,8 @@ class Dataset[T] private[sql]( * @group basic * @since 2.4.0 */ -

Re: [PR] [SPARK-47277] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on PR #45380: URL: https://github.com/apache/spark/pull/45380#issuecomment-1979831889 Merged to master. @WweiL would you mind opening a backporting PR to branch-3.5? -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [PR] [SPARK-47277] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

2024-03-05 Thread via GitHub
HyukjinKwon closed pull request #45380: [SPARK-47277] PySpark util function assertDataFrameEqual should not support streaming DF URL: https://github.com/apache/spark/pull/45380 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

Re: [PR] [DO-NOT-MERGE] Restructuring MasterSuite [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on PR #45366: URL: https://github.com/apache/spark/pull/45366#issuecomment-1979825939 https://github.com/HyukjinKwon/spark/actions/runs/8164584039 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [PR] [SPARK-47250][SS] Add additional validations and NERF changes for RocksDB state provider and use of column families [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45360: URL: https://github.com/apache/spark/pull/45360#discussion_r1513631712 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala: ## @@ -246,25 +246,35 @@ class RocksDB(

Re: [PR] [SPARK-47251][PYTHON][FOLLOWUP] Use __name__ instead of string representation [spark]

2024-03-05 Thread via GitHub
HyukjinKwon closed pull request #45393: [SPARK-47251][PYTHON][FOLLOWUP] Use __name__ instead of string representation URL: https://github.com/apache/spark/pull/45393 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

Re: [PR] [SPARK-47251][PYTHON][FOLLOWUP] Use __name__ instead of string representation [spark]

2024-03-05 Thread via GitHub
HyukjinKwon commented on PR #45393: URL: https://github.com/apache/spark/pull/45393#issuecomment-1979820319 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
sahnib commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513626298 ## common/utils/src/main/resources/error/error-classes.json: ## @@ -125,6 +125,12 @@ ], "sqlState" : "428FR" }, +

Re: [PR] [SPARK-47250][SS] Add additional validations and NERF changes for RocksDB state provider and use of column families [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45360: URL: https://github.com/apache/spark/pull/45360#discussion_r1513624528 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala: ## @@ -246,25 +246,35 @@ class RocksDB(

Re: [PR] [SPARK-47272][SS] Add MapState implementation for State API v2. [spark]

2024-03-05 Thread via GitHub
jingz-db commented on code in PR #45341: URL: https://github.com/apache/spark/pull/45341#discussion_r1513621273 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StateTypesEncoderUtils.scala: ## @@ -86,3 +88,53 @@ object StateTypesEncoder { new

Re: [PR] [SPARK-47272][SS] Add MapState implementation for State API v2. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45341: URL: https://github.com/apache/spark/pull/45341#discussion_r1513618200 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithMapStateSuite.scala: ## @@ -0,0 +1,392 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-47250][SS] Add additional validations and NERF changes for RocksDB state provider and use of column families [spark]

2024-03-05 Thread via GitHub
sahnib commented on code in PR #45360: URL: https://github.com/apache/spark/pull/45360#discussion_r1513594152 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala: ## @@ -246,25 +246,35 @@ class RocksDB(

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513595185 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithStateWatermarkSuite.scala: ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513594533 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithStateWatermarkSuite.scala: ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513593565 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithStateWatermarkSuite.scala: ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513592033 ## sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala: ## @@ -676,6 +678,43 @@ class KeyValueGroupedDataset[K, V] private[sql]( ) }

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513590003 ## sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala: ## @@ -676,6 +678,43 @@ class KeyValueGroupedDataset[K, V] private[sql]( ) }

Re: [PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

2024-03-05 Thread via GitHub
allisonwang-db commented on PR #45375: URL: https://github.com/apache/spark/pull/45375#issuecomment-1979764609 Looks good! Also cc @ueshin and @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513587938 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithStateWatermarkSuite.scala: ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513586387 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/EventTimeWatermarkExec.scala: ## @@ -129,3 +129,37 @@ case class EventTimeWatermarkExec(

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513585434 ## sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala: ## @@ -676,6 +678,43 @@ class KeyValueGroupedDataset[K, V] private[sql]( ) }

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513584948 ## sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala: ## @@ -676,6 +678,43 @@ class KeyValueGroupedDataset[K, V] private[sql]( ) }

Re: [PR] [SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-03-05 Thread via GitHub
anishshri-db commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1513581167 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/EventTimeWatermark.scala: ## @@ -40,7 +41,8 @@ object EventTimeWatermark { case class
