Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]

2024-03-04 Thread via GitHub
cloud-fan commented on code in PR #45368: URL: https://github.com/apache/spark/pull/45368#discussion_r1512285216 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryRowLevelOperationTableCatalog.scala: ## @@ -31,13 +31,23 @@ class

Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]

2024-03-04 Thread via GitHub
cloud-fan commented on code in PR #45368: URL: https://github.com/apache/spark/pull/45368#discussion_r1512284803 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryPartitionTableCatalog.scala: ## @@ -31,12 +31,22 @@ class InMemoryPartitionTableCatalog

[PR] [SPARK-47248][SQL][COLLATION] Extended string function support: contains [spark]

2024-03-04 Thread via GitHub
uros-db opened a new pull request, #45382: URL: https://github.com/apache/spark/pull/45382 ### What changes were proposed in this pull request? Extend built-in string functions to support non-binary, non-lowercase collation for: contains. ### Why are the changes needed?

Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]

2024-03-04 Thread via GitHub
databius commented on PR #52: URL: https://github.com/apache/spark-docker/pull/52#issuecomment-1978112303 It would be great if we could support old versions instead of only Spark 3.5+. I need an image that supports Scala 2.13 and Spark 3.4.2. Currently, I am building my own image

Re: [PR] [DO-NOT-MERGE] Restructuring MasterSuite [spark]

2024-03-04 Thread via GitHub
HyukjinKwon commented on PR #45366: URL: https://github.com/HyukjinKwon/spark/actions/runs/8152472664/job/22282001033 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] [SPARK-47033][SQL] Fix EXECUTE IMMEDIATE USING does not recognize session variable names [spark]

2024-03-04 Thread via GitHub
andrej-db commented on code in PR #45293: URL: https://github.com/apache/spark/pull/45293#discussion_r1511301326 ## sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala: ## @@ -336,6 +336,19 @@ class QueryExecutionSuite extends SharedSparkSession {

Re: [PR] [SPARK-47176][SQL][FOLLOW-UP] resolveExpressions should have three versions which is the same as resolveOperators [spark]

2024-03-04 Thread via GitHub
cloud-fan closed pull request #45321: [SPARK-47176][SQL][FOLLOW-UP] resolveExpressions should have three versions which is the same as resolveOperators URL: https://github.com/apache/spark/pull/45321

Re: [PR] [SPARK-47176][SQL][FOLLOW-UP] resolveExpressions should have three versions which is the same as resolveOperators [spark]

2024-03-04 Thread via GitHub
cloud-fan commented on PR #45321: URL: https://github.com/apache/spark/pull/45321#issuecomment-1978055064 thanks, merging to master!

Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-04 Thread via GitHub
cloud-fan commented on code in PR #45125: URL: https://github.com/apache/spark/pull/45125#discussion_r1512183617 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -328,6 +328,31 @@ abstract class Optimizer(catalogManager:

Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-04 Thread via GitHub
cloud-fan commented on code in PR #45125: URL: https://github.com/apache/spark/pull/45125#discussion_r1512181287 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -328,6 +328,31 @@ abstract class Optimizer(catalogManager:

Re: [PR] [SPARK-46350][SS] Fix state removal for stream-stream join with one watermark and one time-interval condition [spark]

2024-03-04 Thread via GitHub
rangadi commented on code in PR #44323: URL: https://github.com/apache/spark/pull/44323#discussion_r1512166292 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ## @@ -198,31 +198,50 @@ object

Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]

2024-03-04 Thread via GitHub
panbingkun commented on PR #45368: URL: https://github.com/apache/spark/pull/45368#issuecomment-1978003238 cc @cloud-fan

Re: [PR] [DO-NOT-MERGE] Test Hive pre-2.3.10 [spark]

2024-03-04 Thread via GitHub
LuciferYang commented on code in PR #45372: URL: https://github.com/apache/spark/pull/45372#discussion_r1512146209 ## pom.xml: ## @@ -199,14 +197,14 @@ 2.12.0 4.1.17 -14.0.1 +33.0.0-jre Review Comment: @pan3793 If we upgrade the version of Guava, and

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-04 Thread via GitHub
wForget commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1512148037 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -655,8 +655,17 @@ class Dataset[T] private[sql]( * @group basic * @since 2.4.0 */ -

Re: [PR] [SPARK-47146][CORE] Possible thread leak when doing sort merge join [spark]

2024-03-04 Thread via GitHub
mridulm commented on PR #45327: URL: https://github.com/apache/spark/pull/45327#issuecomment-1977986879 @JacobZheng0927, might be a good idea to backport this to 3.5 as well - will you be able to create a backport PR? (I ran into some issues locally when trying to merge to branch-3.5 and

Re: [PR] [SPARK-47146][CORE] Possible thread leak when doing sort merge join [spark]

2024-03-04 Thread via GitHub
mridulm commented on PR #45327: URL: https://github.com/apache/spark/pull/45327#issuecomment-1977985641 Merged to master. Thanks for fixing this @JacobZheng0927 !

Re: [PR] [SPARK-47146][CORE] Possible thread leak when doing sort merge join [spark]

2024-03-04 Thread via GitHub
mridulm closed pull request #45327: [SPARK-47146][CORE] Possible thread leak when doing sort merge join URL: https://github.com/apache/spark/pull/45327

Re: [PR] [DO-NOT-MERGE] Test Hive pre-2.3.10 [spark]

2024-03-04 Thread via GitHub
LuciferYang commented on PR #45372: URL: https://github.com/apache/spark/pull/45372#issuecomment-1977981587 happy to see this hive upgrade, thanks to @pan3793 and @sunchao

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-04 Thread via GitHub
cloud-fan commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1512136164 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -655,8 +655,17 @@ class Dataset[T] private[sql]( * @group basic * @since 2.4.0 */ -

Re: [PR] [DO-NOT-MERGE] Restructuring MasterSuite [spark]

2024-03-04 Thread via GitHub
HyukjinKwon commented on PR #45366: URL: https://github.com/apache/spark/pull/45366#issuecomment-1977962956 https://github.com/HyukjinKwon/spark/actions/runs/8151190050

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-04 Thread via GitHub
HyukjinKwon commented on code in PR #45373: URL: https://github.com/apache/spark/pull/45373#discussion_r1512129374 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -655,8 +655,17 @@ class Dataset[T] private[sql]( * @group basic * @since 2.4.0 */

Re: [PR] [SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue [spark]

2024-03-04 Thread via GitHub
TakawaAkirayo commented on code in PR #45367: URL: https://github.com/apache/spark/pull/45367#discussion_r1512107482 ## core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala: ## @@ -142,9 +142,11 @@ private class AsyncEventQueue(

Re: [PR] [SPARK-47218] [SQL] XML: Changed schemOfXml to fail on DROPMALFORMED mode [spark]

2024-03-04 Thread via GitHub
sandip-db commented on PR #45379: URL: https://github.com/apache/spark/pull/45379#issuecomment-1977944645 Why are the changes needed? DROPMALFORMED parse mode implies silently dropping the malformed record. But SchemaOfXml is expected to return a schema and may not have a valid schema
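sandip-db's rationale above distinguishes row-level parse modes from schema inference: dropping a malformed record makes sense when parsing rows, but schema inference has no row to drop. A plain-Python sketch of that distinction (illustrative only, not Spark code; names are hypothetical):

```python
from xml.etree import ElementTree

def parse_rows(records, mode="PERMISSIVE"):
    """Row-level parsing: a malformed record can be nulled or dropped."""
    out = []
    for rec in records:
        try:
            out.append(ElementTree.fromstring(rec).tag)
        except ElementTree.ParseError:
            if mode == "FAILFAST":
                raise
            if mode == "PERMISSIVE":
                out.append(None)  # keep a placeholder for the bad record
            # DROPMALFORMED: silently skip the record
    return out

def infer_schema(record):
    """Schema inference has no row to drop: even under DROPMALFORMED
    there is nothing sensible to return for invalid input, so raise."""
    try:
        return ElementTree.fromstring(record).tag
    except ElementTree.ParseError:
        raise ValueError("cannot infer a schema from malformed XML")
```

Here `parse_rows(["<a/>", "<bad", "<b/>"], mode="DROPMALFORMED")` yields `["a", "b"]`, while `infer_schema("<bad")` raises, mirroring the behavior the PR argues `schema_of_xml` should have.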

Re: [PR] [SPARK-47218] [SQL] XML: Changed schemOfXml to fail on DROPMALFORMED mode [spark]

2024-03-04 Thread via GitHub
sandip-db commented on PR #45379: URL: https://github.com/apache/spark/pull/45379#issuecomment-1977938882 nit in title: schemOfXml --> SchemaOfXml

Re: [PR] [SPARK-47270][SQL] Dataset.isEmpty projects CommandResults locally [spark]

2024-03-04 Thread via GitHub
wForget commented on PR #45373: URL: https://github.com/apache/spark/pull/45373#issuecomment-1977928924 @peter-toth @HyukjinKwon @cloud-fan could you please take a look?

Re: [PR] [SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue [spark]

2024-03-04 Thread via GitHub
beliefer commented on code in PR #45367: URL: https://github.com/apache/spark/pull/45367#discussion_r1512093558 ## core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala: ## @@ -142,9 +142,11 @@ private class AsyncEventQueue( eventCount.incrementAndGet()

Re: [PR] [SPARK-46989][SQL][CONNECT] Improve concurrency performance for SparkSession [spark]

2024-03-04 Thread via GitHub
beliefer commented on code in PR #45046: URL: https://github.com/apache/spark/pull/45046#discussion_r1482658250 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala: ## @@ -854,7 +855,7 @@ object SparkSession extends Logging { // the

Re: [PR] [SPARK-47278][BUILD] Upgrade rocksdbjni to 8.11.3 [spark]

2024-03-04 Thread via GitHub
LuciferYang commented on PR #45365: URL: https://github.com/apache/spark/pull/45365#issuecomment-1977889225 Let's run another two or three rounds of tests

Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

2024-03-04 Thread via GitHub
beliefer commented on code in PR #45374: URL: https://github.com/apache/spark/pull/45374#discussion_r1512072983 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -582,11 +582,7 @@ object SQLConf { val AUTO_BROADCASTJOIN_THRESHOLD =

Re: [PR] [SPARK-47176][SQL][FOLLOW-UP] resolveExpressions should have three versions which is the same as resolveOperators [spark]

2024-03-04 Thread via GitHub
amaliujia commented on code in PR #45321: URL: https://github.com/apache/spark/pull/45321#discussion_r1512059410 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/AnalysisHelper.scala: ## @@ -229,6 +229,14 @@ trait AnalysisHelper extends

Re: [PR] [DO-NOT-MERGE] Restructuring MasterSuite [spark]

2024-03-04 Thread via GitHub
HyukjinKwon commented on PR #45366: URL: https://github.com/apache/spark/pull/45366#issuecomment-1977840217 https://github.com/HyukjinKwon/spark/actions/runs/8150215353/job/22276105311

Re: [PR] [SPARK-47177][SQL][3.4] Cached SQL plan do not display final AQE plan in explain string [spark]

2024-03-04 Thread via GitHub
dongjoon-hyun commented on PR #45381: URL: https://github.com/apache/spark/pull/45381#issuecomment-1977838516 Thank you, @ulysses-you .

Re: [PR] [SPARK-47177][SQL] Cached SQL plan do not display final AQE plan in explain string [spark]

2024-03-04 Thread via GitHub
dongjoon-hyun commented on PR #45282: URL: https://github.com/apache/spark/pull/45282#issuecomment-1977838135 Thank you! That's better and safe.

Re: [PR] [SPARK-47177][SQL] Cached SQL plan do not display final AQE plan in explain string [spark]

2024-03-04 Thread via GitHub
ulysses-you commented on PR #45282: URL: https://github.com/apache/spark/pull/45282#issuecomment-1977837540 @dongjoon-hyun there are some conflicts, I created a new pr https://github.com/apache/spark/pull/45381 for branch-3.4

[PR] [SPARK-47177][SQL][3.4] Cached SQL plan do not display final AQE plan in explain string [spark]

2024-03-04 Thread via GitHub
ulysses-you opened a new pull request, #45381: URL: https://github.com/apache/spark/pull/45381 This pr backport https://github.com/apache/spark/pull/45282 to branch-3.4 ### What changes were proposed in this pull request? This pr adds lock for ExplainUtils.processPlan
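The backport described above adds a lock around ExplainUtils.processPlan so that concurrent explain calls don't race on shared operator-ID state. A minimal Python sketch of that pattern (hypothetical names; not the actual Spark implementation, which is Scala):

```python
import threading

_explain_lock = threading.Lock()
_operator_ids = {}  # shared mutable state: plan node -> operator ID

def process_plan(nodes):
    """Assign operator IDs to a plan's nodes. The lock serializes
    concurrent explain() calls that would otherwise interleave while
    mutating the shared ID table, producing a corrupted explain string."""
    with _explain_lock:
        _operator_ids.clear()
        for i, node in enumerate(nodes):
            _operator_ids[node] = i
        return dict(_operator_ids)  # snapshot taken while still holding the lock
```

The key design point is that both the mutation and the snapshot happen under the same lock, so each caller sees a consistent ID assignment.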

Re: [PR] [SPARK-47176][SQL][FOLLOW-UP] resolveExpressions should have three versions which is the same as resolveOperators [spark]

2024-03-04 Thread via GitHub
cloud-fan commented on code in PR #45321: URL: https://github.com/apache/spark/pull/45321#discussion_r1512050470 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/AnalysisHelper.scala: ## @@ -229,6 +229,14 @@ trait AnalysisHelper extends

Re: [PR] [SPARK-47177][SQL] Cached SQL plan do not display final AQE plan in explain string [spark]

2024-03-04 Thread via GitHub
dongjoon-hyun commented on PR #45282: URL: https://github.com/apache/spark/pull/45282#issuecomment-1977829088 BTW, #40812 landed in Apache Spark 3.4.1, didn't it? If so, it seems we need to backport this to branch-3.4, @ulysses-you.

Re: [PR] [SPARK-47277] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

2024-03-04 Thread via GitHub
HyukjinKwon commented on PR #45380: URL: https://github.com/apache/spark/pull/45380#issuecomment-1977823298 cc @itholic

Re: [PR] [SPARK-47277] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

2024-03-04 Thread via GitHub
WweiL commented on PR #45380: URL: https://github.com/apache/spark/pull/45380#issuecomment-1977822310 I'm having some local build issues; since this is a small change, I want to defer the tests to the remote CI

Re: [PR] [SPARK-47177][SQL] Cached SQL plan do not display final AQE plan in explain string [spark]

2024-03-04 Thread via GitHub
ulysses-you closed pull request #45282: [SPARK-47177][SQL] Cached SQL plan do not display final AQE plan in explain string URL: https://github.com/apache/spark/pull/45282

[PR] [SPARK-47277] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

2024-03-04 Thread via GitHub
WweiL opened a new pull request, #45380: URL: https://github.com/apache/spark/pull/45380 ### What changes were proposed in this pull request? The handy util function should not support streaming DataFrames; currently, if you call it on streaming queries, it throws a
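The guard this PR proposes amounts to an early type check before any comparison work. A hedged sketch of that pattern (hypothetical helper name; the real API is PySpark's `assertDataFrameEqual` in `pyspark.testing`):

```python
def assert_dataframe_equal(actual, expected):
    """Reject streaming inputs up front instead of failing later with a
    confusing error (sketch of the PR's intent, not PySpark's code)."""
    for df in (actual, expected):
        if getattr(df, "isStreaming", False):
            raise ValueError(
                "assert_dataframe_equal does not support streaming DataFrames"
            )
    # ... the actual row-by-row comparison would follow here ...
    return True
```

Failing fast with an explicit message is preferable here because a streaming DataFrame has no finite set of rows to collect and compare.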

Re: [PR] [SPARK-47177][SQL] Cached SQL plan do not display final AQE plan in explain string [spark]

2024-03-04 Thread via GitHub
ulysses-you commented on PR #45282: URL: https://github.com/apache/spark/pull/45282#issuecomment-1977821766 thanks for review, merging to master/branch-3.5

Re: [PR] [SPARK-47177][SQL] Cached SQL plan do not display final AQE plan in explain string [spark]

2024-03-04 Thread via GitHub
liuzqt commented on code in PR #45282: URL: https://github.com/apache/spark/pull/45282#discussion_r1512035940 ## sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryRelationSuite.scala: ## @@ -18,27 +18,42 @@ package org.apache.spark.sql.execution.columnar

Re: [PR] [SPARK-47247][SQL] Use smaller target size when coalescing partitions with exploding joins [spark]

2024-03-04 Thread via GitHub
yaooqinn commented on code in PR #45357: URL: https://github.com/apache/spark/pull/45357#discussion_r1512030230 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala: ## @@ -126,9 +126,12 @@ case class

Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

2024-03-04 Thread via GitHub
nchammas commented on code in PR #45374: URL: https://github.com/apache/spark/pull/45374#discussion_r1512030136 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -582,11 +582,7 @@ object SQLConf { val AUTO_BROADCASTJOIN_THRESHOLD =

Re: [PR] [SPARK-47272][SS] Add MapState implementation for State API v2. [spark]

2024-03-04 Thread via GitHub
anishshri-db commented on code in PR #45341: URL: https://github.com/apache/spark/pull/45341#discussion_r1512013529 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala: ## @@ -29,6 +29,14 @@ sealed trait RocksDBKeyStateEncoder {

Re: [PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

2024-03-04 Thread via GitHub
dtenedor commented on code in PR #45375: URL: https://github.com/apache/spark/pull/45375#discussion_r1511980443 ## python/docs/source/user_guide/sql/python_udtf.rst: ## @@ -63,6 +63,7 @@ To implement a Python UDTF, you first need to define a class implementing the me

Re: [PR] [SPARK-47272][SS] Add MapState implementation for State API v2. [spark]

2024-03-04 Thread via GitHub
anishshri-db commented on code in PR #45341: URL: https://github.com/apache/spark/pull/45341#discussion_r1512002599 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MapStateImpl.scala: ## @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation

[PR] [SPARK-47218] [SQL] XML: Changed schemOfXml to fail on DROPMALFORMED mode [spark]

2024-03-04 Thread via GitHub
yhosny opened a new pull request, #45379: URL: https://github.com/apache/spark/pull/45379 ### What changes were proposed in this pull request? Changed schema_of_xml to fail with an error even in DROPMALFORMED mode, to avoid creating schemas out of invalid XML.

Re: [PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

2024-03-04 Thread via GitHub
itholic commented on PR #45377: URL: https://github.com/apache/spark/pull/45377#issuecomment-191911 On second thought, we need to keep the previous transformation stacktrace to provide more accurate context. Will push more commits to update it.

[PR] [WIP] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-04 Thread via GitHub
xinrong-meng opened a new pull request, #45378: URL: https://github.com/apache/spark/pull/45378 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was

Re: [PR] [SPARK-47177][SQL] Cached SQL plan do not display final AQE plan in explain string [spark]

2024-03-04 Thread via GitHub
ulysses-you commented on code in PR #45282: URL: https://github.com/apache/spark/pull/45282#discussion_r1511998525 ## sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryRelationSuite.scala: ## @@ -18,27 +18,42 @@ package

Re: [PR] [SPARK-47155][PYTHON] Fix Error Class Issue [spark]

2024-03-04 Thread via GitHub
HyukjinKwon commented on code in PR #45306: URL: https://github.com/apache/spark/pull/45306#discussion_r1511990741 ## python/pyspark/sql/worker/create_data_source.py: ## @@ -150,8 +150,8 @@ def main(infile: IO, outfile: IO) -> None: is_ddl_string = True

Re: [PR] [SPARK-47252][DOCS] Clarify that pivot may trigger an eager computation [spark]

2024-03-04 Thread via GitHub
HyukjinKwon closed pull request #45363: [SPARK-47252][DOCS] Clarify that pivot may trigger an eager computation URL: https://github.com/apache/spark/pull/45363

Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

2024-03-04 Thread via GitHub
HyukjinKwon commented on code in PR #45374: URL: https://github.com/apache/spark/pull/45374#discussion_r1511987787 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -582,11 +582,7 @@ object SQLConf { val AUTO_BROADCASTJOIN_THRESHOLD =

Re: [PR] [SPARK-47252][DOCS] Clarify that pivot may trigger an eager computation [spark]

2024-03-04 Thread via GitHub
HyukjinKwon commented on PR #45363: URL: https://github.com/apache/spark/pull/45363#issuecomment-1977745985 Merged to master.

Re: [PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

2024-03-04 Thread via GitHub
itholic commented on code in PR #45377: URL: https://github.com/apache/spark/pull/45377#discussion_r1511983720 ## python/pyspark/errors/utils.py: ## @@ -119,3 +127,73 @@ def get_message_template(self, error_class: str) -> str: message_template =

Re: [PR] [DO-NOT-MERGE] Avoid OOM in MasterSuite with Mac OS [spark]

2024-03-04 Thread via GitHub
dongjoon-hyun commented on code in PR #45366: URL: https://github.com/apache/spark/pull/45366#discussion_r1511982665 ## core/src/test/scala/org/apache/spark/deploy/master/WorkerSelectionSuite.scala: ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

Re: [PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

2024-03-04 Thread via GitHub
itholic commented on PR #45377: URL: https://github.com/apache/spark/pull/45377#issuecomment-1977737988 cc @HyukjinKwon FYI, I'm still working on Spark Connect support and unit tests but the basic structure is ready for review. FYI, also cc @MaxGekk as you made a similar contribution

Re: [PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

2024-03-04 Thread via GitHub
itholic commented on PR #45377: URL: https://github.com/apache/spark/pull/45377#issuecomment-1977735804 I'm still working on Spark Connect support and unit tests, but the basic structure is ready for review.

Re: [PR] [DO-NOT-MERGE] Avoid OOM in MasterSuite with Mac OS [spark]

2024-03-04 Thread via GitHub
HyukjinKwon commented on PR #45366: URL: https://github.com/apache/spark/pull/45366#issuecomment-1977735310 test: https://github.com/HyukjinKwon/spark/actions/runs/8149143761/job/22273296949

[PR] [WIP][SPARK-47274][PYTHON][CONNECT] Provide more useful context for PySpark DataFrame API errors [spark]

2024-03-04 Thread via GitHub
itholic opened a new pull request, #45377: URL: https://github.com/apache/spark/pull/45377 ### What changes were proposed in this pull request? This PR introduces an enhancement to the error messages generated by PySpark's DataFrame API, adding detailed context about the location
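One common way to attach call-site context to API errors, as this PR sets out to do, is to capture the user's stack frame when an exception escapes the API boundary. A rough Python sketch of the idea (illustrative only, not PySpark's actual mechanism; the decorator name is hypothetical):

```python
import functools
import traceback

def with_call_site(func):
    """Wrap an API entry point so errors report where the user called it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # extract_stack()[-1] is this wrapper; [-2] is the frame that
            # called the wrapped function, i.e. the user's call site.
            site = traceback.extract_stack()[-2]
            raise type(exc)(
                f"{exc} (called from {site.filename}:{site.lineno})"
            ) from exc
    return wrapper
```

Surfacing the user's file and line, rather than a frame deep inside the library, is what makes DataFrame API errors actionable.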

Re: [PR] [SPARK-47202][PYTHON] Fix typo breaking datetimes with tzinfo [spark]

2024-03-04 Thread via GitHub
arzavj commented on PR #45301: URL: https://github.com/apache/spark/pull/45301#issuecomment-1977729895 @HyukjinKwon do you know when I can expect 3.5.2 to be released to be able to take advantage of this bug fix?

Re: [PR] [SPARK-47155][PYTHON] Fix Error Class Issue [spark]

2024-03-04 Thread via GitHub
HyukjinKwon commented on PR #45306: URL: https://github.com/apache/spark/pull/45306#issuecomment-1977726281 I think the actions should be enabled at https://github.com/sunan135/spark/settings/actions by `Allow all actions and reusable workflows`

Re: [PR] [SPARK-45954][SQL] Remove redundant shuffles [spark]

2024-03-04 Thread via GitHub
github-actions[bot] commented on PR #43841: URL: https://github.com/apache/spark/pull/43841#issuecomment-1977715036 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[PR] Allow chaining other stateful operators after transformWithState operator. [spark]

2024-03-04 Thread via GitHub
sahnib opened a new pull request, #45376: URL: https://github.com/apache/spark/pull/45376 ### What changes were proposed in this pull request? This PR adds support to define event time column in the output dataset of `TransformWithState` operator. The new event time column

Re: [PR] [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too [spark]

2024-03-04 Thread via GitHub
helenweng-stripe commented on PR #33934: URL: https://github.com/apache/spark/pull/33934#issuecomment-1977631285 Wonder if we can reconsider merging this PR in? We've had to make a similar patch internally to support PySpark users.

Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]

2024-03-04 Thread via GitHub
chaoqin-li1123 commented on code in PR #45023: URL: https://github.com/apache/spark/pull/45023#discussion_r1511914150 ## python/pyspark/sql/datasource.py: ## @@ -298,6 +320,104 @@ def read(self, partition: InputPartition) -> Iterator[Union[Tuple, Row]]: ... +class

Re: [PR] [WIP] Test rocksdbjni 8.11.3 [spark]

2024-03-04 Thread via GitHub
dongjoon-hyun commented on PR #45365: URL: https://github.com/apache/spark/pull/45365#issuecomment-1977616307 Thank you, @LuciferYang . Is it ready?

Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

2024-03-04 Thread via GitHub
ueshin commented on code in PR #45269: URL: https://github.com/apache/spark/pull/45269#discussion_r1511898364 ## python/docs/source/reference/pyspark.sql/spark_session.rst: ## @@ -49,6 +49,7 @@ See also :class:`SparkSession`. SparkSession.createDataFrame

Re: [PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

2024-03-04 Thread via GitHub
allisonwang-db commented on code in PR #45375: URL: https://github.com/apache/spark/pull/45375#discussion_r1511851865 ## python/docs/source/user_guide/sql/python_udtf.rst: ## @@ -63,6 +63,7 @@ To implement a Python UDTF, you first need to define a class implementing the me

Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

2024-03-04 Thread via GitHub
xinrong-meng commented on code in PR #45269: URL: https://github.com/apache/spark/pull/45269#discussion_r1511840961 ## python/docs/source/reference/pyspark.sql/spark_session.rst: ## @@ -49,6 +49,7 @@ See also :class:`SparkSession`. SparkSession.createDataFrame

Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-04 Thread via GitHub
agubichev commented on PR #45125: URL: https://github.com/apache/spark/pull/45125#issuecomment-1977518691 > Thanks for the fix, looks good overall. > > Let's add a gating flag for this change just in case of any issues. added a flag

Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-04 Thread via GitHub
agubichev commented on PR #45125: URL: https://github.com/apache/spark/pull/45125#issuecomment-1977518871 @cloud-fan

Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-04 Thread via GitHub
agubichev commented on code in PR #45125: URL: https://github.com/apache/spark/pull/45125#discussion_r1511831412 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -328,6 +328,30 @@ abstract class Optimizer(catalogManager:

Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-04 Thread via GitHub
agubichev commented on PR #45125: URL: https://github.com/apache/spark/pull/45125#issuecomment-1977503018 > What about if there's another node above the aggregate in the subquery, such as a filter after the aggregate (having clause)? added a test, but any non-trivial node above the

Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-04 Thread via GitHub
agubichev commented on code in PR #45125: URL: https://github.com/apache/spark/pull/45125#discussion_r1511830190 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -328,6 +328,30 @@ abstract class Optimizer(catalogManager:
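For context on the "count bug" this PR addresses: a correlated scalar COUNT subquery must return 0 (not NULL) for outer rows with no matches, and a naive decorrelation into an outer join plus pre-aggregation loses exactly that. A self-contained SQLite demonstration of the two behaviors (not Spark code):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1(x INT);
    INSERT INTO t1 VALUES (1), (2);
    CREATE TABLE t2(y INT);
    INSERT INTO t2 VALUES (1);
""")

# Correct semantics: COUNT(*) over an empty correlated group is 0.
correct = con.execute("""
    SELECT x, (SELECT COUNT(*) FROM t2 WHERE t2.y = t1.x)
    FROM t1 ORDER BY x
""").fetchall()
# correct -> [(1, 1), (2, 0)]

# A naive decorrelation into an outer join over a pre-aggregated
# subquery turns that 0 into NULL for unmatched rows -- the count bug.
naive = con.execute("""
    SELECT x, cnt
    FROM t1 LEFT JOIN (SELECT y, COUNT(*) AS cnt FROM t2 GROUP BY y) s
      ON s.y = t1.x
    ORDER BY x
""").fetchall()
# naive -> [(1, 1), (2, None)]
```

An optimizer that rewrites plans (e.g. via constant folding) must therefore preserve the COALESCE-to-zero step the correct decorrelation inserts, which is the invariant the PR's tests exercise.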

Re: [PR] [SPARK-42627][SPARK-26494][SQL] Support Oracle TIMESTAMP WITH LOCAL TIME ZONE [spark]

2024-03-04 Thread via GitHub
steveloughran commented on PR #45337: URL: https://github.com/apache/spark/pull/45337#issuecomment-1977449543 @dongjoon-hyun I'm just thinking of all the timestamps in ORC and parquet and when they are local vs UTC...

Re: [PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

2024-03-04 Thread via GitHub
dtenedor commented on PR #45375: URL: https://github.com/apache/spark/pull/45375#issuecomment-1977448390 cc @allisonwang-db @ueshin

Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-03-04 Thread via GitHub
y-wei commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1511783477 ## core/src/main/scala/org/apache/spark/Dependency.scala: ## @@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( finalizeTask =
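The warning proposed in SPARK-39771 is about the product of map and reduce partition counts: a shuffle materializes one block per (map partition, reduce partition) pair, so the count can explode quietly. A hedged sketch of the arithmetic — the threshold name and value here are hypothetical, not Spark's actual configuration:

```python
import logging

SHUFFLE_BLOCKS_WARN_THRESHOLD = 1_000_000  # hypothetical threshold

def check_shuffle_blocks(num_map_partitions: int, num_reduce_partitions: int) -> int:
    """Count the shuffle blocks a stage would create and warn when the
    map x reduce product gets very large."""
    num_blocks = num_map_partitions * num_reduce_partitions
    if num_blocks > SHUFFLE_BLOCKS_WARN_THRESHOLD:
        logging.warning(
            "Shuffle will create %d blocks (%d maps x %d reducers); "
            "consider reducing partition counts",
            num_blocks, num_map_partitions, num_reduce_partitions)
    return num_blocks

print(check_shuffle_blocks(2000, 1000))  # 2000000, logs a warning
```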

[PR] [SPARK-44746][Python] Add more Python UDTF documentation for functions that accept input tables [spark]

2024-03-04 Thread via GitHub
dtenedor opened a new pull request, #45375: URL: https://github.com/apache/spark/pull/45375 ### What changes were proposed in this pull request? This PR adds more Python UDTF documentation for functions that accept input tables. ### Why are the changes needed? This
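The documentation this PR extends covers Python UDTFs that take a `TABLE(...)` argument, where Spark passes each input row to `eval`. The sketch below shows that shape with a plain Python class so the `eval` logic runs standalone; the registration and SQL in the comments are a hedged outline of the PySpark 3.5+ API, and the names `FilterLarge` / `filter_large` are made up for illustration:

```python
# In Spark this class would be wrapped and registered, roughly:
#   from pyspark.sql.functions import udtf
#   spark.udtf.register("filter_large", udtf(FilterLarge, returnType="id: int"))
#   spark.sql("SELECT * FROM filter_large(TABLE(v))")
# Here it stays plain Python so eval can be exercised directly.

class FilterLarge:
    """Emits one output row per input row whose 'id' exceeds a cutoff."""
    def eval(self, row: dict):
        # With a TABLE argument, eval is called once per input row and
        # yields zero or more output tuples.
        if row["id"] > 10:
            yield (row["id"],)

rows = [{"id": 5}, {"id": 15}, {"id": 25}]
out = [r for row in rows for r in FilterLarge().eval(row)]
print(out)  # [(15,), (25,)]
```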

Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-03-04 Thread via GitHub
sadikovi commented on PR #45266: URL: https://github.com/apache/spark/pull/45266#issuecomment-1977437455 cc @y-wei to address the remaining comments and retrigger the tests.

Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-03-04 Thread via GitHub
dongjoon-hyun commented on PR #45266: URL: https://github.com/apache/spark/pull/45266#issuecomment-1977373125 How about the AS-IS status, @mridulm ?

Re: [PR] [SPARK-46350][SS] Fix state removal for stream-stream join with one watermark and one time-interval condition [spark]

2024-03-04 Thread via GitHub
neilramaswamy commented on code in PR #44323: URL: https://github.com/apache/spark/pull/44323#discussion_r1511719668 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ## @@ -198,31 +198,52 @@ object
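The fix discussed in SPARK-46350 concerns when a stream-stream join may evict buffered rows if the condition mixes a watermark with a time-interval predicate: a buffered row is only removable once no future input (which always arrives at or above the watermark) can still satisfy the interval condition against it. A hedged pure-Python sketch of that eviction rule, not Spark's `StreamingSymmetricHashJoinHelper` logic:

```python
def evict_state(buffered, watermark, max_interval):
    """buffered: list of buffered event times. A row at time t can still
    match inputs in [t, t + max_interval], so it is only evictable once
    t + max_interval < watermark."""
    kept, evicted = [], []
    for t in buffered:
        (evicted if t + max_interval < watermark else kept).append(t)
    return kept, evicted

kept, evicted = evict_state([10, 50, 90], watermark=60, max_interval=15)
print(kept, evicted)  # [50, 90] [10]
```

Evicting on the watermark alone, ignoring the interval, would drop the row at t=50 even though an input at t=60 could still join with it — which is the shape of bug the PR addresses.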

Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

2024-03-04 Thread via GitHub
nchammas commented on code in PR #45374: URL: https://github.com/apache/spark/pull/45374#discussion_r1511715423 ## docs/sql-performance-tuning.md: ## @@ -157,6 +157,18 @@ SELECT /*+ REBALANCE(3, c) */ * FROM t; For more details please refer to the documentation of
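The point the tuning-page change makes — accurate size statistics let the planner pick a broadcast join instead of a shuffle — reduces to a simple comparison. A hedged sketch of that decision; the function, threshold default, and table sizes are illustrative, not Spark's planner code:

```python
def choose_join_strategy(stats, left, right, broadcast_threshold=10 * 1024 * 1024):
    """With sizeInBytes statistics available, broadcast the smaller
    relation when it fits under the threshold; otherwise shuffle both."""
    small, large = sorted((left, right), key=lambda t: stats[t])
    if stats[small] <= broadcast_threshold:
        return ("broadcast", small)
    return ("shuffle", None)

stats = {"orders": 5 * 1024**3, "dim_region": 2 * 1024**2}
print(choose_join_strategy(stats, "orders", "dim_region"))  # ('broadcast', 'dim_region')
```

Without statistics the planner must fall back to conservative size estimates, which is why the page recommends running `ANALYZE TABLE ... COMPUTE STATISTICS` before relying on cost-based choices.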

Re: [PR] [SS] Add MapState implementation for State API v2. [spark]

2024-03-04 Thread via GitHub
jingz-db commented on code in PR #45341: URL: https://github.com/apache/spark/pull/45341#discussion_r1511697862 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala: ## @@ -29,6 +29,14 @@ sealed trait RocksDBKeyStateEncoder {

Re: [PR] [SS] Add MapState implementation for State API v2. [spark]

2024-03-04 Thread via GitHub
jingz-db commented on PR #45341: URL: https://github.com/apache/spark/pull/45341#issuecomment-1977336528 Thanks Eric for reviews on my old PR. I've resolved them and incorporated them into this one already.

Re: [PR] [SPARK-47242][BUILD] Bump ap-loader 3.0(v8) to support for async-profiler 3.0 [spark]

2024-03-04 Thread via GitHub
dongjoon-hyun commented on PR #45351: URL: https://github.com/apache/spark/pull/45351#issuecomment-1977336300 I added you to the Apache Spark contributor group, @SteNicholas, and assigned SPARK-47242 to you. Welcome to the Apache Spark community!

Re: [PR] [SPARK-47242][BUILD] Bump ap-loader 3.0(v8) to support for async-profiler 3.0 [spark]

2024-03-04 Thread via GitHub
dongjoon-hyun closed pull request #45351: [SPARK-47242][BUILD] Bump ap-loader 3.0(v8) to support for async-profiler 3.0 URL: https://github.com/apache/spark/pull/45351

Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

2024-03-04 Thread via GitHub
xinrong-meng commented on code in PR #45269: URL: https://github.com/apache/spark/pull/45269#discussion_r1511707998 ## python/docs/source/development/debugging.rst: ## @@ -341,7 +372,12 @@ Python/Pandas UDF ~ To use this on Python/Pandas UDFs, PySpark

Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

2024-03-04 Thread via GitHub
xinrong-meng commented on code in PR #45269: URL: https://github.com/apache/spark/pull/45269#discussion_r1511705327 ## python/docs/source/reference/pyspark.sql/spark_session.rst: ## @@ -49,6 +49,7 @@ See also :class:`SparkSession`. SparkSession.createDataFrame

Re: [PR] [SPARK-47242][BUILD] Bump ap-loader 3.0(v8) to support for async-profiler 3.0 [spark]

2024-03-04 Thread via GitHub
parthchandra commented on PR #45351: URL: https://github.com/apache/spark/pull/45351#issuecomment-1977330468 > @parthchandra, thank you for trying it out. Have you found anything wrong? I was able to try it out locally (non-production) and the jfr files written were fine. I didn't see much

Re: [PR] [SS] Add MapState implementation for State API v2. [spark]

2024-03-04 Thread via GitHub
jingz-db commented on code in PR #45341: URL: https://github.com/apache/spark/pull/45341#discussion_r1511693823 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala: ## @@ -60,13 +60,25 @@ trait ReadStateStore { /** Version of the data
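The MapState surface this PR adds to State API v2 gives each grouping key its own map of user keys to values, with lookups, updates, and per-key removal backed by the RocksDB state encoder. A toy model of that surface — the method names below approximate the API shape for illustration and are not Spark's implementation:

```python
class MapState:
    """Toy model of per-grouping-key map state (State API v2 shape):
    each grouping key owns an isolated user-key -> value map."""
    def __init__(self):
        self._store = {}

    def get_value(self, grouping_key, user_key):
        return self._store.get(grouping_key, {}).get(user_key)

    def update_value(self, grouping_key, user_key, value):
        self._store.setdefault(grouping_key, {})[user_key] = value

    def remove_key(self, grouping_key, user_key):
        self._store.get(grouping_key, {}).pop(user_key, None)

    def keys(self, grouping_key):
        return list(self._store.get(grouping_key, {}))

state = MapState()
state.update_value("user-1", "clicks", 3)
state.update_value("user-1", "views", 7)
state.remove_key("user-1", "views")
print(state.keys("user-1"))  # ['clicks']
```

In the real store, the encoder review threads above concern how the composite (grouping key, user key) pair is serialized into a single RocksDB key.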
