Re: [PR] [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page [spark]
dongjoon-hyun closed pull request #46274: [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page URL: https://github.com/apache/spark/pull/46274 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page [spark]
dongjoon-hyun commented on PR #46274: URL: https://github.com/apache/spark/pull/46274#issuecomment-2081920450 Merged to master~
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun closed pull request #46271: [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` URL: https://github.com/apache/spark/pull/46271
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun commented on PR #46271: URL: https://github.com/apache/spark/pull/46271#issuecomment-2081891190 I attached the screenshots too. Merged to master~
[PR] [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page [spark]
HyukjinKwon opened a new pull request, #46274: URL: https://github.com/apache/spark/pull/46274 ### What changes were proposed in this pull request? This PR removes a space in the middle of a configuration name on the Arrow-optimized Python UDF page. ![Screenshot 2024-04-29 at 1 53 42 PM](https://github.com/apache/spark/assets/6477701/46b7c448-fb30-4838-a5ba-c8f1c23398fd) https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#arrow-python-udfs ### Why are the changes needed? So users can copy and paste the configuration names properly. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the doc. ### How was this patch tested? Manually built the docs and checked. ### Was this patch authored or co-authored using generative AI tooling? No.
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun commented on PR #46271: URL: https://github.com/apache/spark/pull/46271#issuecomment-2081887536 Thank you for helping me revise this doc, @yaooqinn !
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun commented on code in PR #46271: URL: https://github.com/apache/spark/pull/46271#discussion_r1582534717 ## docs/sql-ref-ansi-compliance.md: ## @@ -67,10 +67,8 @@ The following subsections present behaviour changes in arithmetic operations, ty ### Arithmetic Operations -In Spark SQL, arithmetic operations performed on numeric types (with the exception of decimal) are not checked for overflows by default. -This means that in case an operation causes overflows, the result is the same with the corresponding operation in a Java/Scala program (e.g., if the sum of 2 integers is higher than the maximum value representable, the result is a negative number). -On the other hand, Spark SQL returns null for decimal overflows. -When `spark.sql.ansi.enabled` is set to `true` and an overflow occurs in numeric and interval arithmetic operations, it throws an arithmetic exception at runtime. +In Spark SQL, by default, Spark throws an arithmetic exception at runtime for non-decimal numeric overflows and returns null for decimal type overflows. Review Comment: I agree that it looks much clearer and covers the missing `interval` part. Thanks!
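As a quick sanity check of the overflow semantics discussed in the doc change above (a plain-Scala sketch, independent of Spark itself): unchecked `Int` arithmetic on the JVM wraps around exactly as the old non-ANSI default described, while the checked JVM counterpart `Math.addExact` throws, mirroring what ANSI mode does at the SQL level.

```scala
// Non-ANSI Spark SQL mirrors JVM semantics: Int overflow silently wraps,
// so the sum of two large positive Ints comes back negative.
val wrapped = Int.MaxValue + 1
assert(wrapped == Int.MinValue) // wraps to -2147483648

// ANSI mode raises an error instead; the analogous checked JVM operation
// is Math.addExact, which throws ArithmeticException on overflow.
val threw =
  try { Math.addExact(Int.MaxValue, 1); false }
  catch { case _: ArithmeticException => true }
assert(threw)
```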
Re: [PR] [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image [spark]
dongjoon-hyun closed pull request #46258: [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image URL: https://github.com/apache/spark/pull/46258
Re: [PR] [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image [spark]
dongjoon-hyun commented on PR #46258: URL: https://github.com/apache/spark/pull/46258#issuecomment-2081880744 Merged to master.
[PR] [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]
cxzl25 opened a new pull request, #46273: URL: https://github.com/apache/spark/pull/46273 ### What changes were proposed in this pull request? This PR fixes SortShuffleWriter's missing shuffle-write-related metrics, which can lead to inaccurate data. ### Why are the changes needed? When the shuffle writer is SortShuffleWriter, it does not use SQLShuffleWriteMetricsReporter to update metrics, so when AQE obtains runtime statistics, the rowCount it gets is 0. Some optimization rules rely on rowCount statistics, such as `EliminateLimits`. Because rowCount is 0, the rule removes the limit operator, and we get results without the limit applied. https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L168-L172 https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2067-L2070 ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Production environment verification. **master metrics** https://github.com/apache/spark/assets/3898450/dc9b6e8a-93ec-4f59-a903-71aa5b11962c **PR metrics** https://github.com/apache/spark/assets/3898450/2d73b773-2dcc-4d23-81de-25dcadac86c1 ### Was this patch authored or co-authored using generative AI tooling? No
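The failure mode described above can be sketched without Spark. This is a hypothetical, simplified model (the names `ChildPlan` and `keepLimit` are illustrative, not Spark's real API): an `EliminateLimits`-style rule compares the limit against the child plan's row-count statistic, so an erroneously-zero rowCount makes any limit look redundant.

```scala
// Simplified model of how a rowCount statistic of 0 defeats a
// limit-elimination rule; illustrative names, not Spark's real API.
case class ChildPlan(rows: Seq[Int], reportedRowCount: Long)

// A Limit is redundant when statistics claim the child can never
// produce more than `limit` rows.
def keepLimit(limit: Int, child: ChildPlan): Boolean =
  child.reportedRowCount > limit

// Buggy metrics report 0 rows even though the child produces 100.
val child = ChildPlan(rows = Seq.tabulate(100)(identity), reportedRowCount = 0L)

// The rule "eliminates" the limit ...
assert(!keepLimit(10, child))

// ... so the query returns all 100 rows instead of at most 10.
val result = if (keepLimit(10, child)) child.rows.take(10) else child.rows
assert(result.size == 100)
```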
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
yaooqinn commented on PR #46270: URL: https://github.com/apache/spark/pull/46270#issuecomment-2081840828 Merged to master(4.0.0), 3.5.2 and 3.4.4.
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
yaooqinn closed pull request #46270: [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark URL: https://github.com/apache/spark/pull/46270
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun commented on PR #46271: URL: https://github.com/apache/spark/pull/46271#issuecomment-2081838017 Thank you so much, @yaooqinn ! I updated the PR according to your comments.
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
yaooqinn commented on PR #46270: URL: https://github.com/apache/spark/pull/46270#issuecomment-2081825069 Thank you very much @dongjoon-hyun
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
yaooqinn commented on PR #46271: URL: https://github.com/apache/spark/pull/46271#issuecomment-2081821772 LGTM, only [content here](https://github.com/apache/spark/pull/46271/files#diff-54eee79bd27cf5ca1288b078a7b0b1b5ae8ae9d8b4ee7fb75f0b0c7cdaef0da8L70-L73) might need further revision
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on PR #46264: URL: https://github.com/apache/spark/pull/46264#issuecomment-2081820429 cc @gengliangwang
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582501170 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -124,8 +130,8 @@ object LogKeys { case object DIFF_DELTA extends LogKey case object DIVISIBLE_CLUSTER_INDICES_SIZE extends LogKey case object DRIVER_ID extends LogKey - case object DROPPED_PARTITIONS extends LogKey Review Comment: Rename `DROPPED_PARTITIONS` to `NUM_DROPPED_PARTITIONS`
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
dongjoon-hyun commented on PR #46270: URL: https://github.com/apache/spark/pull/46270#issuecomment-2081817478 Feel free to merge and backport wherever you need this, @yaooqinn .
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun commented on PR #46271: URL: https://github.com/apache/spark/pull/46271#issuecomment-2081817738 Could you review this documentation PR, @yaooqinn ?
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582501464 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -191,20 +203,23 @@ object LogKeys { case object HIVE_OPERATION_TYPE extends LogKey case object HOST extends LogKey case object HOST_PORT extends LogKey - case object IDENTIFIER extends LogKey Review Comment: Currently, only two places use this key, and both mean the name of the table.
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582501097 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -111,7 +117,7 @@ object LogKeys { case object DATA_FILE_NUM extends LogKey case object DATA_SOURCE extends LogKey case object DATA_SOURCES extends LogKey - case object DATA_SOURCE_PROVIDER extends LogKey Review Comment: Unified as `LogKeys.DATA_SOURCE`, remove `LogKeys.DATA_SOURCE_PROVIDER`
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582500958 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -191,20 +203,23 @@ object LogKeys { case object HIVE_OPERATION_TYPE extends LogKey case object HOST extends LogKey case object HOST_PORT extends LogKey - case object IDENTIFIER extends LogKey Review Comment: Unified as `LogKeys.TABLE_NAME`, remove `LogKeys.IDENTIFIER`
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582500836 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -276,31 +299,40 @@ object LogKeys { case object NUM_FILES_REUSED extends LogKey case object NUM_FREQUENT_ITEMS extends LogKey case object NUM_ITERATIONS extends LogKey + case object NUM_LEFT_PARTITION_VALUES extends LogKey + case object NUM_LOADED_ENTRIES extends LogKey case object NUM_LOCAL_FREQUENT_PATTERN extends LogKey case object NUM_PARTITION extends LogKey + case object NUM_PARTITION_VALUES extends LogKey case object NUM_POINT extends LogKey case object NUM_PREFIXES extends LogKey + case object NUM_RIGHT_PARTITION_VALUES extends LogKey case object NUM_SEQUENCES extends LogKey + case object NUM_VERSIONS_RETAIN extends LogKey + case object OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD extends LogKey case object OBJECT_ID extends LogKey case object OFFSET extends LogKey case object OFFSETS extends LogKey case object OFFSET_SEQUENCE_METADATA extends LogKey case object OLD_BLOCK_MANAGER_ID extends LogKey case object OLD_VALUE extends LogKey + case object OPEN_COST_IN_BYTES extends LogKey case object OPTIMIZED_PLAN_COLUMNS extends LogKey case object OPTIMIZER_CLASS_NAME extends LogKey case object OPTIONS extends LogKey case object OP_ID extends LogKey case object OP_TYPE extends LogKey + case object OUTPUT extends LogKey case object OVERHEAD_MEMORY_SIZE extends LogKey case object PARSE_MODE extends LogKey + case object PARTITIONED_FILE extends LogKey case object PARTITIONED_FILE_READER extends LogKey - case object PARTITIONS_SIZE extends LogKey Review Comment: Unified as `LogKeys.COUNT`, remove `LogKeys.PARTITIONS_SIZE`
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582500556 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -349,11 +387,11 @@ object LogKeys { case object RETRY_COUNT extends LogKey case object RETRY_INTERVAL extends LogKey case object RIGHT_EXPR extends LogKey + case object RIGHT_LOGICAL_PLAN_STATS_SIZE_IN_BYTES extends LogKey case object RMSE extends LogKey case object ROCKS_DB_LOG_LEVEL extends LogKey case object ROCKS_DB_LOG_MESSAGE extends LogKey case object RPC_ENDPOINT_REF extends LogKey - case object RULE_BATCH_NAME extends LogKey Review Comment: Unified as `LogKeys.BATCH_NAME`, remove `LogKeys.RULE_BATCH_NAME`
Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
yaooqinn commented on PR #46266: URL: https://github.com/apache/spark/pull/46266#issuecomment-2081812844 Pending CI results https://github.com/yaooqinn/spark/actions/runs/8872679083 https://github.com/yaooqinn/spark/actions/runs/8872179637
[PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun opened a new pull request, #46271: URL: https://github.com/apache/spark/pull/46271 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
yaooqinn commented on code in PR #46270: URL: https://github.com/apache/spark/pull/46270#discussion_r1582497889 ## core/src/test/scala/org/apache/spark/MapStatusesSerDeserBenchmark.scala: ## @@ -123,7 +123,6 @@ object MapStatusesSerDeserBenchmark extends BenchmarkBase { } override def afterAll(): Unit = { -tracker.stop() Review Comment: yes, the error was suppressed by tryLogNonFatalError
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582496007 ## sql/core/src/main/scala/org/apache/spark/sql/execution/r/ArrowRRunner.scala: ## @@ -161,17 +166,14 @@ class ArrowRRunner( val input = dataStream.readDouble val compute = dataStream.readDouble val output = dataStream.readDouble - logInfo( -("Times: boot = %.3f s, init = %.3f s, broadcast = %.3f s, " + - "read-input = %.3f s, compute = %.3f s, write-output = %.3f s, " + - "total = %.3f s").format( - boot, - init, - broadcast, - input, - compute, - output, - boot + init + broadcast + input + compute + output)) + logInfo(log"Times: boot = ${MDC(LogKeys.BOOT, format(boot))} s, " + Review Comment: I'm not sure whether to prefix the following keys with `R_`
Re: [PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
cloud-fan commented on code in PR #46263: URL: https://github.com/apache/spark/pull/46263#discussion_r1582495818 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala: ## @@ -120,34 +132,49 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J hasHitSelectiveFilter = hasHitSelectiveFilter || isLikelySelective(condition), currentPlan, targetKey) - case ExtractEquiJoinKeys(_, lkeys, rkeys, _, _, left, right, _) => + case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) => // Runtime filters use one side of the [[Join]] to build a set of join key values and prune // the other side of the [[Join]]. It's also OK to use a superset of the join key values // (ignore null values) to do the pruning. // We assume other rules have already pushed predicates through join if possible. // So the predicate references won't pass on anymore. if (left.output.exists(_.semanticEquals(targetKey))) { - extract(left, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false, -currentPlan = left, targetKey = targetKey).orElse { -// We can also extract from the right side if the join keys are transitive. Review Comment: let's enrich this comment following https://github.com/apache/spark/pull/46263/files#r1582495367
Re: [PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
cloud-fan commented on code in PR #46263: URL: https://github.com/apache/spark/pull/46263#discussion_r1582492374 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala: ## @@ -120,34 +132,49 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J hasHitSelectiveFilter = hasHitSelectiveFilter || isLikelySelective(condition), currentPlan, targetKey) - case ExtractEquiJoinKeys(_, lkeys, rkeys, _, _, left, right, _) => + case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) => // Runtime filters use one side of the [[Join]] to build a set of join key values and prune // the other side of the [[Join]]. It's also OK to use a superset of the join key values // (ignore null values) to do the pruning. Review Comment: I'm glad we described the idea in the comments. It's clear that for certain join types, the join child output is not a superset of the join output for transitive join keys.
Re: [PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
cloud-fan commented on code in PR #46263: URL: https://github.com/apache/spark/pull/46263#discussion_r1582495367

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:

```
@@ -120,34 +132,49 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J
         hasHitSelectiveFilter = hasHitSelectiveFilter || isLikelySelective(condition),
         currentPlan, targetKey)
-      case ExtractEquiJoinKeys(_, lkeys, rkeys, _, _, left, right, _) =>
+      case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) =>
         // Runtime filters use one side of the [[Join]] to build a set of join key values and prune
         // the other side of the [[Join]]. It's also OK to use a superset of the join key values
         // (ignore null values) to do the pruning.
         // We assume other rules have already pushed predicates through join if possible.
         // So the predicate references won't pass on anymore.
         if (left.output.exists(_.semanticEquals(targetKey))) {
-          extract(left, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false,
```

Review Comment: Let's clarify the expected strategy a bit more.

For an exact join key match, like the left table here, it is always OK to generate the runtime filter using the left table, no matter what the join type is. This is because the left table always produces a superset of the join output with respect to the left keys.

For a transitive join key match, it is different: the right table does not always produce a superset of the join output with respect to the left keys. Let's look at an example:

```
left table: 1, 2, 3
right table: 3, 4
left outer join output: (1, null), (2, null), (3, 3)
left keys: 1, 2, 3
```

So we can't use the right table to generate the runtime filter.
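The join-type restriction described above can be sketched language-agnostically. The following is an illustrative Python toy model (plain lists and sets standing in for Spark tables; none of these names are Spark APIs) of exactly the left-outer-join example from the review comment:

```python
# Toy model of the review example: a left outer join between
# left = [1, 2, 3] and right = [3, 4] on key equality.
left = [1, 2, 3]
right = [3, 4]

# A left outer join keeps every left row, padding unmatched rows with None.
join_output = [(l, l if l in right else None) for l in left]
assert join_output == [(1, None), (2, None), (3, 3)]

# Exact key match: the left keys appearing in the join output are always a
# subset of the left table's keys, so a runtime filter built from `left`
# is safe regardless of join type.
left_keys_in_output = {l for l, _ in join_output}
assert left_keys_in_output <= set(left)

# Transitive key match: a runtime filter built from `right` would prune
# keys 1 and 2, which DO appear in the left outer join output -- wrong.
pruned = [row for row in join_output if row[0] in set(right)]
assert pruned != join_output  # rows (1, None) and (2, None) were lost
```

The last assertion is the whole point: pruning the left side with the right side's key set changes the left outer join result, which is why `InjectRuntimeFilter` must check the child join type.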
Re: [PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
beliefer commented on code in PR #46263: URL: https://github.com/apache/spark/pull/46263#discussion_r1582090760

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:

```
@@ -86,6 +87,17 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J
   private def extractSelectiveFilterOverScan(
       plan: LogicalPlan,
       filterCreationSideKey: Expression): Option[(Expression, LogicalPlan)] = {
+
+    def canExtractRight(joinType: JoinType): Boolean = joinType match {
+      case Inner | LeftSemi | RightOuter => true
+      case _ => false
+    }
+
+    def canExtractLeft(joinType: JoinType): Boolean = joinType match {
+      case Inner | LeftSemi | LeftOuter | LeftAnti => true
+      case _ => false
+    }
```

Review Comment: Shall we add the two methods into `joins.scala`?
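The intent of the two Scala helpers in the diff above can be mirrored in a few lines of Python (join types modeled as plain strings here, not Spark's `JoinType` classes; this is an illustrative sketch, not Spark code):

```python
# Which child of a join may serve as the runtime-filter creation side,
# per join type -- mirroring canExtractRight/canExtractLeft above.
def can_extract_right(join_type: str) -> bool:
    # The right child's keys are a superset of the join output's keys
    # only for these join types.
    return join_type in {"Inner", "LeftSemi", "RightOuter"}

def can_extract_left(join_type: str) -> bool:
    return join_type in {"Inner", "LeftSemi", "LeftOuter", "LeftAnti"}

# A left outer join keeps unmatched left rows, so only the left child is safe.
assert can_extract_left("LeftOuter") and not can_extract_right("LeftOuter")
# An inner join drops unmatched rows on both sides, so either child works.
assert can_extract_left("Inner") and can_extract_right("Inner")
```

Note the asymmetry is exactly the superset argument from the earlier review comment: a child is usable only when its key values cover every key that can appear in the join output.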
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
dongjoon-hyun commented on code in PR #46270: URL: https://github.com/apache/spark/pull/46270#discussion_r1582489367

## core/src/test/scala/org/apache/spark/MapStatusesSerDeserBenchmark.scala:

```
@@ -123,7 +123,6 @@ object MapStatusesSerDeserBenchmark extends BenchmarkBase {
   }
   override def afterAll(): Unit = {
-    tracker.stop()
```

Review Comment: Oh, this seems to exist since Apache Spark 3.0.0, doesn't it? - https://github.com/apache/spark/pull/26169
[PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
yaooqinn opened a new pull request, #46270: URL: https://github.com/apache/spark/pull/46270

### What changes were proposed in this pull request?

This PR fixes an NPE in MapStatusesSerDeserBenchmark. The cause is that we try to stop the tracker twice.

```
java.lang.NullPointerException: Cannot invoke "org.apache.spark.rpc.RpcEndpointRef.askSync(Object, scala.reflect.ClassTag)" because the return value of "org.apache.spark.MapOutputTracker.trackerEndpoint()" is null
	at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:541)
	at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:551)
	at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:1242)
	at org.apache.spark.SparkEnv.stop(SparkEnv.scala:112)
	at org.apache.spark.SparkContext.$anonfun$stop$25(SparkContext.scala:2354)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1294)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:2354)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:2259)
	at org.apache.spark.MapStatusesSerDeserBenchmark$.afterAll(MapStatusesSerDeserBenchmark.scala:128)
	at org.apache.spark.benchmark.BenchmarkBase.main(BenchmarkBase.scala:80)
	at org.apache.spark.MapStatusesSerDeserBenchmark.main(MapStatusesSerDeserBenchmark.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.apache.spark.benchmark.Benchmarks$.$anonfun$main$7(Benchmarks.scala:128)
	at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1323)
	at org.apache.spark.benchmark.Benchmarks$.main(Benchmarks.scala:91)
	at org.apache.spark.benchmark.Benchmarks.main(Benchmarks.scala)
```

### Why are the changes needed?

Test bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.
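The underlying hazard in this report is a non-idempotent shutdown path: the first `stop()` clears the tracker endpoint, so the second `stop()` (from `SparkContext.stop`) dereferences null. A minimal Python sketch of that failure mode (a toy stand-in, not Spark's actual classes):

```python
class Tracker:
    """Toy stand-in for MapOutputTrackerMaster: stop() messages the
    endpoint, then clears it, so a second stop() fails."""
    def __init__(self):
        self.endpoint = "endpoint"

    def send_tracker(self):
        # Fails once the endpoint is cleared -- the Python analogue of the
        # NPE on trackerEndpoint() in the stack trace above.
        return self.endpoint.upper()

    def stop(self):
        self.send_tracker()
        self.endpoint = None

tracker = Tracker()
tracker.stop()            # first stop (benchmark's afterAll) succeeds
try:
    tracker.stop()        # second stop (via SparkContext.stop) blows up
    second_stop_ok = True
except AttributeError:
    second_stop_ok = False
assert not second_stop_ok
```

The PR's fix is the simpler option of removing the redundant explicit `tracker.stop()` and letting `SparkContext.stop()` do it once; the alternative design would be making `stop()` itself idempotent.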
Re: [PR] [SPARK-47954][K8S] Support creating ingress entry for external UI access [spark]
dongjoon-hyun commented on PR #46184: URL: https://github.com/apache/spark/pull/46184#issuecomment-2081769824 Gentle ping, @pan3793 . It would be great if we can land this before Apache Spark 4.0.0-preview. To @cloud-fan , when do you think you are going to cut `4.0.0-preview` tag? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582479845 ## sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala: ## @@ -340,11 +342,11 @@ trait AggregateCodegenSupport } Some(splitCodes) } else { -val errMsg = "Failed to split aggregate code into small functions because the parameter " + - "length of at least one split function went over the JVM limit: " + - CodeGenerator.MAX_JVM_METHOD_PARAMS_LENGTH +val errMsg = log"Failed to split aggregate code into small functions because the " + + log"parameter length of at least one split function went over the JVM limit: " + + log"${MDC(MAX_JVM_METHOD_PARAMS_LENGTH, CodeGenerator.MAX_JVM_METHOD_PARAMS_LENGTH)}" if (Utils.isTesting) { - throw SparkException.internalError(errMsg) + throw SparkException.internalError(errMsg.message) } else { logInfo(errMsg) Review Comment: For it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
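The `log"..."`/MDC migration discussed in this thread keeps a human-readable message while also carrying machine-readable keyed values. A rough Python illustration of that idea (not Spark's actual logging API; the 255 value is only an illustrative JVM parameter limit):

```python
def log_entry(template: str, **mdc):
    """Render a log message while retaining the keyed values (MDC) for
    structured output -- the idea behind the structured logging framework."""
    return {"message": template.format(**mdc), "context": mdc}

entry = log_entry(
    "Failed to split aggregate code into small functions because the "
    "parameter length of at least one split function went over the "
    "JVM limit: {max_jvm_method_params_length}",
    max_jvm_method_params_length=255,
)
assert entry["context"] == {"max_jvm_method_params_length": 255}
assert entry["message"].endswith("JVM limit: 255")
```

This is also why the review diff calls `errMsg.message` when throwing: the exception wants only the rendered string, while `logInfo` can consume the whole structured entry.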
Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
yaooqinn commented on PR #46266: URL: https://github.com/apache/spark/pull/46266#issuecomment-2081767859 Thank you @dongjoon-hyun. Unfortunately, they timed out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48032][BUILD] Upgrade `commons-codec` to 1.17.0 [spark]
dongjoon-hyun commented on PR #46268: URL: https://github.com/apache/spark/pull/46268#issuecomment-2081766387 Thank you, @panbingkun . Merged to master for Apache Spark 4.0.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48032][BUILD] Upgrade `commons-codec` to 1.17.0 [spark]
dongjoon-hyun closed pull request #46268: [SPARK-48032][BUILD] Upgrade `commons-codec` to 1.17.0 URL: https://github.com/apache/spark/pull/46268 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582478002 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -468,7 +467,7 @@ abstract class Optimizer(catalogManager: CatalogManager) } else if (filteredRules.nonEmpty) { Some(Batch(batch.name, batch.strategy, filteredRules: _*)) } else { - logInfo(log"Optimization batch '${MDC(RULE_BATCH_NAME, batch.name)}' " + + logInfo(log"Optimization batch '${MDC(LogKeys.RULE_BATCH_NAME, batch.name)}' " + Review Comment: Unified as `LogKeys.BATCH_NAME`, remove `LogKeys.RULE_BATCH_NAME` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582474060 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala: ## @@ -494,7 +495,7 @@ object DataSourceStrategy val partitionSet = AttributeSet(partitionColumns) val predicates = ExpressionSet(normalizedFilters .flatMap(extractPredicatesWithinOutputSet(_, partitionSet))) - logInfo(s"Pruning directories with: ${predicates.mkString(",")}") + logInfo(log"Pruning directories with: ${MDC(PREDICATES, predicates.mkString(","))}") Review Comment: I'm not sure if `EXPRS` is more general. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels [spark]
dongjoon-hyun commented on PR #46149: URL: https://github.com/apache/spark/pull/46149#issuecomment-2081755480 I added you (Xi Chen) to Apache Spark contributor group and assigned SPARK-47730 to you. Welcome to the Apache Spark community, @jshmchenxi ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582473304 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala: ## @@ -233,7 +233,7 @@ class FileScanRDD( if (files.hasNext) { currentFile = files.next() updateMetadataRow() - logInfo(s"Reading File $currentFile") + logInfo(log"Reading File ${MDC(PARTITIONED_FILE, currentFile)}") Review Comment: The class of `currentFile` is `PartitionedFile`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels [spark]
dongjoon-hyun closed pull request #46149: [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels URL: https://github.com/apache/spark/pull/46149 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582471773 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala: ## @@ -39,7 +39,7 @@ class FilePartitionReader[T]( if (currentReader == null) { if (files.hasNext) { val file = files.next() -logInfo(s"Reading file $file") +logInfo(log"Reading file ${MDC(PARTITIONED_FILE, file)}") Review Comment: The class of `file` is `PartitionedFile`, So `LogKeys.PATH` is not used https://github.com/apache/spark/blob/8c446f35dc038e92ddc36f375c304a8879ac6f5f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L70-L72 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]
HyukjinKwon commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1582469363 ## python/pyspark/sql/datasource_internal.py: ## @@ -0,0 +1,146 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + + +import json +import copy +from itertools import chain +from typing import Iterator, List, Optional, Sequence, Tuple + +from pyspark.sql.datasource import ( +DataSource, +DataSourceStreamReader, +InputPartition, +SimpleDataSourceStreamReader, +) +from pyspark.sql.types import StructType +from pyspark.errors import PySparkNotImplementedError + + +def _streamReader(datasource: DataSource, schema: StructType) -> "DataSourceStreamReader": +""" +Fallback to simpleStreamReader() method when streamReader() is not implemented. +This should be invoked whenever a DataSourceStreamReader needs to be created instead of +invoking datasource.streamReader() directly. +""" +try: +return datasource.streamReader(schema=schema) +except PySparkNotImplementedError: +return _SimpleStreamReaderWrapper(datasource.simpleStreamReader(schema=schema)) + + +class SimpleInputPartition(InputPartition): Review Comment: here https://github.com/apache/spark/tree/master/python/docs/source/user_guide/sql. 
feel free to do it separately, I don't mind. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]
HeartSaVioR commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1582466426 ## python/pyspark/sql/datasource_internal.py: ## @@ -0,0 +1,146 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + + +import json +import copy +from itertools import chain +from typing import Iterator, List, Optional, Sequence, Tuple + +from pyspark.sql.datasource import ( +DataSource, +DataSourceStreamReader, +InputPartition, +SimpleDataSourceStreamReader, +) +from pyspark.sql.types import StructType +from pyspark.errors import PySparkNotImplementedError + + +def _streamReader(datasource: DataSource, schema: StructType) -> "DataSourceStreamReader": +""" +Fallback to simpleStreamReader() method when streamReader() is not implemented. +This should be invoked whenever a DataSourceStreamReader needs to be created instead of +invoking datasource.streamReader() directly. 
+""" +try: +return datasource.streamReader(schema=schema) +except PySparkNotImplementedError: +return _SimpleStreamReaderWrapper(datasource.simpleStreamReader(schema=schema)) + + +class SimpleInputPartition(InputPartition): +def __init__(self, start: dict, end: dict): +self.start = start +self.end = end + + +class PrefetchedCacheEntry: +def __init__(self, start: dict, end: dict, iterator: Iterator[Tuple]): +self.start = start +self.end = end +self.iterator = iterator + + +class _SimpleStreamReaderWrapper(DataSourceStreamReader): +""" +A private class that wrap :class:`SimpleDataSourceStreamReader` in prefetch and cache pattern, +so that :class:`SimpleDataSourceStreamReader` can integrate with streaming engine like an +ordinary :class:`DataSourceStreamReader`. + +current_offset tracks the latest progress of the record prefetching, it is initialized to be +initialOffset() when query start for the first time or initialized to be the end offset of +the last committed batch when query restarts. + +When streaming engine calls latestOffset(), the wrapper calls read() that starts from +current_offset, prefetches and cache the data, then updates the current_offset to be +the end offset of the new data. + +When streaming engine call planInputPartitions(start, end), the wrapper get the prefetched data +from cache and send it to JVM along with the input partitions. + +When query restart, batches in write ahead offset log that has not been committed will be +replayed by reading data between start and end offset through read2(start, end). 
+""" + +def __init__(self, simple_reader: SimpleDataSourceStreamReader): +self.simple_reader = simple_reader +self.initial_offset: Optional[dict] = None +self.current_offset: Optional[dict] = None +self.cache: List[PrefetchedCacheEntry] = [] + +def initialOffset(self) -> dict: +if self.initial_offset is None: +self.initial_offset = self.simple_reader.initialOffset() +return self.initial_offset + +def latestOffset(self) -> dict: +# when query start for the first time, use initial offset as the start offset. Review Comment: Let's avoid randomness - you can manipulate both tests on restarting 1) restarting from the query which does not have leftover batch 2) restarting from the query which does have leftover batch (planned-but-yet-to-be-committed). We have several tests which adjusts offset log and commit log to test the behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
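The prefetch-and-cache pattern described in the `_SimpleStreamReaderWrapper` docstring can be reduced to a small sketch. This is an illustrative toy (simplified stand-ins, not the PySpark classes; offsets are plain dicts as in the PR):

```python
class SimpleReader:
    """Toy simple reader: read(start) returns all records available
    after `start` plus the new end offset."""
    def initial_offset(self):
        return {"pos": 0}

    def read(self, start):
        records = list(range(start["pos"], start["pos"] + 3))
        return records, {"pos": start["pos"] + 3}

class Wrapper:
    """Adapts SimpleReader to a latestOffset/planInputPartitions engine."""
    def __init__(self, reader):
        self.reader = reader
        self.current = reader.initial_offset()
        self.cache = []  # prefetched batches: (start, end, records)

    def latest_offset(self):
        # Prefetch from the current offset, cache the records, advance.
        start = self.current
        records, end = self.reader.read(start)
        self.cache.append((start, end, records))
        self.current = end
        return end

    def plan_partitions(self, start, end):
        # Serve a planned batch from the prefetch cache instead of
        # re-reading the source.
        for s, e, records in self.cache:
            if s == start and e == end:
                return records
        raise KeyError("batch not in cache")

w = Wrapper(SimpleReader())
end1 = w.latest_offset()
assert w.plan_partitions({"pos": 0}, end1) == [0, 1, 2]
end2 = w.latest_offset()
assert w.plan_partitions(end1, end2) == [3, 4, 5]
```

A restart replaying a planned-but-uncommitted batch would bypass the (empty) cache and re-read the `(start, end)` range directly, which is the `read2(start, end)` path the docstring mentions.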
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1582463921 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala: ## @@ -51,13 +51,30 @@ abstract class QueryStageExec extends LeafExecNode { */ val plan: SparkPlan + /** + * Name of this query stage which is unique in the entire query plan. + */ + val name: String = s"${this.getClass.getSimpleName}-$id" + + /** + * This flag aims to detect if the stage materialization is started. This helps + * to avoid unnecessary stage materialization when the stage is canceled. + */ + private val materializationStarted = new AtomicBoolean() Review Comment: I agree that we need clean API definition so please find new refactoring on this direction by last commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
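The `materializationStarted` flag in the diff above gates cancellation on whether the stage ever started running. A minimal Python sketch of that guard (toy classes, not Spark's `QueryStageExec`; `threading.Event` stands in for `AtomicBoolean`):

```python
from threading import Event

class QueryStage:
    """Toy stage: a materialization-started flag guards cancel()."""
    def __init__(self, name):
        self.name = name
        self._started = Event()       # analogue of the AtomicBoolean
        self.cancelled_work = False

    def materialize(self):
        self._started.set()
        # ... submit the shuffle/broadcast job here ...

    def cancel(self):
        # Only tear down running work; a stage that never materialized
        # has nothing to cancel, so skip the expensive path entirely.
        if self._started.is_set():
            self.cancelled_work = True

s1 = QueryStage("ShuffleQueryStageExec-1")
s2 = QueryStage("ShuffleQueryStageExec-2")
s1.materialize()
s1.cancel()
s2.cancel()   # never materialized: no-op
assert s1.cancelled_work and not s2.cancelled_work
```

This mirrors the PR's goal of avoiding unnecessary materialization of `ExchangeQueryStageExec` on cancellation: cancellation must not itself force the stage's future into existence.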
Re: [PR] [SPARK-47292][SS] safeMapToJValue should consider null typed values [spark]
HeartSaVioR commented on PR #46260: URL: https://github.com/apache/spark/pull/46260#issuecomment-2081738599 Late +1. Thanks @WweiL ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1582461066

## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala:
```diff
@@ -198,18 +215,23 @@ case class ShuffleQueryStageExec(
     reuse
   }
 
-  override def cancel(): Unit = shuffleFuture match {
-    case action: FutureAction[MapOutputStatistics] if !action.isCompleted =>
-      action.cancel()
-    case _ =>
+  override def cancel(): Unit = {
```

Review Comment: Added `doCancel()` to `ExchangeQueryStageExec`, and both QueryStageExecs implement it. Please see the last commit.
[PR] [SPARK-48033][VARIANT] Fix `RuntimeReplaceable` expressions being used in default columns [spark]
richardc-db opened a new pull request, #46269: URL: https://github.com/apache/spark/pull/46269

### What changes were proposed in this pull request?

Currently, default columns whose default is a `RuntimeReplaceable` expression fail. This is because the `AlterTableCommand` constant-folds before replacing expressions with the actual implementation. For example:
```
sql(s"CREATE TABLE t(v VARIANT DEFAULT parse_json('1')) USING PARQUET")
sql("INSERT INTO t VALUES(DEFAULT)")
```
fails because `parse_json` is `RuntimeReplaceable` and is evaluated before the analyzer inserts the correct expression into the plan.

### Why are the changes needed?

This allows default columns to use expressions that are `RuntimeReplaceable`. This is especially important for Variant types because literal variants are difficult to create - `parse_json` will likely be used the majority of the time.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

added UT

### Was this patch authored or co-authored using generative AI tooling?

no
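As a toy illustration of the ordering bug described above (all names are hypothetical; this is not Spark's expression framework): constant-folding a runtime-replaceable expression before it has been replaced with its real implementation fails, while replacing first and then folding succeeds.

```python
class Literal:
    """A foldable expression holding a plain value."""
    def __init__(self, value):
        self.value = value

    def eval(self):
        return self.value


class ParseJson:
    """Stand-in for a RuntimeReplaceable expression: it cannot be
    evaluated directly, only its replacement can."""
    def __init__(self, arg):
        self.arg = arg

    def eval(self):
        raise RuntimeError("Cannot evaluate RuntimeReplaceable directly")

    def replacement(self):
        # Pretend replacement: build the "real" value the expression stands for.
        return Literal(("variant", self.arg))


def constant_fold(expr):
    """Eagerly evaluate an expression down to a Literal."""
    return Literal(expr.eval())


expr = ParseJson("1")

# Buggy order: fold before replacing -> evaluation blows up.
try:
    constant_fold(expr)
    buggy_ok = True
except RuntimeError:
    buggy_ok = False

# Fixed order: replace first, then fold.
fixed = constant_fold(expr.replacement())
print(buggy_ok, fixed.value)  # False ('variant', '1')
```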
Re: [PR] [SPARK-47292][SS] safeMapToJValue should consider null typed values [spark]
HyukjinKwon closed pull request #46260: [SPARK-47292][SS] safeMapToJValue should consider null typed values URL: https://github.com/apache/spark/pull/46260
Re: [PR] [SPARK-47292][SS] safeMapToJValue should consider null typed values [spark]
HyukjinKwon commented on PR #46260: URL: https://github.com/apache/spark/pull/46260#issuecomment-2081724885 Merged to master.
Re: [PR] [SPARK-48014][SQL] Change the makeFromJava error in EvaluatePython to a user-facing error [spark]
HyukjinKwon closed pull request #46250: [SPARK-48014][SQL] Change the makeFromJava error in EvaluatePython to a user-facing error URL: https://github.com/apache/spark/pull/46250
Re: [PR] [SPARK-48014][SQL] Change the makeFromJava error in EvaluatePython to a user-facing error [spark]
HyukjinKwon commented on PR #46250: URL: https://github.com/apache/spark/pull/46250#issuecomment-2081724390 Merged to master.
Re: [PR] [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image [spark]
panbingkun commented on PR #46258: URL: https://github.com/apache/spark/pull/46258#issuecomment-2081715514 cc @zhengruifeng
Re: [PR] [SPARK-44635][CORE] Handle shuffle fetch failures in decommissions [spark]
github-actions[bot] closed pull request #42296: [SPARK-44635][CORE] Handle shuffle fetch failures in decommissions URL: https://github.com/apache/spark/pull/42296
Re: [PR] [TEST ONLY][SQL] Test resolve column references with PLAN_ID [spark]
github-actions[bot] commented on PR #43115: URL: https://github.com/apache/spark/pull/43115#issuecomment-2081715244 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-46744][SPARK-SHELL][SQL][CONNECT][PYTHON][R] Display clear `exit command` for all spark terminal [spark]
github-actions[bot] closed pull request #44769: [SPARK-46744][SPARK-SHELL][SQL][CONNECT][PYTHON][R] Display clear `exit command` for all spark terminal URL: https://github.com/apache/spark/pull/44769
Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]
HyukjinKwon commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1582440585

## python/pyspark/sql/datasource.py:
```diff
@@ -469,6 +501,188 @@ def stop(self) -> None:
         ...
 
 
+class SimpleInputPartition(InputPartition):
+    def __init__(self, start: dict, end: dict):
+        self.start = start
+        self.end = end
+
+
+class SimpleDataSourceStreamReader(ABC):
+    """
+    A base class for simplified streaming data source readers. Compared to
+    DataSourceStreamReader, SimpleDataSourceStreamReader doesn't require planning data
+    partitioning. Also, the read API of SimpleDataSourceStreamReader allows reading data
+    and planning the latest offset at the same time.
+
+    .. versionadded: 4.0.0
+    """
+
+    def initialOffset(self) -> dict:
+        """
+        Return the initial offset of the streaming data source.
+        A new streaming query starts reading data from the initial offset.
+        If Spark is restarting an existing query, it will restart from the check-pointed
+        offset rather than the initial one.
+
+        Returns
+        -------
+        dict
+            A dict or recursive dict whose key and value are primitive types, which
+            includes Integer, String and Boolean.
+
+        Examples
+        --------
+        >>> def initialOffset(self):
+        ...     return {"partition-1": {"index": 3, "closed": True}, "partition-2": {"index": 5}}
+        """
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "initialOffset"},
+        )
+
+    def read(self, start: dict) -> (Iterator[Tuple], dict):
+        """
+        Read all available data from the specified start offset and return the offset
+        that the next read attempt starts from.
+
+        Parameters
+        ----------
+        start : dict
+            The start offset to start reading from.
+
+        Returns
+        -------
+        A tuple of an iterator of :class:`Tuple` and a dict
+            The iterator contains all the available records after the start offset.
+            The dict is the end of this read attempt and the start of the next read attempt.
+        """
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "read"},
+        )
+
+    def read2(self, start: dict, end: dict) -> Iterator[Tuple]:
```

Review Comment: It can't have overloaded ones, but it can dispatch by embedding if-else and leveraging an optional argument, e.g.:
```python
def read(self, start: dict, end: dict = None) -> Union[Tuple[Iterator[Tuple], dict], Iterator[Tuple]]:
    if end is None:
        return  # logic for read(start)
    else:
        return  # logic for read(start, end)
```
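A runnable sketch of the optional-argument dispatch suggested in the review comment above. All class, method, and offset names here are hypothetical; this only demonstrates the pattern, not the actual PySpark `SimpleDataSourceStreamReader` API.

```python
from typing import Iterator, Optional, Tuple, Union


class CounterStreamReader:
    """Toy reader dispatching on an optional `end` argument:
    read(start) returns (rows, next_offset); read(start, end) replays a range."""

    def __init__(self, data):
        # Pretend source: a list of (index, value) rows.
        self.data = data

    def read(
        self, start: dict, end: Optional[dict] = None
    ) -> Union[Tuple[Iterator[tuple], dict], Iterator[tuple]]:
        if end is None:
            # read(start): return all available rows plus the next start offset.
            rows = [r for r in self.data if r[0] >= start["index"]]
            next_offset = {"index": start["index"] + len(rows)}
            return iter(rows), next_offset
        # read(start, end): deterministically replay a closed range.
        return iter(r for r in self.data if start["index"] <= r[0] < end["index"])


reader = CounterStreamReader([(0, "a"), (1, "b"), (2, "c")])
rows, nxt = reader.read({"index": 1})
print(list(rows), nxt)  # [(1, 'b'), (2, 'c')] {'index': 3}
print(list(reader.read({"index": 0}, {"index": 2})))  # [(0, 'a'), (1, 'b')]
```

The single entry point keeps the user-facing surface to one method name, at the cost of a `Union` return type, which is the trade-off the comment points out.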
Re: [PR] [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Remove `pyspark.sql.classic` reference in `pyspark.ml.stat` [spark]
HyukjinKwon closed pull request #46262: [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Remove `pyspark.sql.classic` reference in `pyspark.ml.stat` URL: https://github.com/apache/spark/pull/46262
Re: [PR] [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Remove `pyspark.sql.classic` reference in `pyspark.ml.stat` [spark]
HyukjinKwon commented on PR #46262: URL: https://github.com/apache/spark/pull/46262#issuecomment-2081701064 Thanks! Merged to master.
[PR] [SPARK-48032][BUILD] Upgrade `commons-codec` to 1.17.0 [spark]
panbingkun opened a new pull request, #46268: URL: https://github.com/apache/spark/pull/46268

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
dongjoon-hyun commented on PR #46266: URL: https://github.com/apache/spark/pull/46266#issuecomment-2081693301 Looking forward to seeing the result.
Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
dongjoon-hyun commented on PR #46266: URL: https://github.com/apache/spark/pull/46266#issuecomment-2081693160 Thank you!
Re: [PR] [SPARK-46350][SS] Fix state removal for stream-stream join with one watermark and one time-interval condition [spark]
neilramaswamy commented on code in PR #44323: URL: https://github.com/apache/spark/pull/44323#discussion_r1582300122

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:
```diff
@@ -219,10 +222,41 @@ object StreamingSymmetricHashJoinHelper extends Logging {
         attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
         condition,
         eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-    val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-    expr.map(JoinStateValueWatermarkPredicate.apply _)
+    // If the condition itself is empty (for example, left_time < left_time + INTERVAL ...),
+    // then we will not have generated a stateValueWatermark.
+    if (stateValueWatermark.isEmpty) {
+      None
+    } else {
+      // For example, if the condition is of the form:
+      //    left_time > right_time + INTERVAL 30 MINUTES
+      // Then this extracts left_time and right_time.
+      val attributesInCondition = AttributeSet(
+        condition.get.collect { case a: AttributeReference => a }
+      )
+
+      // Construct an AttributeSet so that we can perform equality between attributes,
+      // which we do in the filter below.
+      val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes)
+
+      // oneSideInputAttributes could be [left_value, left_time], and we just
+      // want the attribute _in_ the time-interval condition.
+      val oneSideStateWatermarkAttributes = attributesInCondition.filter { a =>
+        oneSideInputAttributeSet.contains(a)
+      }
+
+      // There should be a single attribute per side in the time-interval condition, so
+      // filtering for oneSideInputAttributes as done above should leave us with 1 attribute.
+      if (oneSideStateWatermarkAttributes.size == 1) {
+        val expr =
+          watermarkExpression(Some(oneSideStateWatermarkAttributes.head), stateValueWatermark)
+        expr.map(JoinStateValueWatermarkPredicate.apply _)
+      } else {
+        // This should never happen, since the grammar will ensure that we have one attribute
```

Review Comment: Good question. I thought more about this, and I actually think that I might be wrong in the case of an edge case we don't have in any of our tests. If the user does `left_time > right_time + m AND other_left_time > right_time + n`, there will be _three_ attributes in the condition. Then, `oneSideStateWatermarkAttributes.size` will be 2 (it will be `left_time` and `other_left_time`, neither of which are watermark attributes), and the condition that we need to return would be a conjunctive watermark predicate: `left_time <= watermark(right) + m AND other_left_time <= watermark(right) + n`. We can remove state, but I'm pretty sure the current implementation in Spark master would fail. I need to check this. It might be out of scope for this PR.
[PR] [WIP][SPARK-48031] view evolution [spark]
srielau opened a new pull request, #46267: URL: https://github.com/apache/spark/pull/46267

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
yaooqinn commented on PR #46266: URL: https://github.com/apache/spark/pull/46266#issuecomment-2081486023 Waiting for CI results https://github.com/yaooqinn/spark/actions/runs/8867970355 https://github.com/yaooqinn/spark/actions/runs/8867971118
[PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
yaooqinn opened a new pull request, #46266: URL: https://github.com/apache/spark/pull/46266

### What changes were proposed in this pull request?

This PR aims to fix benchmark errors and regenerate benchmark results for Apache Spark 4.0.0 after turning ANSI on. The latest baseline has been updated by SPARK-47513.

### Why are the changes needed?

SPARK-4 turns ANSI on by default, so there could be performance-related issues.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.
Re: [PR] [SPARK-48030][SQL]SPJ: cache rowOrdering and structType for InternalRowComparableWrapper [spark]
advancedxy commented on PR #46265: URL: https://github.com/apache/spark/pull/46265#issuecomment-2081471602 @sunchao, @szehon-ho and @yabola would you mind taking a look and helping review this one?
Re: [PR] [SPARK-48030][SQL]SPJ: cache rowOrdering and structType for InternalRowComparableWrapper [spark]
advancedxy commented on code in PR #46265: URL: https://github.com/apache/spark/pull/46265#discussion_r1582148061

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/InternalRowComparableWrapper.scala:
```diff
@@ -53,6 +55,21 @@ class InternalRowComparableWrapper(val row: InternalRow, val dataTypes: Seq[Data
 }
 
 object InternalRowComparableWrapper {
+  private final val MAX_CACHE_ENTRIES = 1024
```

Review Comment: 1024 should be sufficient; it could be a SQL configuration, though.
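The kind of bounded cache under discussion, expensive derived objects (such as an ordering or struct type) keyed by a schema, capped at `MAX_CACHE_ENTRIES` entries, can be sketched in a few lines. This is a plain-Python LRU toy, not the PR's actual implementation; `BoundedCache` and `get_or_create` are hypothetical names.

```python
from collections import OrderedDict

MAX_CACHE_ENTRIES = 1024  # mirrors the constant in the diff above


class BoundedCache:
    """Tiny LRU cache: build the derived object once per key, and evict the
    least recently used entry once the size cap is exceeded, so memory
    use stays bounded no matter how many distinct schemas show up."""

    def __init__(self, max_entries=MAX_CACHE_ENTRIES):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get_or_create(self, key, factory):
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        value = factory()  # the expensive construction happens only on a miss
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
        return value


cache = BoundedCache(max_entries=2)
a1 = cache.get_or_create(("int", "string"), lambda: object())
a2 = cache.get_or_create(("int", "string"), lambda: object())
print(a1 is a2)  # True: built once per schema, reused afterwards
```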
Re: [PR] [SPARK-48030][SQL]SPJ: cache rowOrdering and structType for InternalRowComparableWrapper [spark]
advancedxy commented on PR #46265: URL: https://github.com/apache/spark/pull/46265#issuecomment-2081471221

Before applying the changes in this PR, the benchmark code (in this PR) took:
```
[info] OpenJDK 64-Bit Server VM 21.0.2 on Mac OS X 14.4.1
[info] Apple M1
[info] internal row comparable wrapper:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] -----------------------------------------------------------------------------------------------------------
[info] toSet                                     10785         11369        825        0.0      53925.3       1.0X
[info] mergePartitions                           21073         21447        528        0.0     105364.5       0.5X
```
After this PR:
```
[info] OpenJDK 64-Bit Server VM 21.0.2 on Mac OS X 14.4.1
[info] Apple M1
[info] internal row comparable wrapper:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] -----------------------------------------------------------------------------------------------------------
[info] toSet                                        85           103         25        2.4        425.0       1.0X
[info] mergePartitions                             163           167          4        1.2        816.3       0.5X
```
[PR] [SPARK-48030][SQL]SPJ: cache rowOrdering and structType for InternalRowComparableWrapper [spark]
advancedxy opened a new pull request, #46265: URL: https://github.com/apache/spark/pull/46265

### What changes were proposed in this pull request?

Cache rowOrdering and structType for InternalRowComparableWrapper

### Why are the changes needed?

For performance issues

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Added a new benchmark to verify the performance improvement

### Was this patch authored or co-authored using generative AI tooling?

NO
Re: [PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
beliefer commented on code in PR #46263: URL: https://github.com/apache/spark/pull/46263#discussion_r1582095704

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
```diff
@@ -120,34 +132,49 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J
           hasHitSelectiveFilter = hasHitSelectiveFilter || isLikelySelective(condition),
           currentPlan, targetKey)
-      case ExtractEquiJoinKeys(_, lkeys, rkeys, _, _, left, right, _) =>
+      case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) =>
         // Runtime filters use one side of the [[Join]] to build a set of join key values and prune
         // the other side of the [[Join]]. It's also OK to use a superset of the join key values
         // (ignore null values) to do the pruning.
         // We assume other rules have already pushed predicates through join if possible.
         // So the predicate references won't pass on anymore.
         if (left.output.exists(_.semanticEquals(targetKey))) {
-          extract(left, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false,
-            currentPlan = left, targetKey = targetKey).orElse {
-            // We can also extract from the right side if the join keys are transitive.
+          val extractLeft = if (canExtractLeft(joinType)) {
+            extract(left, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false,
+              currentPlan = left, targetKey = targetKey)
+          } else {
+            None
+          }
+          val extractRight = if (canExtractRight(joinType)) {
             lkeys.zip(rkeys).find(_._1.semanticEquals(targetKey)).map(_._2)
               .flatMap { newTargetKey =>
                 extract(right, AttributeSet.empty, hasHitFilter = false,
                   hasHitSelectiveFilter = false, currentPlan = right, targetKey = newTargetKey)
               }
+          } else {
+            None
           }
+          extractLeft.orElse(extractRight)
         } else if (right.output.exists(_.semanticEquals(targetKey))) {
-          extract(right, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false,
-            currentPlan = right, targetKey = targetKey).orElse {
+          val extractRight = if (canExtractRight(joinType)) {
+            extract(right, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false,
+              currentPlan = right, targetKey = targetKey)
+          } else {
+            None
+          }
+          val extractLeft = if (canExtractLeft(joinType)) {
             // We can also extract from the left side if the join keys are transitive.
             rkeys.zip(lkeys).find(_._1.semanticEquals(targetKey)).map(_._2)
               .flatMap { newTargetKey =>
                 extract(left, AttributeSet.empty, hasHitFilter = false,
                   hasHitSelectiveFilter = false, currentPlan = left, targetKey = newTargetKey)
               }
+          } else {
+            None
           }
+          extractRight.orElse(extractLeft)
```

Review Comment: ditto

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
```diff
@@ -86,6 +87,17 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J
   private def extractSelectiveFilterOverScan(
       plan: LogicalPlan,
       filterCreationSideKey: Expression): Option[(Expression, LogicalPlan)] = {
+
+    def canExtractRight(joinType: JoinType): Boolean = joinType match {
+      case Inner | LeftSemi | RightOuter => true
+      case _ => false
+    }
+
+    def canExtractLeft(joinType: JoinType): Boolean = joinType match {
+      case Inner | LeftSemi | LeftOuter | LeftAnti => true
+      case _ => false
+    }
```

Review Comment: Shall we add the two methods into `joins.scala`?

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
```diff
@@ -120,34 +132,49 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J
           hasHitSelectiveFilter = hasHitSelectiveFilter || isLikelySelective(condition),
           currentPlan, targetKey)
-      case ExtractEquiJoinKeys(_, lkeys, rkeys, _, _, left, right, _) =>
+      case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) =>
         // Runtime filters use one side of the [[Join]] to build a set of join key values and prune
         // the other side of the [[Join]]. It's also OK to use a superset of the join key values
         // (ignore null values) to do the pruning.
         // We assume other rules have already pushed predicates through join if possible.
         // So the predicate references won't pass on anymore.
         if (left.output.exists(_.semanticEquals(targetKey))) {
-          extract(left,
```
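The `canExtractLeft` / `canExtractRight` predicates quoted in the diff above can be transcribed directly. Here is a plain-Python rendering (the function names mirror the Scala; the string join-type tags are a stand-in for Spark's `JoinType` case objects): a runtime filter may only be extracted from a join child whose join type does not pad or discard that side's matching rows.

```python
def can_extract_right(join_type: str) -> bool:
    # Mirrors: case Inner | LeftSemi | RightOuter => true
    return join_type in ("Inner", "LeftSemi", "RightOuter")


def can_extract_left(join_type: str) -> bool:
    # Mirrors: case Inner | LeftSemi | LeftOuter | LeftAnti => true
    return join_type in ("Inner", "LeftSemi", "LeftOuter", "LeftAnti")


# A LeftOuter join keeps every left row, so the left side is safe to extract
# from, but the right side may gain unmatched (null-padded) rows:
print(can_extract_left("LeftOuter"), can_extract_right("LeftOuter"))  # True False
```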
[PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun opened a new pull request, #46264: URL: https://github.com/apache/spark/pull/46264

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[PR] [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates [spark]
beliefer opened a new pull request, #42223: URL: https://github.com/apache/spark/pull/42223 ### What changes were proposed in this pull request? Some queries contain multiple scalar subqueries (aggregates without a GROUP BY clause) connected with joins. The general form of joined aggregates that can be merged is as follows:
```
<agg-func>          ::= SUM | AVG | MAX | ...
<aggregate>         ::= SELECT <agg-func>(...)[ , <agg-func>(...)[ , ...]] FROM [tab | query]
<joined-aggregates> ::= SELECT * FROM (
  <aggregate> [INNER | CROSS | LEFT | RIGHT | FULL OUTER] JOIN
  <aggregate> [INNER | CROSS | LEFT | RIGHT | FULL OUTER] JOIN
  ...
  [INNER | CROSS | LEFT | RIGHT | FULL OUTER] JOIN <aggregate> )
```
For example,
```
SELECT *
FROM (SELECT avg(power) avg_power,
             count(power) count_power,
             count(DISTINCT power) count_distinct_power
      FROM data
      WHERE country = "USA"
        AND (id BETWEEN 1 AND 3 OR city = "Berkeley" OR name = "Xiao")) B1,
     (SELECT avg(power) avg_power,
             count(power) count_power,
             count(DISTINCT power) count_distinct_power
      FROM data
      WHERE country = "China"
        AND (id BETWEEN 4 AND 5 OR city = "Hangzhou" OR name = "Wenchen")) B2
```
We can optimize this SQL to the form shown below:
```
SELECT avg(power) FILTER (WHERE country = "USA" AND (id BETWEEN 1 AND 3 OR city = "Berkeley" OR name = "Xiao")) avg_power,
       count(power) FILTER (WHERE country = "USA" AND (id BETWEEN 1 AND 3 OR city = "Berkeley" OR name = "Xiao")) count_power,
       count(DISTINCT power) FILTER (WHERE country = "USA" AND (id BETWEEN 1 AND 3 OR city = "Berkeley" OR name = "Xiao")),
       avg(power) FILTER (WHERE country = "China" AND (id BETWEEN 4 AND 5 OR city = "Hangzhou" OR name = "Wenchen")),
       count(power) FILTER (WHERE country = "China" AND (id BETWEEN 4 AND 5 OR city = "Hangzhou" OR name = "Wenchen")),
       count(DISTINCT power) FILTER (WHERE country = "China" AND (id BETWEEN 4 AND 5 OR city = "Hangzhou" OR name = "Wenchen"))
FROM data
WHERE (country = "USA" AND (id BETWEEN 1 AND 3 OR city = "Berkeley" OR name = "Xiao"))
   OR (country = "China" AND (id BETWEEN 4 AND 5 OR city = "Hangzhou" OR name = "Wenchen"))
```
If we can merge the filters and aggregates, we can scan the data source only once and eliminate the join, avoiding the shuffle. This PR also supports eliminating nested joins, please refer to: https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q28.sql Obviously, this change will improve the performance. This PR also reuses some functions from `MergeScalarSubqueries`, and adds some `TreePattern`s to make it easy to check the cost of predicates. ### Why are the changes needed? Improve the performance for the case shown above. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? 1. new test cases 2. new micro benchmark.
```
Benchmark CombineJoinedAggregates:                     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-------------------------------------------------------------------------------------------------------------------------------
filter is not defined, CombineJoinedAggregates: false            730           819         69       28.7         34.8      1.0X
filter is not defined, CombineJoinedAggregates: true             618           632         14       33.9         29.5      1.2X
step is 100, CombineJoinedAggregates: false                      572           590         20       36.7         27.3      1.3X
step is 100, CombineJoinedAggregates: true                       769           794         21       27.3         36.6      1.0X
step is 10, CombineJoinedAggregates: false                       350           370         26       59.9         16.7      2.1X
step is 10, CombineJoinedAggregates: true                        231           241         10       90.7         11.0      3.2X
step is 1, CombineJoinedAggregates: false                        314           340         26       66.8         15.0      2.3X
step is 1, CombineJoinedAggregates: true                         171           182          9      122.5          8.2      4.3X
step is 1000, CombineJoinedAggregates: false                     303           337         32       69.3         14.4      2.4X
step is 1000, CombineJoinedAggregates: true                      162           171          9      129.4          7.7      4.5X
step is 100, CombineJoinedAggregates: false                      300           316         27       70.0         14.3      2.4X
step is 100, CombineJoinedAggregates: true                       160           169          9      131.3          7.6      4.6X
step is 10,
```
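Conceptually, the rewrite above is ordinary conditional aggregation: each subquery's predicate becomes a FILTER on a single shared scan, so the join of scalar aggregates disappears. A minimal Python sketch of that idea — the data and names here are illustrative only, not Spark's implementation:

```python
# Instead of scanning `rows` once per branch predicate (as a join of scalar
# subqueries would), scan once and route each row into every aggregate whose
# predicate it satisfies -- the role of SQL's FILTER (WHERE ...) clause.

rows = [
    {"country": "USA", "power": 10},
    {"country": "USA", "power": 30},
    {"country": "China", "power": 50},
]

predicates = {
    "usa": lambda r: r["country"] == "USA",
    "china": lambda r: r["country"] == "China",
}

sums = {name: 0 for name in predicates}
counts = {name: 0 for name in predicates}

for r in rows:  # single scan: no join, no shuffle between branches
    for name, pred in predicates.items():
        if pred(r):  # conditional aggregation per branch
            sums[name] += r["power"]
            counts[name] += 1

avgs = {name: sums[name] / counts[name] for name in predicates}
print(avgs)  # {'usa': 20.0, 'china': 50.0}
```

The same answer would come from joining two separately filtered aggregate queries; merging them trades two scans plus a join for one scan with per-aggregate predicates.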
[PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
AngersZh opened a new pull request, #46263: URL: https://github.com/apache/spark/pull/46263 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on PR #46261: URL: https://github.com/apache/spark/pull/46261#issuecomment-2081416283 Merged to master
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn closed pull request #46261: [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark URL: https://github.com/apache/spark/pull/46261
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on PR #46261: URL: https://github.com/apache/spark/pull/46261#issuecomment-2081402895 Thank you @HyukjinKwon. I have dispatched a CI task to audit, https://github.com/yaooqinn/spark/actions/runs/8866497258
Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]
cloud-fan commented on PR #46188: URL: https://github.com/apache/spark/pull/46188#issuecomment-2081399740 is the end goal to automatically update table statistics?
Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]
cloud-fan commented on code in PR #46188: URL: https://github.com/apache/spark/pull/46188#discussion_r1582063564 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala: ##
```
@@ -223,6 +278,9 @@ class BasicWriteJobStatsTracker(
     val executionId = sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
     SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, driverSideMetrics.values.toList)
+
+    SQLPartitionMetrics.postDriverMetricUpdates(sparkContext, executionId,
```
Review Comment: So these new partition metrics are not shown in the Spark UI, but posted as a listener event that third-party libraries can consume?
Re: [PR] [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Remove `pyspark.sql.classic` reference in `pyspark.ml.stat` [spark]
HyukjinKwon commented on PR #46262: URL: https://github.com/apache/spark/pull/46262#issuecomment-2081397805 cc @zhengruifeng
Re: [PR] [SPARK-48024][PYTHON][CONNECT][TESTS] Enable `UDFParityTests.test_udf_timestamp_ntz` [spark]
HyukjinKwon closed pull request #46257: [SPARK-48024][PYTHON][CONNECT][TESTS] Enable `UDFParityTests.test_udf_timestamp_ntz` URL: https://github.com/apache/spark/pull/46257
Re: [PR] [SPARK-48024][PYTHON][CONNECT][TESTS] Enable `UDFParityTests.test_udf_timestamp_ntz` [spark]
HyukjinKwon commented on PR #46257: URL: https://github.com/apache/spark/pull/46257#issuecomment-2081397408 Merged to master.
[PR] [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Remove `pyspark.sql.classic` reference in `pyspark.ml.stat` [spark]
HyukjinKwon opened a new pull request, #46262: URL: https://github.com/apache/spark/pull/46262 ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/46155 that removes the reference of `_to_seq` that the `pyspark-connect` package does not have. ### Why are the changes needed? To recover the CI https://github.com/apache/spark/actions/runs/8821919392/job/24218893631 ### Does this PR introduce _any_ user-facing change? No, the main change has not been released out yet. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No.
Re: [PR] [SPARK-48002][PYTHON][SS] Add test for observed metrics in PySpark StreamingQueryListener [spark]
HyukjinKwon closed pull request #46237: [SPARK-48002][PYTHON][SS] Add test for observed metrics in PySpark StreamingQueryListener URL: https://github.com/apache/spark/pull/46237
Re: [PR] [SPARK-48002][PYTHON][SS] Add test for observed metrics in PySpark StreamingQueryListener [spark]
HyukjinKwon commented on PR #46237: URL: https://github.com/apache/spark/pull/46237#issuecomment-2081395666 Merged to master.
Re: [PR] [SPARK-48021][ML][BUILD][FOLLOWUP] add `--add-modules=jdk.incubator.vector` to maven compile args [spark]
LuciferYang commented on PR #46259: URL: https://github.com/apache/spark/pull/46259#issuecomment-2081389478 Merged into master for Spark 4.0. Thanks @panbingkun and @yaooqinn
Re: [PR] [SPARK-48021][ML][BUILD][FOLLOWUP] add `--add-modules=jdk.incubator.vector` to maven compile args [spark]
LuciferYang closed pull request #46259: [SPARK-48021][ML][BUILD][FOLLOWUP] add `--add-modules=jdk.incubator.vector` to maven compile args URL: https://github.com/apache/spark/pull/46259
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on code in PR #46261: URL: https://github.com/apache/spark/pull/46261#discussion_r1582051605 ## sql/core/benchmarks/DateTimeBenchmark-jdk21-results.txt: ##
```
@@ -2,460 +2,460 @@ datetime +/- interval
-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
 AMD EPYC 7763 64-Core Processor
 datetime +/- interval:            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-date + interval(m)                          850           887         33       11.8         85.0      1.0X
-date + interval(m, d)                       863           864          2       11.6         86.3      1.0X
-date + interval(m, d, ms)                  3507          3511          5        2.9        350.7      0.2X
-date - interval(m)                          841           851          9       11.9         84.1      1.0X
-date - interval(m, d)                       864           870          5       11.6         86.4      1.0X
-date - interval(m, d, ms)                  3518          3519          2        2.8        351.8      0.2X
-timestamp + interval(m)                    1756          1759          5        5.7        175.6      0.5X
-timestamp + interval(m, d)                 1802          1805          4        5.5        180.2      0.5X
-timestamp + interval(m, d, ms)             1958          1961          4        5.1        195.8      0.4X
-timestamp - interval(m)                    1744          1745          2        5.7        174.4      0.5X
-timestamp - interval(m, d)                 1796          1799          4        5.6        179.6      0.5X
-timestamp - interval(m, d, ms)             1944          1947          5        5.1        194.4      0.4X
+date + interval(m)                         1149          1158         12        8.7        114.9      1.0X
+date + interval(m, d)                      1136          1137          1        8.8        113.6      1.0X
+date + interval(m, d, ms)                  3779          3799         29        2.6        377.9      0.3X
+date - interval(m)                         1113          1116          4        9.0        111.3      1.0X
+date - interval(m, d)                      1124          1141         25        8.9        112.4      1.0X
+date - interval(m, d, ms)                  3795          3796          1        2.6        379.5      0.3X
+timestamp + interval(m)                    1528          1530          3        6.5        152.8      0.8X
+timestamp + interval(m, d)                 1581          1585          6        6.3        158.1      0.7X
+timestamp + interval(m, d, ms)             2037          2044         10        4.9        203.7      0.6X
+timestamp - interval(m)                    1786          1790          6        5.6        178.6      0.6X
+timestamp - interval(m, d)                 1865          1872         10        5.4        186.5      0.6X
+timestamp - interval(m, d, ms)             2038          2054         23        4.9        203.8      0.6X

 Extract components
-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
 AMD EPYC 7763 64-Core Processor
 cast to timestamp:                Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-cast to timestamp wholestage off            209           209          0       47.9         20.9      1.0X
-cast to timestamp wholestage on             209           225         15       47.8         20.9      1.0X
+cast to timestamp wholestage off            192           198          9       52.2         19.2      1.0X
+cast to timestamp wholestage on             206           213          6       48.5         20.6      0.9X

-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
 AMD EPYC 7763 64-Core Processor
 year
```
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on code in PR #46261: URL: https://github.com/apache/spark/pull/46261#discussion_r1582049065 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeBenchmark.scala: ##
```
@@ -161,7 +167,7 @@ object DateTimeBenchmark extends SqlBasedBenchmark {
     }
     val dateExpr = "cast(timestamp_seconds(id) as date)"
     Seq("year", "", "yy", "mon", "month", "mm").foreach { level =>
-      run(N, s"trunc $level", s"trunc('$level', $dateExpr)")
+      run(N, s"trunc $level", s"trunc($dateExpr, '$level')")
```
Review Comment: the parameter order is wrong
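For reference, Spark SQL's `trunc` takes the date as the first argument and the format as the second, which is what the corrected call above does. A hedged Python sketch of the truncation semantics being benchmarked (covering only the year/month formats used here; unrecognized formats yield NULL in Spark, modeled as `None`):

```python
from datetime import date

def trunc(d, fmt):
    """Sketch of Spark SQL trunc(date, fmt): truncate d to the unit named by fmt."""
    fmt = fmt.lower()
    if fmt in ("year", "yyyy", "yy"):
        return date(d.year, 1, 1)       # truncate to the first day of the year
    if fmt in ("month", "mon", "mm"):
        return date(d.year, d.month, 1)  # truncate to the first day of the month
    return None  # Spark returns NULL for formats it does not recognize

print(trunc(date(2019, 8, 17), "year"))  # 2019-01-01
print(trunc(date(2019, 8, 17), "mm"))    # 2019-08-01
```

Swapping the arguments, as the old benchmark code did, makes every call hit the NULL path instead of exercising the truncation logic, which is why the parameter order mattered for the benchmark.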
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on code in PR #46261: URL: https://github.com/apache/spark/pull/46261#discussion_r1582049184 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeBenchmark.scala: ##
```
@@ -171,7 +177,7 @@ object DateTimeBenchmark extends SqlBasedBenchmark {
     run(n, "to timestamp str", timestampStrExpr)
     run(n, "to_timestamp", s"to_timestamp($timestampStrExpr, $pattern)")
     run(n, "to_unix_timestamp", s"to_unix_timestamp($timestampStrExpr, $pattern)")
-    val dateStrExpr = "concat('2019-01-', lpad(mod(id, 25), 2, '0'))"
+    val dateStrExpr = "concat('2019-01-', lpad(mod(id, 25) + 1, 2, '0'))"
```
Review Comment: avoid invalid date value `2019-01-00`
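The off-by-one matters because `mod(id, 25)` yields 0 through 24, so the generated strings include the invalid day `00`; adding 1 shifts the range to 1 through 25. A quick Python check of the two expressions (illustrative only, mirroring the SQL `concat`/`lpad`/`mod` combination):

```python
def date_str(id_, fixed=True):
    # Mirrors concat('2019-01-', lpad(mod(id, 25) [+ 1], 2, '0'))
    day = id_ % 25 + (1 if fixed else 0)
    return "2019-01-" + str(day).zfill(2)

print(date_str(0, fixed=False))  # 2019-01-00 -> rejected under ANSI date parsing
print(date_str(0, fixed=True))   # 2019-01-01
print(date_str(24, fixed=True))  # 2019-01-25
```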
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on code in PR #46261: URL: https://github.com/apache/spark/pull/46261#discussion_r1582048982 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeBenchmark.scala: ##
```
@@ -75,7 +81,7 @@ object DateTimeBenchmark extends SqlBasedBenchmark {
     doBenchmark(N, s"$dt + interval 1 month 2 day")
   }
   benchmark.addCase("date + interval(m, d, ms)") { _ =>
-    doBenchmark(N, s"$dt + interval 1 month 2 day 5 hour")
+    doBenchmarkAnsiOff(N, s"$dt + interval 1 month 2 day 5 hour")
```
Review Comment: the hour portion is illegal for ANSI date add
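The point of the comment above is that under ANSI mode, adding an interval with a time-of-day portion (here the `5 hour`) to a DATE is rejected, so that case must run with ANSI off. A hedged Python sketch of that validity check — the function name and month arithmetic are illustrative, not Spark's implementation, and day-overflow clamping is deliberately omitted:

```python
from datetime import date, timedelta

def ansi_date_add(d, months=0, days=0, hours=0):
    """Sketch of ANSI-style DATE + interval: a DATE accepts year-month and
    day intervals, but an hour (sub-day) portion is rejected."""
    if hours != 0:
        raise ValueError("cannot add an interval with a time part to a DATE under ANSI mode")
    month_index = d.month - 1 + months          # 0-based month arithmetic
    shifted = date(d.year + month_index // 12, month_index % 12 + 1, d.day)
    return shifted + timedelta(days=days)

print(ansi_date_add(date(2019, 1, 15), months=1, days=2))  # 2019-02-17
# ansi_date_add(date(2019, 1, 15), hours=5) would raise ValueError
```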
[PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn opened a new pull request, #46261: URL: https://github.com/apache/spark/pull/46261 ### What changes were proposed in this pull request? This PR fixes several issues in org.apache.spark.sql.execution.benchmark.DateTimeBenchmark:
- Misuse of the `trunc` function: the two parameters were passed in reverse order
- Some benchmarks are not compatible with ANSI mode, which is on by default

### Why are the changes needed? Restore the benchmark cases. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? benchmark ### Was this patch authored or co-authored using generative AI tooling? no