Re: [PR] [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page [spark]
dongjoon-hyun closed pull request #46274: [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page URL: https://github.com/apache/spark/pull/46274 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page [spark]
dongjoon-hyun commented on PR #46274: URL: https://github.com/apache/spark/pull/46274#issuecomment-2081920450 Merged to master~
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun closed pull request #46271: [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` URL: https://github.com/apache/spark/pull/46271
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun commented on PR #46271: URL: https://github.com/apache/spark/pull/46271#issuecomment-2081891190 I attached the screenshots too. Merged to master~
[PR] [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page [spark]
HyukjinKwon opened a new pull request, #46274: URL: https://github.com/apache/spark/pull/46274 ### What changes were proposed in this pull request? This PR removes a space in the middle of a configuration name on the Arrow-optimized Python UDF page. ![Screenshot 2024-04-29 at 1 53 42 PM](https://github.com/apache/spark/assets/6477701/46b7c448-fb30-4838-a5ba-c8f1c23398fd) https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#arrow-python-udfs ### Why are the changes needed? So users can copy and paste the configuration names properly. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the doc. ### How was this patch tested? Manually built the docs and checked. ### Was this patch authored or co-authored using generative AI tooling? No.
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun commented on PR #46271: URL: https://github.com/apache/spark/pull/46271#issuecomment-2081887536 Thank you for helping me revise this doc, @yaooqinn !
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun commented on code in PR #46271: URL: https://github.com/apache/spark/pull/46271#discussion_r1582534717 ## docs/sql-ref-ansi-compliance.md: ## @@ -67,10 +67,8 @@ The following subsections present behaviour changes in arithmetic operations, ty ### Arithmetic Operations -In Spark SQL, arithmetic operations performed on numeric types (with the exception of decimal) are not checked for overflows by default. -This means that in case an operation causes overflows, the result is the same with the corresponding operation in a Java/Scala program (e.g., if the sum of 2 integers is higher than the maximum value representable, the result is a negative number). -On the other hand, Spark SQL returns null for decimal overflows. -When `spark.sql.ansi.enabled` is set to `true` and an overflow occurs in numeric and interval arithmetic operations, it throws an arithmetic exception at runtime. +In Spark SQL, by default, Spark throws an arithmetic exception at runtime for non-decimal numeric overflows and returns null for decimal type overflows. Review Comment: I agree that it looks much clearer and covers the missing `interval` part. Thanks!
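As a quick sanity check of the overflow semantics discussed in the doc change above (a plain-Scala sketch, independent of Spark itself): unchecked `Int` arithmetic on the JVM wraps around exactly as the old non-ANSI default described, while the checked JVM counterpart `Math.addExact` throws, mirroring what ANSI mode does at the SQL level.

```scala
// Non-ANSI Spark SQL mirrors JVM semantics: Int overflow silently wraps,
// so the sum of two large positive Ints comes back negative.
val wrapped = Int.MaxValue + 1
assert(wrapped == Int.MinValue) // wraps to -2147483648

// ANSI mode raises an error instead; the analogous checked JVM operation
// is Math.addExact, which throws ArithmeticException on overflow.
val threw =
  try { Math.addExact(Int.MaxValue, 1); false }
  catch { case _: ArithmeticException => true }
assert(threw)
```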
Re: [PR] [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image [spark]
dongjoon-hyun closed pull request #46258: [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image URL: https://github.com/apache/spark/pull/46258
Re: [PR] [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image [spark]
dongjoon-hyun commented on PR #46258: URL: https://github.com/apache/spark/pull/46258#issuecomment-2081880744 Merged to master.
[PR] [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]
cxzl25 opened a new pull request, #46273: URL: https://github.com/apache/spark/pull/46273 ### What changes were proposed in this pull request? This PR fixes SortShuffleWriter's missing shuffle-write-related metrics, which can lead to inaccurate data. ### Why are the changes needed? When the shuffle writer is SortShuffleWriter, it does not use SQLShuffleWriteMetricsReporter to update metrics, so when AQE obtains runtime statistics, the rowCount it gets is 0. Some optimization rules rely on rowCount statistics, such as `EliminateLimits`. Because rowCount is 0, the rule removes the limit operator, and we get results without the limit applied. https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L168-L172 https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2067-L2070 ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Production environment verification. **master metrics** https://github.com/apache/spark/assets/3898450/dc9b6e8a-93ec-4f59-a903-71aa5b11962c **PR metrics** https://github.com/apache/spark/assets/3898450/2d73b773-2dcc-4d23-81de-25dcadac86c1 ### Was this patch authored or co-authored using generative AI tooling? No
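The failure mode described above can be sketched without Spark. This is a hypothetical, simplified model (the names `ChildPlan` and `keepLimit` are illustrative, not Spark's real API): an `EliminateLimits`-style rule compares the limit against the child plan's row-count statistic, so an erroneously-zero rowCount makes any limit look redundant.

```scala
// Simplified model of how a rowCount statistic of 0 defeats a
// limit-elimination rule; illustrative names, not Spark's real API.
case class ChildPlan(rows: Seq[Int], reportedRowCount: Long)

// A Limit is redundant when statistics claim the child can never
// produce more than `limit` rows.
def keepLimit(limit: Int, child: ChildPlan): Boolean =
  child.reportedRowCount > limit

// Buggy metrics report 0 rows even though the child produces 100.
val child = ChildPlan(rows = Seq.tabulate(100)(identity), reportedRowCount = 0L)

// The rule "eliminates" the limit ...
assert(!keepLimit(10, child))

// ... so the query returns all 100 rows instead of at most 10.
val result = if (keepLimit(10, child)) child.rows.take(10) else child.rows
assert(result.size == 100)
```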
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
yaooqinn commented on PR #46270: URL: https://github.com/apache/spark/pull/46270#issuecomment-2081840828 Merged to master(4.0.0), 3.5.2 and 3.4.4.
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
yaooqinn closed pull request #46270: [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark URL: https://github.com/apache/spark/pull/46270
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun commented on PR #46271: URL: https://github.com/apache/spark/pull/46271#issuecomment-2081838017 Thank you so much, @yaooqinn ! I updated the PR according to your comments.
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
yaooqinn commented on PR #46270: URL: https://github.com/apache/spark/pull/46270#issuecomment-2081825069 Thank you very much @dongjoon-hyun
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
yaooqinn commented on PR #46271: URL: https://github.com/apache/spark/pull/46271#issuecomment-2081821772 LGTM, only [content here](https://github.com/apache/spark/pull/46271/files#diff-54eee79bd27cf5ca1288b078a7b0b1b5ae8ae9d8b4ee7fb75f0b0c7cdaef0da8L70-L73) might need further revision
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on PR #46264: URL: https://github.com/apache/spark/pull/46264#issuecomment-2081820429 cc @gengliangwang
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582501170 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -124,8 +130,8 @@ object LogKeys { case object DIFF_DELTA extends LogKey case object DIVISIBLE_CLUSTER_INDICES_SIZE extends LogKey case object DRIVER_ID extends LogKey - case object DROPPED_PARTITIONS extends LogKey Review Comment: Rename `DROPPED_PARTITIONS` to `NUM_DROPPED_PARTITIONS`
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
dongjoon-hyun commented on PR #46270: URL: https://github.com/apache/spark/pull/46270#issuecomment-2081817478 Feel free to merge and backport wherever you need this, @yaooqinn .
Re: [PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun commented on PR #46271: URL: https://github.com/apache/spark/pull/46271#issuecomment-2081817738 Could you review this documentation PR, @yaooqinn ?
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582501464 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -191,20 +203,23 @@ object LogKeys { case object HIVE_OPERATION_TYPE extends LogKey case object HOST extends LogKey case object HOST_PORT extends LogKey - case object IDENTIFIER extends LogKey Review Comment: Currently, only two places use this key, and both mean the name of the table.
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582501097 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -111,7 +117,7 @@ object LogKeys { case object DATA_FILE_NUM extends LogKey case object DATA_SOURCE extends LogKey case object DATA_SOURCES extends LogKey - case object DATA_SOURCE_PROVIDER extends LogKey Review Comment: Unified as `LogKeys.DATA_SOURCE`, remove `LogKeys.DATA_SOURCE_PROVIDER`
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582500958 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -191,20 +203,23 @@ object LogKeys { case object HIVE_OPERATION_TYPE extends LogKey case object HOST extends LogKey case object HOST_PORT extends LogKey - case object IDENTIFIER extends LogKey Review Comment: Unified as `LogKeys.TABLE_NAME`, remove `LogKeys.IDENTIFIER`
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582500836 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -276,31 +299,40 @@ object LogKeys { case object NUM_FILES_REUSED extends LogKey case object NUM_FREQUENT_ITEMS extends LogKey case object NUM_ITERATIONS extends LogKey + case object NUM_LEFT_PARTITION_VALUES extends LogKey + case object NUM_LOADED_ENTRIES extends LogKey case object NUM_LOCAL_FREQUENT_PATTERN extends LogKey case object NUM_PARTITION extends LogKey + case object NUM_PARTITION_VALUES extends LogKey case object NUM_POINT extends LogKey case object NUM_PREFIXES extends LogKey + case object NUM_RIGHT_PARTITION_VALUES extends LogKey case object NUM_SEQUENCES extends LogKey + case object NUM_VERSIONS_RETAIN extends LogKey + case object OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD extends LogKey case object OBJECT_ID extends LogKey case object OFFSET extends LogKey case object OFFSETS extends LogKey case object OFFSET_SEQUENCE_METADATA extends LogKey case object OLD_BLOCK_MANAGER_ID extends LogKey case object OLD_VALUE extends LogKey + case object OPEN_COST_IN_BYTES extends LogKey case object OPTIMIZED_PLAN_COLUMNS extends LogKey case object OPTIMIZER_CLASS_NAME extends LogKey case object OPTIONS extends LogKey case object OP_ID extends LogKey case object OP_TYPE extends LogKey + case object OUTPUT extends LogKey case object OVERHEAD_MEMORY_SIZE extends LogKey case object PARSE_MODE extends LogKey + case object PARTITIONED_FILE extends LogKey case object PARTITIONED_FILE_READER extends LogKey - case object PARTITIONS_SIZE extends LogKey Review Comment: Unified as `LogKeys.COUNT`, remove `LogKeys.PARTITIONS_SIZE`
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582500556 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -349,11 +387,11 @@ object LogKeys { case object RETRY_COUNT extends LogKey case object RETRY_INTERVAL extends LogKey case object RIGHT_EXPR extends LogKey + case object RIGHT_LOGICAL_PLAN_STATS_SIZE_IN_BYTES extends LogKey case object RMSE extends LogKey case object ROCKS_DB_LOG_LEVEL extends LogKey case object ROCKS_DB_LOG_MESSAGE extends LogKey case object RPC_ENDPOINT_REF extends LogKey - case object RULE_BATCH_NAME extends LogKey Review Comment: Unified as `LogKeys.BATCH_NAME`, remove `LogKeys.RULE_BATCH_NAME`
Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
yaooqinn commented on PR #46266: URL: https://github.com/apache/spark/pull/46266#issuecomment-2081812844 Pending CI results https://github.com/yaooqinn/spark/actions/runs/8872679083 https://github.com/yaooqinn/spark/actions/runs/8872179637
[PR] [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` [spark]
dongjoon-hyun opened a new pull request, #46271: URL: https://github.com/apache/spark/pull/46271 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
yaooqinn commented on code in PR #46270: URL: https://github.com/apache/spark/pull/46270#discussion_r1582497889 ## core/src/test/scala/org/apache/spark/MapStatusesSerDeserBenchmark.scala: ## @@ -123,7 +123,6 @@ object MapStatusesSerDeserBenchmark extends BenchmarkBase { } override def afterAll(): Unit = { -tracker.stop() Review Comment: yes, the error was suppressed by tryLogNonFatalError
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582496007 ## sql/core/src/main/scala/org/apache/spark/sql/execution/r/ArrowRRunner.scala: ## @@ -161,17 +166,14 @@ class ArrowRRunner( val input = dataStream.readDouble val compute = dataStream.readDouble val output = dataStream.readDouble - logInfo( -("Times: boot = %.3f s, init = %.3f s, broadcast = %.3f s, " + - "read-input = %.3f s, compute = %.3f s, write-output = %.3f s, " + - "total = %.3f s").format( - boot, - init, - broadcast, - input, - compute, - output, - boot + init + broadcast + input + compute + output)) + logInfo(log"Times: boot = ${MDC(LogKeys.BOOT, format(boot))} s, " + Review Comment: I'm not sure whether to prefix the following keys with `R_`
Re: [PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
cloud-fan commented on code in PR #46263: URL: https://github.com/apache/spark/pull/46263#discussion_r1582495818 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala: ## @@ -120,34 +132,49 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J hasHitSelectiveFilter = hasHitSelectiveFilter || isLikelySelective(condition), currentPlan, targetKey) - case ExtractEquiJoinKeys(_, lkeys, rkeys, _, _, left, right, _) => + case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) => // Runtime filters use one side of the [[Join]] to build a set of join key values and prune // the other side of the [[Join]]. It's also OK to use a superset of the join key values // (ignore null values) to do the pruning. // We assume other rules have already pushed predicates through join if possible. // So the predicate references won't pass on anymore. if (left.output.exists(_.semanticEquals(targetKey))) { - extract(left, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false, -currentPlan = left, targetKey = targetKey).orElse { -// We can also extract from the right side if the join keys are transitive. Review Comment: let's enrich this comment following https://github.com/apache/spark/pull/46263/files#r1582495367
Re: [PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
cloud-fan commented on code in PR #46263: URL: https://github.com/apache/spark/pull/46263#discussion_r1582492374 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala: ## @@ -120,34 +132,49 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J hasHitSelectiveFilter = hasHitSelectiveFilter || isLikelySelective(condition), currentPlan, targetKey) - case ExtractEquiJoinKeys(_, lkeys, rkeys, _, _, left, right, _) => + case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) => // Runtime filters use one side of the [[Join]] to build a set of join key values and prune // the other side of the [[Join]]. It's also OK to use a superset of the join key values // (ignore null values) to do the pruning. Review Comment: I'm glad we described the idea in the comments. It's clear that for certain join types, the join child output is not a superset of the join output for transitive join keys.
Re: [PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
cloud-fan commented on code in PR #46263: URL: https://github.com/apache/spark/pull/46263#discussion_r1582495367

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:

```
@@ -120,34 +132,49 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J
         hasHitSelectiveFilter = hasHitSelectiveFilter || isLikelySelective(condition),
         currentPlan, targetKey)
-      case ExtractEquiJoinKeys(_, lkeys, rkeys, _, _, left, right, _) =>
+      case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) =>
         // Runtime filters use one side of the [[Join]] to build a set of join key values and prune
         // the other side of the [[Join]]. It's also OK to use a superset of the join key values
         // (ignore null values) to do the pruning.
         // We assume other rules have already pushed predicates through join if possible.
         // So the predicate references won't pass on anymore.
         if (left.output.exists(_.semanticEquals(targetKey))) {
-          extract(left, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false,
```

Review Comment: Let's clarify the expected strategy a bit more.

For an exact join key match, like the left table here, it is always OK to generate the runtime filter using the left table, no matter what the join type is. This is because the left table always produces a superset of the join output with respect to the left keys.

For a transitive join key match, it is different: the right table does not always produce a superset of the join output with respect to the left keys. Let's look at an example:

```
left table: 1, 2, 3
right table: 3, 4
left outer join output: (1, null), (2, null), (3, 3)
left keys: 1, 2, 3
```

So we can't use the right table to generate the runtime filter.
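The join-type restriction described above can be sketched language-agnostically. The following is an illustrative Python toy model (plain lists and sets standing in for Spark tables; none of these names are Spark APIs) of exactly the left-outer-join example from the review comment:

```python
# Toy model of the review example: a left outer join between
# left = [1, 2, 3] and right = [3, 4] on key equality.
left = [1, 2, 3]
right = [3, 4]

# A left outer join keeps every left row, padding unmatched rows with None.
join_output = [(l, l if l in right else None) for l in left]
assert join_output == [(1, None), (2, None), (3, 3)]

# Exact key match: the left keys appearing in the join output are always a
# subset of the left table's keys, so a runtime filter built from `left`
# is safe regardless of join type.
left_keys_in_output = {l for l, _ in join_output}
assert left_keys_in_output <= set(left)

# Transitive key match: a runtime filter built from `right` would prune
# keys 1 and 2, which DO appear in the left outer join output -- wrong.
pruned = [row for row in join_output if row[0] in set(right)]
assert pruned != join_output  # rows (1, None) and (2, None) were lost
```

The last assertion is the whole point: pruning the left side with the right side's key set changes the left outer join result, which is why `InjectRuntimeFilter` must check the child join type.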
Re: [PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
beliefer commented on code in PR #46263: URL: https://github.com/apache/spark/pull/46263#discussion_r1582090760

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:

```
@@ -86,6 +87,17 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J
   private def extractSelectiveFilterOverScan(
       plan: LogicalPlan,
       filterCreationSideKey: Expression): Option[(Expression, LogicalPlan)] = {
+
+    def canExtractRight(joinType: JoinType): Boolean = joinType match {
+      case Inner | LeftSemi | RightOuter => true
+      case _ => false
+    }
+
+    def canExtractLeft(joinType: JoinType): Boolean = joinType match {
+      case Inner | LeftSemi | LeftOuter | LeftAnti => true
+      case _ => false
+    }
```

Review Comment: Shall we add the two methods into `joins.scala`?
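The intent of the two Scala helpers in the diff above can be mirrored in a few lines of Python (join types modeled as plain strings here, not Spark's `JoinType` classes; this is an illustrative sketch, not Spark code):

```python
# Which child of a join may serve as the runtime-filter creation side,
# per join type -- mirroring canExtractRight/canExtractLeft above.
def can_extract_right(join_type: str) -> bool:
    # The right child's keys are a superset of the join output's keys
    # only for these join types.
    return join_type in {"Inner", "LeftSemi", "RightOuter"}

def can_extract_left(join_type: str) -> bool:
    return join_type in {"Inner", "LeftSemi", "LeftOuter", "LeftAnti"}

# A left outer join keeps unmatched left rows, so only the left child is safe.
assert can_extract_left("LeftOuter") and not can_extract_right("LeftOuter")
# An inner join drops unmatched rows on both sides, so either child works.
assert can_extract_left("Inner") and can_extract_right("Inner")
```

Note the asymmetry is exactly the superset argument from the earlier review comment: a child is usable only when its key values cover every key that can appear in the join output.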
Re: [PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
dongjoon-hyun commented on code in PR #46270: URL: https://github.com/apache/spark/pull/46270#discussion_r1582489367

## core/src/test/scala/org/apache/spark/MapStatusesSerDeserBenchmark.scala:

```
@@ -123,7 +123,6 @@ object MapStatusesSerDeserBenchmark extends BenchmarkBase {
   }
   override def afterAll(): Unit = {
-    tracker.stop()
```

Review Comment: Oh, this seems to exist since Apache Spark 3.0.0, doesn't it? - https://github.com/apache/spark/pull/26169
[PR] [SPARK-48034][TESTS] NullPointerException in MapStatusesSerDeserBenchmark [spark]
yaooqinn opened a new pull request, #46270: URL: https://github.com/apache/spark/pull/46270

### What changes were proposed in this pull request?

This PR fixes an NPE in MapStatusesSerDeserBenchmark. The cause is that we try to stop the tracker twice.

```
java.lang.NullPointerException: Cannot invoke "org.apache.spark.rpc.RpcEndpointRef.askSync(Object, scala.reflect.ClassTag)" because the return value of "org.apache.spark.MapOutputTracker.trackerEndpoint()" is null
	at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:541)
	at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:551)
	at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:1242)
	at org.apache.spark.SparkEnv.stop(SparkEnv.scala:112)
	at org.apache.spark.SparkContext.$anonfun$stop$25(SparkContext.scala:2354)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1294)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:2354)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:2259)
	at org.apache.spark.MapStatusesSerDeserBenchmark$.afterAll(MapStatusesSerDeserBenchmark.scala:128)
	at org.apache.spark.benchmark.BenchmarkBase.main(BenchmarkBase.scala:80)
	at org.apache.spark.MapStatusesSerDeserBenchmark.main(MapStatusesSerDeserBenchmark.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.apache.spark.benchmark.Benchmarks$.$anonfun$main$7(Benchmarks.scala:128)
	at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1323)
	at org.apache.spark.benchmark.Benchmarks$.main(Benchmarks.scala:91)
	at org.apache.spark.benchmark.Benchmarks.main(Benchmarks.scala)
```

### Why are the changes needed?

Test bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.
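The underlying hazard in this report is a non-idempotent shutdown path: the first `stop()` clears the tracker endpoint, so the second `stop()` (from `SparkContext.stop`) dereferences null. A minimal Python sketch of that failure mode (a toy stand-in, not Spark's actual classes):

```python
class Tracker:
    """Toy stand-in for MapOutputTrackerMaster: stop() messages the
    endpoint, then clears it, so a second stop() fails."""
    def __init__(self):
        self.endpoint = "endpoint"

    def send_tracker(self):
        # Fails once the endpoint is cleared -- the Python analogue of the
        # NPE on trackerEndpoint() in the stack trace above.
        return self.endpoint.upper()

    def stop(self):
        self.send_tracker()
        self.endpoint = None

tracker = Tracker()
tracker.stop()            # first stop (benchmark's afterAll) succeeds
try:
    tracker.stop()        # second stop (via SparkContext.stop) blows up
    second_stop_ok = True
except AttributeError:
    second_stop_ok = False
assert not second_stop_ok
```

The PR's fix is the simpler option of removing the redundant explicit `tracker.stop()` and letting `SparkContext.stop()` do it once; the alternative design would be making `stop()` itself idempotent.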
Re: [PR] [SPARK-47954][K8S] Support creating ingress entry for external UI access [spark]
dongjoon-hyun commented on PR #46184: URL: https://github.com/apache/spark/pull/46184#issuecomment-2081769824 Gentle ping, @pan3793 . It would be great if we can land this before Apache Spark 4.0.0-preview. To @cloud-fan , when do you think you are going to cut `4.0.0-preview` tag? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582479845 ## sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala: ## @@ -340,11 +342,11 @@ trait AggregateCodegenSupport } Some(splitCodes) } else { -val errMsg = "Failed to split aggregate code into small functions because the parameter " + - "length of at least one split function went over the JVM limit: " + - CodeGenerator.MAX_JVM_METHOD_PARAMS_LENGTH +val errMsg = log"Failed to split aggregate code into small functions because the " + + log"parameter length of at least one split function went over the JVM limit: " + + log"${MDC(MAX_JVM_METHOD_PARAMS_LENGTH, CodeGenerator.MAX_JVM_METHOD_PARAMS_LENGTH)}" if (Utils.isTesting) { - throw SparkException.internalError(errMsg) + throw SparkException.internalError(errMsg.message) } else { logInfo(errMsg) Review Comment: For it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
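The `log"..."`/MDC migration discussed in this thread keeps a human-readable message while also carrying machine-readable keyed values. A rough Python illustration of that idea (not Spark's actual logging API; the 255 value is only an illustrative JVM parameter limit):

```python
def log_entry(template: str, **mdc):
    """Render a log message while retaining the keyed values (MDC) for
    structured output -- the idea behind the structured logging framework."""
    return {"message": template.format(**mdc), "context": mdc}

entry = log_entry(
    "Failed to split aggregate code into small functions because the "
    "parameter length of at least one split function went over the "
    "JVM limit: {max_jvm_method_params_length}",
    max_jvm_method_params_length=255,
)
assert entry["context"] == {"max_jvm_method_params_length": 255}
assert entry["message"].endswith("JVM limit: 255")
```

This is also why the review diff calls `errMsg.message` when throwing: the exception wants only the rendered string, while `logInfo` can consume the whole structured entry.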
Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
yaooqinn commented on PR #46266: URL: https://github.com/apache/spark/pull/46266#issuecomment-2081767859 Thank you @dongjoon-hyun. Unfortunately, they timed out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48032][BUILD] Upgrade `commons-codec` to 1.17.0 [spark]
dongjoon-hyun commented on PR #46268: URL: https://github.com/apache/spark/pull/46268#issuecomment-2081766387 Thank you, @panbingkun . Merged to master for Apache Spark 4.0.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48032][BUILD] Upgrade `commons-codec` to 1.17.0 [spark]
dongjoon-hyun closed pull request #46268: [SPARK-48032][BUILD] Upgrade `commons-codec` to 1.17.0 URL: https://github.com/apache/spark/pull/46268 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582478002 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -468,7 +467,7 @@ abstract class Optimizer(catalogManager: CatalogManager) } else if (filteredRules.nonEmpty) { Some(Batch(batch.name, batch.strategy, filteredRules: _*)) } else { - logInfo(log"Optimization batch '${MDC(RULE_BATCH_NAME, batch.name)}' " + + logInfo(log"Optimization batch '${MDC(LogKeys.RULE_BATCH_NAME, batch.name)}' " + Review Comment: Unified as `LogKeys.BATCH_NAME`, remove `LogKeys.RULE_BATCH_NAME` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582474060 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala: ## @@ -494,7 +495,7 @@ object DataSourceStrategy val partitionSet = AttributeSet(partitionColumns) val predicates = ExpressionSet(normalizedFilters .flatMap(extractPredicatesWithinOutputSet(_, partitionSet))) - logInfo(s"Pruning directories with: ${predicates.mkString(",")}") + logInfo(log"Pruning directories with: ${MDC(PREDICATES, predicates.mkString(","))}") Review Comment: I'm not sure if `EXPRS` is more general. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels [spark]
dongjoon-hyun commented on PR #46149: URL: https://github.com/apache/spark/pull/46149#issuecomment-2081755480 I added you (Xi Chen) to Apache Spark contributor group and assigned SPARK-47730 to you. Welcome to the Apache Spark community, @jshmchenxi ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582473304 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala: ## @@ -233,7 +233,7 @@ class FileScanRDD( if (files.hasNext) { currentFile = files.next() updateMetadataRow() - logInfo(s"Reading File $currentFile") + logInfo(log"Reading File ${MDC(PARTITIONED_FILE, currentFile)}") Review Comment: The class of `currentFile` is `PartitionedFile`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels [spark]
dongjoon-hyun closed pull request #46149: [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels URL: https://github.com/apache/spark/pull/46149 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46264: URL: https://github.com/apache/spark/pull/46264#discussion_r1582471773 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala: ## @@ -39,7 +39,7 @@ class FilePartitionReader[T]( if (currentReader == null) { if (files.hasNext) { val file = files.next() -logInfo(s"Reading file $file") +logInfo(log"Reading file ${MDC(PARTITIONED_FILE, file)}") Review Comment: The class of `file` is `PartitionedFile`, So `LogKeys.PATH` is not used https://github.com/apache/spark/blob/8c446f35dc038e92ddc36f375c304a8879ac6f5f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L70-L72 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]
HyukjinKwon commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1582469363 ## python/pyspark/sql/datasource_internal.py: ## @@ -0,0 +1,146 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + + +import json +import copy +from itertools import chain +from typing import Iterator, List, Optional, Sequence, Tuple + +from pyspark.sql.datasource import ( +DataSource, +DataSourceStreamReader, +InputPartition, +SimpleDataSourceStreamReader, +) +from pyspark.sql.types import StructType +from pyspark.errors import PySparkNotImplementedError + + +def _streamReader(datasource: DataSource, schema: StructType) -> "DataSourceStreamReader": +""" +Fallback to simpleStreamReader() method when streamReader() is not implemented. +This should be invoked whenever a DataSourceStreamReader needs to be created instead of +invoking datasource.streamReader() directly. +""" +try: +return datasource.streamReader(schema=schema) +except PySparkNotImplementedError: +return _SimpleStreamReaderWrapper(datasource.simpleStreamReader(schema=schema)) + + +class SimpleInputPartition(InputPartition): Review Comment: here https://github.com/apache/spark/tree/master/python/docs/source/user_guide/sql. 
feel free to do it separately, I don't mind. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]
HeartSaVioR commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1582466426 ## python/pyspark/sql/datasource_internal.py: ## @@ -0,0 +1,146 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + + +import json +import copy +from itertools import chain +from typing import Iterator, List, Optional, Sequence, Tuple + +from pyspark.sql.datasource import ( +DataSource, +DataSourceStreamReader, +InputPartition, +SimpleDataSourceStreamReader, +) +from pyspark.sql.types import StructType +from pyspark.errors import PySparkNotImplementedError + + +def _streamReader(datasource: DataSource, schema: StructType) -> "DataSourceStreamReader": +""" +Fallback to simpleStreamReader() method when streamReader() is not implemented. +This should be invoked whenever a DataSourceStreamReader needs to be created instead of +invoking datasource.streamReader() directly. 
+""" +try: +return datasource.streamReader(schema=schema) +except PySparkNotImplementedError: +return _SimpleStreamReaderWrapper(datasource.simpleStreamReader(schema=schema)) + + +class SimpleInputPartition(InputPartition): +def __init__(self, start: dict, end: dict): +self.start = start +self.end = end + + +class PrefetchedCacheEntry: +def __init__(self, start: dict, end: dict, iterator: Iterator[Tuple]): +self.start = start +self.end = end +self.iterator = iterator + + +class _SimpleStreamReaderWrapper(DataSourceStreamReader): +""" +A private class that wrap :class:`SimpleDataSourceStreamReader` in prefetch and cache pattern, +so that :class:`SimpleDataSourceStreamReader` can integrate with streaming engine like an +ordinary :class:`DataSourceStreamReader`. + +current_offset tracks the latest progress of the record prefetching, it is initialized to be +initialOffset() when query start for the first time or initialized to be the end offset of +the last committed batch when query restarts. + +When streaming engine calls latestOffset(), the wrapper calls read() that starts from +current_offset, prefetches and cache the data, then updates the current_offset to be +the end offset of the new data. + +When streaming engine call planInputPartitions(start, end), the wrapper get the prefetched data +from cache and send it to JVM along with the input partitions. + +When query restart, batches in write ahead offset log that has not been committed will be +replayed by reading data between start and end offset through read2(start, end). 
+""" + +def __init__(self, simple_reader: SimpleDataSourceStreamReader): +self.simple_reader = simple_reader +self.initial_offset: Optional[dict] = None +self.current_offset: Optional[dict] = None +self.cache: List[PrefetchedCacheEntry] = [] + +def initialOffset(self) -> dict: +if self.initial_offset is None: +self.initial_offset = self.simple_reader.initialOffset() +return self.initial_offset + +def latestOffset(self) -> dict: +# when query start for the first time, use initial offset as the start offset. Review Comment: Let's avoid randomness - you can manipulate both tests on restarting 1) restarting from the query which does not have leftover batch 2) restarting from the query which does have leftover batch (planned-but-yet-to-be-committed). We have several tests which adjusts offset log and commit log to test the behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
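The prefetch-and-cache pattern described in the `_SimpleStreamReaderWrapper` docstring can be reduced to a small sketch. This is an illustrative toy (simplified stand-ins, not the PySpark classes; offsets are plain dicts as in the PR):

```python
class SimpleReader:
    """Toy simple reader: read(start) returns all records available
    after `start` plus the new end offset."""
    def initial_offset(self):
        return {"pos": 0}

    def read(self, start):
        records = list(range(start["pos"], start["pos"] + 3))
        return records, {"pos": start["pos"] + 3}

class Wrapper:
    """Adapts SimpleReader to a latestOffset/planInputPartitions engine."""
    def __init__(self, reader):
        self.reader = reader
        self.current = reader.initial_offset()
        self.cache = []  # prefetched batches: (start, end, records)

    def latest_offset(self):
        # Prefetch from the current offset, cache the records, advance.
        start = self.current
        records, end = self.reader.read(start)
        self.cache.append((start, end, records))
        self.current = end
        return end

    def plan_partitions(self, start, end):
        # Serve a planned batch from the prefetch cache instead of
        # re-reading the source.
        for s, e, records in self.cache:
            if s == start and e == end:
                return records
        raise KeyError("batch not in cache")

w = Wrapper(SimpleReader())
end1 = w.latest_offset()
assert w.plan_partitions({"pos": 0}, end1) == [0, 1, 2]
end2 = w.latest_offset()
assert w.plan_partitions(end1, end2) == [3, 4, 5]
```

A restart replaying a planned-but-uncommitted batch would bypass the (empty) cache and re-read the `(start, end)` range directly, which is the `read2(start, end)` path the docstring mentions.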
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1582463921 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala: ## @@ -51,13 +51,30 @@ abstract class QueryStageExec extends LeafExecNode { */ val plan: SparkPlan + /** + * Name of this query stage which is unique in the entire query plan. + */ + val name: String = s"${this.getClass.getSimpleName}-$id" + + /** + * This flag aims to detect if the stage materialization is started. This helps + * to avoid unnecessary stage materialization when the stage is canceled. + */ + private val materializationStarted = new AtomicBoolean() Review Comment: I agree that we need clean API definition so please find new refactoring on this direction by last commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
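The `materializationStarted` flag in the diff above gates cancellation on whether the stage ever started running. A minimal Python sketch of that guard (toy classes, not Spark's `QueryStageExec`; `threading.Event` stands in for `AtomicBoolean`):

```python
from threading import Event

class QueryStage:
    """Toy stage: a materialization-started flag guards cancel()."""
    def __init__(self, name):
        self.name = name
        self._started = Event()       # analogue of the AtomicBoolean
        self.cancelled_work = False

    def materialize(self):
        self._started.set()
        # ... submit the shuffle/broadcast job here ...

    def cancel(self):
        # Only tear down running work; a stage that never materialized
        # has nothing to cancel, so skip the expensive path entirely.
        if self._started.is_set():
            self.cancelled_work = True

s1 = QueryStage("ShuffleQueryStageExec-1")
s2 = QueryStage("ShuffleQueryStageExec-2")
s1.materialize()
s1.cancel()
s2.cancel()   # never materialized: no-op
assert s1.cancelled_work and not s2.cancelled_work
```

This mirrors the PR's goal of avoiding unnecessary materialization of `ExchangeQueryStageExec` on cancellation: cancellation must not itself force the stage's future into existence.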
Re: [PR] [SPARK-47292][SS] safeMapToJValue should consider null typed values [spark]
HeartSaVioR commented on PR #46260: URL: https://github.com/apache/spark/pull/46260#issuecomment-2081738599 Late +1. Thanks @WweiL ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1582461066

## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala:
```diff
@@ -198,18 +215,23 @@ case class ShuffleQueryStageExec(
     reuse
   }
 
-  override def cancel(): Unit = shuffleFuture match {
-    case action: FutureAction[MapOutputStatistics] if !action.isCompleted =>
-      action.cancel()
-    case _ =>
+  override def cancel(): Unit = {
```

Review Comment: Added `doCancel()` to `ExchangeQueryStageExec`, and both QueryStageExecs implement it. Please see the last commit.
[PR] [SPARK-48033][VARIANT] Fix `RuntimeReplaceable` expressions being used in default columns [spark]
richardc-db opened a new pull request, #46269: URL: https://github.com/apache/spark/pull/46269

### What changes were proposed in this pull request?

Currently, default columns whose default is a `RuntimeReplaceable` expression fail. This is because the `AlterTableCommand` constant-folds before replacing expressions with the actual implementation. For example:
```
sql(s"CREATE TABLE t(v VARIANT DEFAULT parse_json('1')) USING PARQUET")
sql("INSERT INTO t VALUES(DEFAULT)")
```
fails because `parse_json` is `RuntimeReplaceable` and is evaluated before the analyzer inserts the correct expression into the plan.

### Why are the changes needed?

This allows default columns to use expressions that are `RuntimeReplaceable`. This is especially important for Variant types because literal variants are difficult to create - `parse_json` will likely be used the majority of the time.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

added UT

### Was this patch authored or co-authored using generative AI tooling?

no
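As a toy illustration of the ordering bug described above (all names are hypothetical; this is not Spark's expression framework): constant-folding a runtime-replaceable expression before it has been replaced with its real implementation fails, while replacing first and then folding succeeds.

```python
class Literal:
    """A foldable expression holding a plain value."""
    def __init__(self, value):
        self.value = value

    def eval(self):
        return self.value


class ParseJson:
    """Stand-in for a RuntimeReplaceable expression: it cannot be
    evaluated directly, only its replacement can."""
    def __init__(self, arg):
        self.arg = arg

    def eval(self):
        raise RuntimeError("Cannot evaluate RuntimeReplaceable directly")

    def replacement(self):
        # Pretend replacement: build the "real" value the expression stands for.
        return Literal(("variant", self.arg))


def constant_fold(expr):
    """Eagerly evaluate an expression down to a Literal."""
    return Literal(expr.eval())


expr = ParseJson("1")

# Buggy order: fold before replacing -> evaluation blows up.
try:
    constant_fold(expr)
    buggy_ok = True
except RuntimeError:
    buggy_ok = False

# Fixed order: replace first, then fold.
fixed = constant_fold(expr.replacement())
print(buggy_ok, fixed.value)  # False ('variant', '1')
```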
Re: [PR] [SPARK-47292][SS] safeMapToJValue should consider null typed values [spark]
HyukjinKwon closed pull request #46260: [SPARK-47292][SS] safeMapToJValue should consider null typed values URL: https://github.com/apache/spark/pull/46260
Re: [PR] [SPARK-47292][SS] safeMapToJValue should consider null typed values [spark]
HyukjinKwon commented on PR #46260: URL: https://github.com/apache/spark/pull/46260#issuecomment-2081724885 Merged to master.
Re: [PR] [SPARK-48014][SQL] Change the makeFromJava error in EvaluatePython to a user-facing error [spark]
HyukjinKwon closed pull request #46250: [SPARK-48014][SQL] Change the makeFromJava error in EvaluatePython to a user-facing error URL: https://github.com/apache/spark/pull/46250
Re: [PR] [SPARK-48014][SQL] Change the makeFromJava error in EvaluatePython to a user-facing error [spark]
HyukjinKwon commented on PR #46250: URL: https://github.com/apache/spark/pull/46250#issuecomment-2081724390 Merged to master.
Re: [PR] [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image [spark]
panbingkun commented on PR #46258: URL: https://github.com/apache/spark/pull/46258#issuecomment-2081715514 cc @zhengruifeng
Re: [PR] [SPARK-44635][CORE] Handle shuffle fetch failures in decommissions [spark]
github-actions[bot] closed pull request #42296: [SPARK-44635][CORE] Handle shuffle fetch failures in decommissions URL: https://github.com/apache/spark/pull/42296
Re: [PR] [TEST ONLY][SQL] Test resolve column references with PLAN_ID [spark]
github-actions[bot] commented on PR #43115: URL: https://github.com/apache/spark/pull/43115#issuecomment-2081715244 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-46744][SPARK-SHELL][SQL][CONNECT][PYTHON][R] Display clear `exit command` for all spark terminal [spark]
github-actions[bot] closed pull request #44769: [SPARK-46744][SPARK-SHELL][SQL][CONNECT][PYTHON][R] Display clear `exit command` for all spark terminal URL: https://github.com/apache/spark/pull/44769
Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]
HyukjinKwon commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1582440585

## python/pyspark/sql/datasource.py:
```diff
@@ -469,6 +501,188 @@ def stop(self) -> None:
         ...
 
 
+class SimpleInputPartition(InputPartition):
+    def __init__(self, start: dict, end: dict):
+        self.start = start
+        self.end = end
+
+
+class SimpleDataSourceStreamReader(ABC):
+    """
+    A base class for simplified streaming data source readers. Compared to
+    DataSourceStreamReader, SimpleDataSourceStreamReader doesn't require planning data
+    partitioning. Also, the read API of SimpleDataSourceStreamReader allows reading data
+    and planning the latest offset at the same time.
+
+    .. versionadded: 4.0.0
+    """
+
+    def initialOffset(self) -> dict:
+        """
+        Return the initial offset of the streaming data source.
+        A new streaming query starts reading data from the initial offset.
+        If Spark is restarting an existing query, it will restart from the check-pointed
+        offset rather than the initial one.
+
+        Returns
+        -------
+        dict
+            A dict or recursive dict whose key and value are primitive types, which
+            includes Integer, String and Boolean.
+
+        Examples
+        --------
+        >>> def initialOffset(self):
+        ...     return {"partition-1": {"index": 3, "closed": True}, "partition-2": {"index": 5}}
+        """
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "initialOffset"},
+        )
+
+    def read(self, start: dict) -> (Iterator[Tuple], dict):
+        """
+        Read all available data from the specified start offset and return the offset
+        that the next read attempt starts from.
+
+        Parameters
+        ----------
+        start : dict
+            The start offset to start reading from.
+
+        Returns
+        -------
+        A tuple of an iterator of :class:`Tuple` and a dict
+            The iterator contains all the available records after the start offset.
+            The dict is the end of this read attempt and the start of the next read attempt.
+        """
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "read"},
+        )
+
+    def read2(self, start: dict, end: dict) -> Iterator[Tuple]:
```

Review Comment: It can't have overloaded ones, but it can dispatch by embedding if-else and leveraging an optional argument, e.g.:
```python
def read(self, start: dict, end: dict = None) -> Union[Tuple[Iterator[Tuple], dict], Iterator[Tuple]]:
    if end is None:
        return  # logic for read(start)
    else:
        return  # logic for read(start, end)
```
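A runnable sketch of the optional-argument dispatch suggested in the review comment above. All class, method, and offset names here are hypothetical; this only demonstrates the pattern, not the actual PySpark `SimpleDataSourceStreamReader` API.

```python
from typing import Iterator, Optional, Tuple, Union


class CounterStreamReader:
    """Toy reader dispatching on an optional `end` argument:
    read(start) returns (rows, next_offset); read(start, end) replays a range."""

    def __init__(self, data):
        # Pretend source: a list of (index, value) rows.
        self.data = data

    def read(
        self, start: dict, end: Optional[dict] = None
    ) -> Union[Tuple[Iterator[tuple], dict], Iterator[tuple]]:
        if end is None:
            # read(start): return all available rows plus the next start offset.
            rows = [r for r in self.data if r[0] >= start["index"]]
            next_offset = {"index": start["index"] + len(rows)}
            return iter(rows), next_offset
        # read(start, end): deterministically replay a closed range.
        return iter(r for r in self.data if start["index"] <= r[0] < end["index"])


reader = CounterStreamReader([(0, "a"), (1, "b"), (2, "c")])
rows, nxt = reader.read({"index": 1})
print(list(rows), nxt)  # [(1, 'b'), (2, 'c')] {'index': 3}
print(list(reader.read({"index": 0}, {"index": 2})))  # [(0, 'a'), (1, 'b')]
```

The single entry point keeps the user-facing surface to one method name, at the cost of a `Union` return type, which is the trade-off the comment points out.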
Re: [PR] [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Remove `pyspark.sql.classic` reference in `pyspark.ml.stat` [spark]
HyukjinKwon closed pull request #46262: [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Remove `pyspark.sql.classic` reference in `pyspark.ml.stat` URL: https://github.com/apache/spark/pull/46262
Re: [PR] [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Remove `pyspark.sql.classic` reference in `pyspark.ml.stat` [spark]
HyukjinKwon commented on PR #46262: URL: https://github.com/apache/spark/pull/46262#issuecomment-2081701064 Thanks! Merged to master.
[PR] [SPARK-48032][BUILD] Upgrade `commons-codec` to 1.17.0 [spark]
panbingkun opened a new pull request, #46268: URL: https://github.com/apache/spark/pull/46268

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
dongjoon-hyun commented on PR #46266: URL: https://github.com/apache/spark/pull/46266#issuecomment-2081693301 Looking forward to seeing the result.
Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
dongjoon-hyun commented on PR #46266: URL: https://github.com/apache/spark/pull/46266#issuecomment-2081693160 Thank you!
Re: [PR] [SPARK-46350][SS] Fix state removal for stream-stream join with one watermark and one time-interval condition [spark]
neilramaswamy commented on code in PR #44323: URL: https://github.com/apache/spark/pull/44323#discussion_r1582300122

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:
```diff
@@ -219,10 +222,41 @@ object StreamingSymmetricHashJoinHelper extends Logging {
         attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
         condition,
         eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-    val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-    expr.map(JoinStateValueWatermarkPredicate.apply _)
+    // If the condition itself is empty (for example, left_time < left_time + INTERVAL ...),
+    // then we will not have generated a stateValueWatermark.
+    if (stateValueWatermark.isEmpty) {
+      None
+    } else {
+      // For example, if the condition is of the form:
+      //    left_time > right_time + INTERVAL 30 MINUTES
+      // Then this extracts left_time and right_time.
+      val attributesInCondition = AttributeSet(
+        condition.get.collect { case a: AttributeReference => a }
+      )
+
+      // Construct an AttributeSet so that we can perform equality between attributes,
+      // which we do in the filter below.
+      val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes)
+
+      // oneSideInputAttributes could be [left_value, left_time], and we just
+      // want the attribute _in_ the time-interval condition.
+      val oneSideStateWatermarkAttributes = attributesInCondition.filter { a =>
+        oneSideInputAttributeSet.contains(a)
+      }
+
+      // There should be a single attribute per side in the time-interval condition, so
+      // filtering for oneSideInputAttributes as done above should leave us with 1 attribute.
+      if (oneSideStateWatermarkAttributes.size == 1) {
+        val expr =
+          watermarkExpression(Some(oneSideStateWatermarkAttributes.head), stateValueWatermark)
+        expr.map(JoinStateValueWatermarkPredicate.apply _)
+      } else {
+        // This should never happen, since the grammar will ensure that we have one attribute
```

Review Comment: Good question. I thought more about this, and I actually think that I might be wrong in the case of an edge case we don't have in any of our tests. If the user does `left_time > right_time + m AND other_left_time > right_time + n`, there will be _three_ attributes in the condition. Then, `oneSideStateWatermarkAttributes.size` will be 2 (it will be `left_time` and `other_left_time`, neither of which are watermark attributes), and the condition that we need to return would be a conjunctive watermark predicate: `left_time <= watermark(right) + m AND other_left_time <= watermark(right) + n`. We can remove state, but I'm pretty sure the current implementation in Spark master would fail. I need to check this. It might be out of scope for this PR.
[PR] [WIP][SPARK-48031] view evolution [spark]
srielau opened a new pull request, #46267: URL: https://github.com/apache/spark/pull/46267

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
yaooqinn commented on PR #46266: URL: https://github.com/apache/spark/pull/46266#issuecomment-2081486023 Waiting for CI results https://github.com/yaooqinn/spark/actions/runs/8867970355 https://github.com/yaooqinn/spark/actions/runs/8867971118
[PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]
yaooqinn opened a new pull request, #46266: URL: https://github.com/apache/spark/pull/46266

### What changes were proposed in this pull request?

This PR aims to fix benchmark errors and regenerate benchmark results for Apache Spark 4.0.0 after turning ANSI on. The latest baseline has been updated by SPARK-47513.

### Why are the changes needed?

SPARK-4 turns ANSI on by default, so there could be performance-related issues.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.
Re: [PR] [SPARK-48030][SQL]SPJ: cache rowOrdering and structType for InternalRowComparableWrapper [spark]
advancedxy commented on PR #46265: URL: https://github.com/apache/spark/pull/46265#issuecomment-2081471602 @sunchao, @szehon-ho and @yabola would you mind taking a look and helping review this one?
Re: [PR] [SPARK-48030][SQL]SPJ: cache rowOrdering and structType for InternalRowComparableWrapper [spark]
advancedxy commented on code in PR #46265: URL: https://github.com/apache/spark/pull/46265#discussion_r1582148061

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/InternalRowComparableWrapper.scala:
```diff
@@ -53,6 +55,21 @@ class InternalRowComparableWrapper(val row: InternalRow, val dataTypes: Seq[Data
 }
 
 object InternalRowComparableWrapper {
+  private final val MAX_CACHE_ENTRIES = 1024
```

Review Comment: 1024 should be sufficient; it could be a SQL configuration, though.
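The kind of bounded cache under discussion, expensive derived objects (such as an ordering or struct type) keyed by a schema, capped at `MAX_CACHE_ENTRIES` entries, can be sketched in a few lines. This is a plain-Python LRU toy, not the PR's actual implementation; `BoundedCache` and `get_or_create` are hypothetical names.

```python
from collections import OrderedDict

MAX_CACHE_ENTRIES = 1024  # mirrors the constant in the diff above


class BoundedCache:
    """Tiny LRU cache: build the derived object once per key, and evict the
    least recently used entry once the size cap is exceeded, so memory
    use stays bounded no matter how many distinct schemas show up."""

    def __init__(self, max_entries=MAX_CACHE_ENTRIES):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get_or_create(self, key, factory):
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        value = factory()  # the expensive construction happens only on a miss
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
        return value


cache = BoundedCache(max_entries=2)
a1 = cache.get_or_create(("int", "string"), lambda: object())
a2 = cache.get_or_create(("int", "string"), lambda: object())
print(a1 is a2)  # True: built once per schema, reused afterwards
```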
Re: [PR] [SPARK-48030][SQL]SPJ: cache rowOrdering and structType for InternalRowComparableWrapper [spark]
advancedxy commented on PR #46265: URL: https://github.com/apache/spark/pull/46265#issuecomment-2081471221

Before applying the changes in this PR, the benchmark code (in this PR) took:
```
[info] OpenJDK 64-Bit Server VM 21.0.2 on Mac OS X 14.4.1
[info] Apple M1
[info] internal row comparable wrapper:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] -----------------------------------------------------------------------------------------------------------
[info] toSet                                     10785         11369        825        0.0      53925.3       1.0X
[info] mergePartitions                           21073         21447        528        0.0     105364.5       0.5X
```
After this PR:
```
[info] OpenJDK 64-Bit Server VM 21.0.2 on Mac OS X 14.4.1
[info] Apple M1
[info] internal row comparable wrapper:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] -----------------------------------------------------------------------------------------------------------
[info] toSet                                        85           103         25        2.4        425.0       1.0X
[info] mergePartitions                             163           167          4        1.2        816.3       0.5X
```
[PR] [SPARK-48030][SQL]SPJ: cache rowOrdering and structType for InternalRowComparableWrapper [spark]
advancedxy opened a new pull request, #46265: URL: https://github.com/apache/spark/pull/46265

### What changes were proposed in this pull request?

Cache rowOrdering and structType for InternalRowComparableWrapper

### Why are the changes needed?

For performance issues

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Added a new benchmark to verify the performance improvement

### Was this patch authored or co-authored using generative AI tooling?

NO
Re: [PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
beliefer commented on code in PR #46263: URL: https://github.com/apache/spark/pull/46263#discussion_r1582095704

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
```diff
@@ -120,34 +132,49 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J
           hasHitSelectiveFilter = hasHitSelectiveFilter || isLikelySelective(condition),
           currentPlan, targetKey)
-      case ExtractEquiJoinKeys(_, lkeys, rkeys, _, _, left, right, _) =>
+      case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) =>
         // Runtime filters use one side of the [[Join]] to build a set of join key values and prune
         // the other side of the [[Join]]. It's also OK to use a superset of the join key values
         // (ignore null values) to do the pruning.
         // We assume other rules have already pushed predicates through join if possible.
         // So the predicate references won't pass on anymore.
         if (left.output.exists(_.semanticEquals(targetKey))) {
-          extract(left, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false,
-            currentPlan = left, targetKey = targetKey).orElse {
-            // We can also extract from the right side if the join keys are transitive.
+          val extractLeft = if (canExtractLeft(joinType)) {
+            extract(left, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false,
+              currentPlan = left, targetKey = targetKey)
+          } else {
+            None
+          }
+          val extractRight = if (canExtractRight(joinType)) {
             lkeys.zip(rkeys).find(_._1.semanticEquals(targetKey)).map(_._2)
               .flatMap { newTargetKey =>
                 extract(right, AttributeSet.empty, hasHitFilter = false,
                   hasHitSelectiveFilter = false, currentPlan = right, targetKey = newTargetKey)
               }
+          } else {
+            None
           }
+          extractLeft.orElse(extractRight)
         } else if (right.output.exists(_.semanticEquals(targetKey))) {
-          extract(right, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false,
-            currentPlan = right, targetKey = targetKey).orElse {
+          val extractRight = if (canExtractRight(joinType)) {
+            extract(right, AttributeSet.empty, hasHitFilter = false, hasHitSelectiveFilter = false,
+              currentPlan = right, targetKey = targetKey)
+          } else {
+            None
+          }
+          val extractLeft = if (canExtractLeft(joinType)) {
             // We can also extract from the left side if the join keys are transitive.
             rkeys.zip(lkeys).find(_._1.semanticEquals(targetKey)).map(_._2)
               .flatMap { newTargetKey =>
                 extract(left, AttributeSet.empty, hasHitFilter = false,
                   hasHitSelectiveFilter = false, currentPlan = left, targetKey = newTargetKey)
               }
+          } else {
+            None
           }
+          extractRight.orElse(extractLeft)
```

Review Comment: ditto

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
```diff
@@ -86,6 +87,17 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J
   private def extractSelectiveFilterOverScan(
       plan: LogicalPlan,
       filterCreationSideKey: Expression): Option[(Expression, LogicalPlan)] = {
+
+    def canExtractRight(joinType: JoinType): Boolean = joinType match {
+      case Inner | LeftSemi | RightOuter => true
+      case _ => false
+    }
+
+    def canExtractLeft(joinType: JoinType): Boolean = joinType match {
+      case Inner | LeftSemi | LeftOuter | LeftAnti => true
+      case _ => false
+    }
```

Review Comment: Shall we add the two methods into `joins.scala`?

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
```diff
@@ -120,34 +132,49 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J
           hasHitSelectiveFilter = hasHitSelectiveFilter || isLikelySelective(condition),
           currentPlan, targetKey)
-      case ExtractEquiJoinKeys(_, lkeys, rkeys, _, _, left, right, _) =>
+      case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) =>
         // Runtime filters use one side of the [[Join]] to build a set of join key values and prune
         // the other side of the [[Join]]. It's also OK to use a superset of the join key values
         // (ignore null values) to do the pruning.
         // We assume other rules have already pushed predicates through join if possible.
         // So the predicate references won't pass on anymore.
         if (left.output.exists(_.semanticEquals(targetKey))) {
-          extract(left,
```
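The `canExtractLeft` / `canExtractRight` predicates quoted in the diff above can be transcribed directly. Here is a plain-Python rendering (the function names mirror the Scala; the string join-type tags are a stand-in for Spark's `JoinType` case objects): a runtime filter may only be extracted from a join child whose join type does not pad or discard that side's matching rows.

```python
def can_extract_right(join_type: str) -> bool:
    # Mirrors: case Inner | LeftSemi | RightOuter => true
    return join_type in ("Inner", "LeftSemi", "RightOuter")


def can_extract_left(join_type: str) -> bool:
    # Mirrors: case Inner | LeftSemi | LeftOuter | LeftAnti => true
    return join_type in ("Inner", "LeftSemi", "LeftOuter", "LeftAnti")


# A LeftOuter join keeps every left row, so the left side is safe to extract
# from, but the right side may gain unmatched (null-padded) rows:
print(can_extract_left("LeftOuter"), can_extract_right("LeftOuter"))  # True False
```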
[PR] [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework [spark]
panbingkun opened a new pull request, #46264: URL: https://github.com/apache/spark/pull/46264

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[PR] [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates [spark]
beliefer opened a new pull request, #42223: URL: https://github.com/apache/spark/pull/42223 ### What changes were proposed in this pull request? Some queries contain multiple scalar subqueries (aggregates without a GROUP BY clause) connected with joins. The general form of joined aggregates that can be merged is as follows:
```
<agg-func>          ::= SUM | AVG | MAX | ...
<aggregate>         ::= SELECT <agg-func>(...)[ , <agg-func>(...)[ , ...]] FROM [tab | query]
<joined-aggregates> ::= SELECT * FROM (
  <aggregate> [INNER | CROSS | LEFT | RIGHT | FULL OUTER] JOIN
  <aggregate> [INNER | CROSS | LEFT | RIGHT | FULL OUTER] JOIN
  ...
  [INNER | CROSS | LEFT | RIGHT | FULL OUTER] JOIN <aggregate> )
```
For example,
```
SELECT *
FROM (SELECT avg(power) avg_power,
             count(power) count_power,
             count(DISTINCT power) count_distinct_power
      FROM data
      WHERE country = "USA"
        AND (id BETWEEN 1 AND 3 OR city = "Berkeley" OR name = "Xiao")) B1,
     (SELECT avg(power) avg_power,
             count(power) count_power,
             count(DISTINCT power) count_distinct_power
      FROM data
      WHERE country = "China"
        AND (id BETWEEN 4 AND 5 OR city = "Hangzhou" OR name = "Wenchen")) B2
```
We can optimize this SQL to the form shown below:
```
SELECT avg(power) FILTER (WHERE country = "USA" AND (id BETWEEN 1 AND 3 OR city = "Berkeley" OR name = "Xiao")) avg_power,
       count(power) FILTER (WHERE country = "USA" AND (id BETWEEN 1 AND 3 OR city = "Berkeley" OR name = "Xiao")) count_power,
       count(DISTINCT power) FILTER (WHERE country = "USA" AND (id BETWEEN 1 AND 3 OR city = "Berkeley" OR name = "Xiao")),
       avg(power) FILTER (WHERE country = "China" AND (id BETWEEN 4 AND 5 OR city = "Hangzhou" OR name = "Wenchen")),
       count(power) FILTER (WHERE country = "China" AND (id BETWEEN 4 AND 5 OR city = "Hangzhou" OR name = "Wenchen")),
       count(DISTINCT power) FILTER (WHERE country = "China" AND (id BETWEEN 4 AND 5 OR city = "Hangzhou" OR name = "Wenchen"))
FROM data
WHERE (country = "USA" AND (id BETWEEN 1 AND 3 OR city = "Berkeley" OR name = "Xiao"))
   OR (country = "China" AND (id BETWEEN 4 AND 5 OR city = "Hangzhou" OR name = "Wenchen"))
```
If we can merge the filters and aggregates, we can scan the data source only once and eliminate the join, avoiding the shuffle. This PR also supports eliminating nested joins, please refer to: https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q28.sql Obviously, this change will improve the performance. This PR also reuses some functions from `MergeScalarSubqueries`, and adds some `TreePattern`s to make it easy to check the cost of predicates. ### Why are the changes needed? Improve the performance for the case shown above. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? 1. new test cases 2. new micro benchmark.
```
Benchmark CombineJoinedAggregates:                     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-------------------------------------------------------------------------------------------------------------------------------
filter is not defined, CombineJoinedAggregates: false            730           819         69       28.7         34.8      1.0X
filter is not defined, CombineJoinedAggregates: true             618           632         14       33.9         29.5      1.2X
step is 100, CombineJoinedAggregates: false                      572           590         20       36.7         27.3      1.3X
step is 100, CombineJoinedAggregates: true                       769           794         21       27.3         36.6      1.0X
step is 10, CombineJoinedAggregates: false                       350           370         26       59.9         16.7      2.1X
step is 10, CombineJoinedAggregates: true                        231           241         10       90.7         11.0      3.2X
step is 1, CombineJoinedAggregates: false                        314           340         26       66.8         15.0      2.3X
step is 1, CombineJoinedAggregates: true                         171           182          9      122.5          8.2      4.3X
step is 1000, CombineJoinedAggregates: false                     303           337         32       69.3         14.4      2.4X
step is 1000, CombineJoinedAggregates: true                      162           171          9      129.4          7.7      4.5X
step is 100, CombineJoinedAggregates: false                      300           316         27       70.0         14.3      2.4X
step is 100, CombineJoinedAggregates: true                       160           169          9      131.3          7.6      4.6X
step is 10,
```
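Conceptually, the rewrite above is ordinary conditional aggregation: each subquery's predicate becomes a FILTER on a single shared scan, so the join of scalar aggregates disappears. A minimal Python sketch of that idea — the data and names here are illustrative only, not Spark's implementation:

```python
# Instead of scanning `rows` once per branch predicate (as a join of scalar
# subqueries would), scan once and route each row into every aggregate whose
# predicate it satisfies -- the role of SQL's FILTER (WHERE ...) clause.

rows = [
    {"country": "USA", "power": 10},
    {"country": "USA", "power": 30},
    {"country": "China", "power": 50},
]

predicates = {
    "usa": lambda r: r["country"] == "USA",
    "china": lambda r: r["country"] == "China",
}

sums = {name: 0 for name in predicates}
counts = {name: 0 for name in predicates}

for r in rows:  # single scan: no join, no shuffle between branches
    for name, pred in predicates.items():
        if pred(r):  # conditional aggregation per branch
            sums[name] += r["power"]
            counts[name] += 1

avgs = {name: sums[name] / counts[name] for name in predicates}
print(avgs)  # {'usa': 20.0, 'china': 50.0}
```

The same answer would come from joining two separately filtered aggregate queries; merging them trades two scans plus a join for one scan with per-aggregate predicates.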
[PR] [SPARK-48027][SQL] InjectRuntimeFilter for multi-level join should check child join type [spark]
AngersZh opened a new pull request, #46263: URL: https://github.com/apache/spark/pull/46263 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on PR #46261: URL: https://github.com/apache/spark/pull/46261#issuecomment-2081416283 Merged to master
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn closed pull request #46261: [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark URL: https://github.com/apache/spark/pull/46261
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on PR #46261: URL: https://github.com/apache/spark/pull/46261#issuecomment-2081402895 Thank you @HyukjinKwon. I have dispatched a CI task to audit, https://github.com/yaooqinn/spark/actions/runs/8866497258
Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]
cloud-fan commented on PR #46188: URL: https://github.com/apache/spark/pull/46188#issuecomment-2081399740 is the end goal to automatically update table statistics?
Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]
cloud-fan commented on code in PR #46188: URL: https://github.com/apache/spark/pull/46188#discussion_r1582063564 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala: ##
```
@@ -223,6 +278,9 @@ class BasicWriteJobStatsTracker(
     val executionId = sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
     SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, driverSideMetrics.values.toList)
+
+    SQLPartitionMetrics.postDriverMetricUpdates(sparkContext, executionId,
```
Review Comment: So these new partition metrics are not shown in the Spark UI, but posted as a listener event that third-party libraries can consume?
Re: [PR] [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Remove `pyspark.sql.classic` reference in `pyspark.ml.stat` [spark]
HyukjinKwon commented on PR #46262: URL: https://github.com/apache/spark/pull/46262#issuecomment-2081397805 cc @zhengruifeng
Re: [PR] [SPARK-48024][PYTHON][CONNECT][TESTS] Enable `UDFParityTests.test_udf_timestamp_ntz` [spark]
HyukjinKwon closed pull request #46257: [SPARK-48024][PYTHON][CONNECT][TESTS] Enable `UDFParityTests.test_udf_timestamp_ntz` URL: https://github.com/apache/spark/pull/46257
Re: [PR] [SPARK-48024][PYTHON][CONNECT][TESTS] Enable `UDFParityTests.test_udf_timestamp_ntz` [spark]
HyukjinKwon commented on PR #46257: URL: https://github.com/apache/spark/pull/46257#issuecomment-2081397408 Merged to master.
[PR] [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Remove `pyspark.sql.classic` reference in `pyspark.ml.stat` [spark]
HyukjinKwon opened a new pull request, #46262: URL: https://github.com/apache/spark/pull/46262 ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/46155 that removes the reference of `_to_seq` that the `pyspark-connect` package does not have. ### Why are the changes needed? To recover the CI https://github.com/apache/spark/actions/runs/8821919392/job/24218893631 ### Does this PR introduce _any_ user-facing change? No, the main change has not been released out yet. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No.
Re: [PR] [SPARK-48002][PYTHON][SS] Add test for observed metrics in PySpark StreamingQueryListener [spark]
HyukjinKwon closed pull request #46237: [SPARK-48002][PYTHON][SS] Add test for observed metrics in PySpark StreamingQueryListener URL: https://github.com/apache/spark/pull/46237
Re: [PR] [SPARK-48002][PYTHON][SS] Add test for observed metrics in PySpark StreamingQueryListener [spark]
HyukjinKwon commented on PR #46237: URL: https://github.com/apache/spark/pull/46237#issuecomment-2081395666 Merged to master.
Re: [PR] [SPARK-48021][ML][BUILD][FOLLOWUP] add `--add-modules=jdk.incubator.vector` to maven compile args [spark]
LuciferYang commented on PR #46259: URL: https://github.com/apache/spark/pull/46259#issuecomment-2081389478 Merged into master for Spark 4.0. Thanks @panbingkun and @yaooqinn
Re: [PR] [SPARK-48021][ML][BUILD][FOLLOWUP] add `--add-modules=jdk.incubator.vector` to maven compile args [spark]
LuciferYang closed pull request #46259: [SPARK-48021][ML][BUILD][FOLLOWUP] add `--add-modules=jdk.incubator.vector` to maven compile args URL: https://github.com/apache/spark/pull/46259
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on code in PR #46261: URL: https://github.com/apache/spark/pull/46261#discussion_r1582051605 ## sql/core/benchmarks/DateTimeBenchmark-jdk21-results.txt: ##
```
@@ -2,460 +2,460 @@ datetime +/- interval
-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
 AMD EPYC 7763 64-Core Processor
 datetime +/- interval:            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-date + interval(m)                          850           887         33       11.8         85.0      1.0X
-date + interval(m, d)                       863           864          2       11.6         86.3      1.0X
-date + interval(m, d, ms)                  3507          3511          5        2.9        350.7      0.2X
-date - interval(m)                          841           851          9       11.9         84.1      1.0X
-date - interval(m, d)                       864           870          5       11.6         86.4      1.0X
-date - interval(m, d, ms)                  3518          3519          2        2.8        351.8      0.2X
-timestamp + interval(m)                    1756          1759          5        5.7        175.6      0.5X
-timestamp + interval(m, d)                 1802          1805          4        5.5        180.2      0.5X
-timestamp + interval(m, d, ms)             1958          1961          4        5.1        195.8      0.4X
-timestamp - interval(m)                    1744          1745          2        5.7        174.4      0.5X
-timestamp - interval(m, d)                 1796          1799          4        5.6        179.6      0.5X
-timestamp - interval(m, d, ms)             1944          1947          5        5.1        194.4      0.4X
+date + interval(m)                         1149          1158         12        8.7        114.9      1.0X
+date + interval(m, d)                      1136          1137          1        8.8        113.6      1.0X
+date + interval(m, d, ms)                  3779          3799         29        2.6        377.9      0.3X
+date - interval(m)                         1113          1116          4        9.0        111.3      1.0X
+date - interval(m, d)                      1124          1141         25        8.9        112.4      1.0X
+date - interval(m, d, ms)                  3795          3796          1        2.6        379.5      0.3X
+timestamp + interval(m)                    1528          1530          3        6.5        152.8      0.8X
+timestamp + interval(m, d)                 1581          1585          6        6.3        158.1      0.7X
+timestamp + interval(m, d, ms)             2037          2044         10        4.9        203.7      0.6X
+timestamp - interval(m)                    1786          1790          6        5.6        178.6      0.6X
+timestamp - interval(m, d)                 1865          1872         10        5.4        186.5      0.6X
+timestamp - interval(m, d, ms)             2038          2054         23        4.9        203.8      0.6X

 Extract components
-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
 AMD EPYC 7763 64-Core Processor
 cast to timestamp:                Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-cast to timestamp wholestage off            209           209          0       47.9         20.9      1.0X
-cast to timestamp wholestage on             209           225         15       47.8         20.9      1.0X
+cast to timestamp wholestage off            192           198          9       52.2         19.2      1.0X
+cast to timestamp wholestage on             206           213          6       48.5         20.6      0.9X

-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
 AMD EPYC 7763 64-Core Processor
 year
```
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on code in PR #46261: URL: https://github.com/apache/spark/pull/46261#discussion_r1582049065 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeBenchmark.scala: ##
```
@@ -161,7 +167,7 @@ object DateTimeBenchmark extends SqlBasedBenchmark {
     }
     val dateExpr = "cast(timestamp_seconds(id) as date)"
     Seq("year", "", "yy", "mon", "month", "mm").foreach { level =>
-      run(N, s"trunc $level", s"trunc('$level', $dateExpr)")
+      run(N, s"trunc $level", s"trunc($dateExpr, '$level')")
```
Review Comment: the parameter order is wrong
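For reference, Spark SQL's `trunc` takes the date as the first argument and the format as the second, which is what the corrected call above does. A hedged Python sketch of the truncation semantics being benchmarked (covering only the year/month formats used here; unrecognized formats yield NULL in Spark, modeled as `None`):

```python
from datetime import date

def trunc(d, fmt):
    """Sketch of Spark SQL trunc(date, fmt): truncate d to the unit named by fmt."""
    fmt = fmt.lower()
    if fmt in ("year", "yyyy", "yy"):
        return date(d.year, 1, 1)       # truncate to the first day of the year
    if fmt in ("month", "mon", "mm"):
        return date(d.year, d.month, 1)  # truncate to the first day of the month
    return None  # Spark returns NULL for formats it does not recognize

print(trunc(date(2019, 8, 17), "year"))  # 2019-01-01
print(trunc(date(2019, 8, 17), "mm"))    # 2019-08-01
```

Swapping the arguments, as the old benchmark code did, makes every call hit the NULL path instead of exercising the truncation logic, which is why the parameter order mattered for the benchmark.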
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on code in PR #46261: URL: https://github.com/apache/spark/pull/46261#discussion_r1582049184 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeBenchmark.scala: ##
```
@@ -171,7 +177,7 @@ object DateTimeBenchmark extends SqlBasedBenchmark {
     run(n, "to timestamp str", timestampStrExpr)
     run(n, "to_timestamp", s"to_timestamp($timestampStrExpr, $pattern)")
     run(n, "to_unix_timestamp", s"to_unix_timestamp($timestampStrExpr, $pattern)")
-    val dateStrExpr = "concat('2019-01-', lpad(mod(id, 25), 2, '0'))"
+    val dateStrExpr = "concat('2019-01-', lpad(mod(id, 25) + 1, 2, '0'))"
```
Review Comment: avoid invalid date value `2019-01-00`
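The off-by-one matters because `mod(id, 25)` yields 0 through 24, so the generated strings include the invalid day `00`; adding 1 shifts the range to 1 through 25. A quick Python check of the two expressions (illustrative only, mirroring the SQL `concat`/`lpad`/`mod` combination):

```python
def date_str(id_, fixed=True):
    # Mirrors concat('2019-01-', lpad(mod(id, 25) [+ 1], 2, '0'))
    day = id_ % 25 + (1 if fixed else 0)
    return "2019-01-" + str(day).zfill(2)

print(date_str(0, fixed=False))  # 2019-01-00 -> rejected under ANSI date parsing
print(date_str(0, fixed=True))   # 2019-01-01
print(date_str(24, fixed=True))  # 2019-01-25
```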
Re: [PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn commented on code in PR #46261: URL: https://github.com/apache/spark/pull/46261#discussion_r1582048982 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeBenchmark.scala: ##
```
@@ -75,7 +81,7 @@ object DateTimeBenchmark extends SqlBasedBenchmark {
     doBenchmark(N, s"$dt + interval 1 month 2 day")
   }
   benchmark.addCase("date + interval(m, d, ms)") { _ =>
-    doBenchmark(N, s"$dt + interval 1 month 2 day 5 hour")
+    doBenchmarkAnsiOff(N, s"$dt + interval 1 month 2 day 5 hour")
```
Review Comment: the hour portion is illegal for ANSI date add
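The point of the comment above is that under ANSI mode, adding an interval with a time-of-day portion (here the `5 hour`) to a DATE is rejected, so that case must run with ANSI off. A hedged Python sketch of that validity check — the function name and month arithmetic are illustrative, not Spark's implementation, and day-overflow clamping is deliberately omitted:

```python
from datetime import date, timedelta

def ansi_date_add(d, months=0, days=0, hours=0):
    """Sketch of ANSI-style DATE + interval: a DATE accepts year-month and
    day intervals, but an hour (sub-day) portion is rejected."""
    if hours != 0:
        raise ValueError("cannot add an interval with a time part to a DATE under ANSI mode")
    month_index = d.month - 1 + months          # 0-based month arithmetic
    shifted = date(d.year + month_index // 12, month_index % 12 + 1, d.day)
    return shifted + timedelta(days=days)

print(ansi_date_add(date(2019, 1, 15), months=1, days=2))  # 2019-02-17
# ansi_date_add(date(2019, 1, 15), hours=5) would raise ValueError
```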
[PR] [SPARK-48025][SQL][TESTS] Fix org.apache.spark.sql.execution.benchmark.DateTimeBenchmark [spark]
yaooqinn opened a new pull request, #46261: URL: https://github.com/apache/spark/pull/46261 ### What changes were proposed in this pull request? This PR fixes several issues in org.apache.spark.sql.execution.benchmark.DateTimeBenchmark:
- Misuse of the `trunc` function: the two parameters were passed in reverse order
- Some benchmarks are not compatible with ANSI mode, which is on by default

### Why are the changes needed? Restore the benchmark cases. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? benchmark ### Was this patch authored or co-authored using generative AI tooling? no