Re: [PR] [SPARK-48134][CORE] Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46390: URL: https://github.com/apache/spark/pull/46390#discussion_r1591851400 ## core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java: ## @@ -17,21 +17,22 @@ package org.apache.spark.util.collection.unsafe.sort; +import java.io.*; Review Comment: let's revert the unnecessary changes in this file -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48049][BUILD] Upgrade Scala to 2.13.14 [spark]
LuciferYang commented on PR #46288: URL: https://github.com/apache/spark/pull/46288#issuecomment-2097501153 I think we need to wait for a new Ammonite release, as 3.0-M1 does not support Scala 2.13.14. ![image](https://github.com/apache/spark/assets/1475305/675a2da1-7c7b-476b-88e0-31d9086f6701)
[PR] [SPARK-48090][SS][PYTHON][TESTS] Shorten the traceback in the test checking error message in UDF [spark]
HyukjinKwon opened a new pull request, #46426: URL: https://github.com/apache/spark/pull/46426

### What changes were proposed in this pull request?
This PR shortens the traceback so that the actual error, `ZeroDivisionError`, can be tested in `pyspark.sql.tests.connect.streaming.test_parity_streaming.StreamingParityTests.test_stream_exception`.

### Why are the changes needed?
So that a long traceback doesn't affect the test case. It can fail as below:

```
== FAIL [1.883s]: test_stream_exception (pyspark.sql.tests.connect.streaming.test_parity_streaming.StreamingParityTests.test_stream_exception)
--
Traceback (most recent call last):
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 287, in test_stream_exception
    sq.processAllAvailable()
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py", line 129, in processAllAvailable
    self._execute_streaming_query_cmd(cmd)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py", line 177, in _execute_streaming_query_cmd
    (_, properties) = self._session.client.execute_command(exec_cmd)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 982, in execute_command
    data, _, _, _, properties = self._execute_and_fetch(req)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1283, in _execute_and_fetch
    for response in self._execute_and_fetch_as_iterator(req):
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1264, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1503, in _handle_error
    self._handle_rpc_error(error)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1539, in _handle_rpc_error
    raise convert_exception(info, status.message) from None
pyspark.errors.exceptions.connect.StreamingQueryException: [STREAM_FAILED] Query [id = 1c0c440d-0b48-41b1-9a03-071e7e13de82, runId = 692ec338-963a-43b1-89cb-2a8b7cb1e21a] terminated with exception: Job aborted due to stage failure: Task 0 in stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 (TID 58) (fv-az714-234.22nzjvkrszmuhkvqy55p1tioig.phxx.internal.cloudapp.net executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1834, in main
    process()
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1826, in process
    serializer.dump_stream(out_iter, outfile)
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 224, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 145, in dump_stream
    for obj in iterator:
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 213, in _batched
    for item in iterator:
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1734, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for arg_offsets, f in udfs)
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1734, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for arg_offsets, f in udfs)
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 112, in <lambda>
    return args_kwargs_offsets, lambda *a: func(*a)
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 134, in wrapper
    return f(*args, **kwargs)
  File "/home/runner/work/spark/spark-3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 291, in test_stream_exception
    self._assert_exception_tree_contains_msg(e, "ZeroDivisionError")
  File
```
Re: [PR] [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]
dongjoon-hyun commented on PR #46273: URL: https://github.com/apache/spark/pull/46273#issuecomment-2097474800 > Looks like this missed the 3.4 release by a month @dongjoon-hyun ... might have been a nice addition to it ! Ya, I agree and this is still a good addition to 4.0.0-preview. Cc @cloud-fan as the release manager of 4.0.0-preview
Re: [PR] [SPARK-48131][Core] Unify MDC key `mdc.taskName` and `task_name` [spark]
gengliangwang commented on PR #46386: URL: https://github.com/apache/spark/pull/46386#issuecomment-2097446465 @mridulm Since the task name MDC shows up frequently in the logs, I would say this is necessary for the new logging framework. After the renaming, the MDC names are consistent and simpler to use. Otherwise, it will be `executor_id`/`worker_id`/`task_id`/etc. alongside an odd `mdc.taskName`.
Re: [PR] [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]
mridulm commented on PR #46273: URL: https://github.com/apache/spark/pull/46273#issuecomment-2097438661 Looks like this missed the 3.4 release @dongjoon-hyun ... might have been a nice addition to it !
Re: [PR] [SPARK-43861][CORE] Do not delete inprogress log [spark]
mridulm commented on PR #46025: URL: https://github.com/apache/spark/pull/46025#issuecomment-2097435499 `reader.completed` is checking for the `IN_PROGRESS` suffix - which will be the case here @bluzy: so with this PR, those files will never be cleaned up. (Some users/deployments have out-of-band workflows that do the `move` to remove the `IN_PROGRESS` suffix - but those are not part of Apache Spark itself)
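The suffix-based completion check described above can be sketched as follows. This is a hypothetical Python analogue with invented names; Spark's actual logic is Scala, in the history server's `FsHistoryProvider`:

```python
# Hypothetical sketch of the suffix-based completion check discussed above.
IN_PROGRESS_SUFFIX = ".inprogress"


def is_completed(log_path: str) -> bool:
    # An event log counts as completed once the in-progress suffix is gone
    # (e.g. via an out-of-band rename, which Spark itself does not perform).
    return not log_path.endswith(IN_PROGRESS_SUFFIX)


def logs_eligible_for_cleanup(log_paths):
    # With the PR's behavior, in-progress logs are never selected for cleanup.
    return [p for p in log_paths if is_completed(p)]
```

This illustrates the reviewer's point: if nothing ever removes the suffix, an in-progress log stays ineligible for cleanup forever.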
Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]
dongjoon-hyun commented on PR #46425: URL: https://github.com/apache/spark/pull/46425#issuecomment-2097432716 Merged to master for Apache Spark 4.0.0-preview.
Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]
dongjoon-hyun closed pull request #46425: [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` URL: https://github.com/apache/spark/pull/46425
Re: [PR] [SPARK-48131][Core] Unify MDC key `mdc.taskName` and `task_name` [spark]
mridulm commented on PR #46386: URL: https://github.com/apache/spark/pull/46386#issuecomment-2097430754 The question is whether this change is strictly necessary ... if it is, we can evaluate it in that context. If it is not and is just a nice-to-have, it is better not to make breaking changes that can impact users.
Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]
dongjoon-hyun commented on PR #46425: URL: https://github.com/apache/spark/pull/46425#issuecomment-2097420985 Thank you, @yaooqinn .
Re: [PR] [MINOR][CONNECT][TESTS] Improve test failure error message in StreamingParityTests [spark]
HyukjinKwon closed pull request #46420: [MINOR][CONNECT][TESTS] Improve test failure error message in StreamingParityTests URL: https://github.com/apache/spark/pull/46420
Re: [PR] [MINOR][CONNECT][TESTS] Improve test failure error message in StreamingParityTests [spark]
HyukjinKwon commented on PR #46420: URL: https://github.com/apache/spark/pull/46420#issuecomment-2097420760 Merged to master.
Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]
dongjoon-hyun commented on PR #46425: URL: https://github.com/apache/spark/pull/46425#issuecomment-2097417035 Thank you, @HyukjinKwon . I believe this will give proper and better visibility to this test case issue before `Connect GA`.
[PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]
dongjoon-hyun opened a new pull request, #46425: URL: https://github.com/apache/spark/pull/46425 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` [spark]
zhengruifeng commented on PR #46418: URL: https://github.com/apache/spark/pull/46418#issuecomment-2097401736 thanks @dongjoon-hyun and @HyukjinKwon
Re: [PR] [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` [spark]
zhengruifeng commented on PR #46418: URL: https://github.com/apache/spark/pull/46418#issuecomment-2097401503 merged to master
Re: [PR] [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` [spark]
zhengruifeng closed pull request #46418: [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` URL: https://github.com/apache/spark/pull/46418
[PR] [WIP][SQL] Add collation support for variant expressions [spark]
uros-db opened a new pull request, #46424: URL: https://github.com/apache/spark/pull/46424 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
[PR] [WIP][SQL] Format expressions [spark]
uros-db opened a new pull request, #46423: URL: https://github.com/apache/spark/pull/46423 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
[PR] [WIP][SQL] Add collation support for hash expressions [spark]
uros-db opened a new pull request, #46422: URL: https://github.com/apache/spark/pull/46422 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
[PR] [WIP][SQL] Eliminate unnecessary COLLATE expressions in query analysis [spark]
uros-db opened a new pull request, #46421: URL: https://github.com/apache/spark/pull/46421 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-48083][SPARK-48084][ML][TESTS] Remove JIRA comments for reenabling ML compatibility tests [spark]
WeichenXu123 commented on code in PR #46419: URL: https://github.com/apache/spark/pull/46419#discussion_r1591778164 ## python/pyspark/ml/tests/connect/test_connect_evaluation.py: ## @@ -20,7 +20,6 @@ from pyspark.sql import SparkSession from pyspark.ml.tests.connect.test_legacy_mode_evaluation import EvaluationTestsMixin -# TODO(SPARK-48084): Reenable this test case Review Comment: This test error `pyspark.ml.connect.evaluation not working in 3.5 client <> 4.0 server` is caused by `cloudpickle` forward incompatibility; it is not related to ML code
Re: [PR] [SPARK-48083][SPARK-48084][ML][TESTS] Remove JIRA comments for reenabling ML compatibility tests [spark]
WeichenXu123 commented on code in PR #46419: URL: https://github.com/apache/spark/pull/46419#discussion_r1591777525 ## python/pyspark/ml/tests/connect/test_connect_classification.py: ## @@ -21,7 +21,6 @@ from pyspark.sql import SparkSession from pyspark.ml.tests.connect.test_legacy_mode_classification import ClassificationTestsMixin -# TODO(SPARK-48083): Reenable this test case Review Comment: `session.copyFromLocalToFs failure with 3.5 client <> 4.0 server` is not an issue: `copyFromLocalToFs` requires the Spark server to config `spark.connect.copyFromLocalToFs.allowDestLocal` to False, because the test can only use the local fs.
[PR] [MINOR][CONNECT][TESTS] Improve test failure error message in StreamingParityTests [spark]
HyukjinKwon opened a new pull request, #46420: URL: https://github.com/apache/spark/pull/46420

### What changes were proposed in this pull request?
This PR improves the test failure error message in `StreamingParityTests`.

### Why are the changes needed?
To see the original exception, and to make it easier to debug. Now it's difficult, see https://github.com/HyukjinKwon/spark/actions/runs/8978991632/job/24660689666#step:9:206

```
Traceback (most recent call last):
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 291, in test_stream_exception
    self._assert_exception_tree_contains_msg(e, "ZeroDivisionError")
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 300, in _assert_exception_tree_contains_msg
    self._assert_exception_tree_contains_msg_connect(exception, msg)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 305, in _assert_exception_tree_contains_msg_connect
    self.assertTrue(
AssertionError: False is not true : Exception tree doesn't contain the expected message: ZeroDivisionError
```

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Manually.

### Was this patch authored or co-authored using generative AI tooling?
No.
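Conceptually, the `_assert_exception_tree_contains_msg` check walks an exception's cause chain looking for an expected message. A simplified, hypothetical analogue (the real PySpark helper also inspects Spark Connect error metadata, and the function name here is invented):

```python
def exception_tree_contains(exc, msg):
    """Walk the exception's cause/context chain for a message substring."""
    while exc is not None:
        if msg in str(exc):
            return True
        # Follow explicit chaining (raise ... from ...) first, then implicit.
        exc = exc.__cause__ or exc.__context__
    return False
```

When the chain is truncated (e.g. by an overly long traceback being cut off before the root cause), this kind of search fails, which is the failure mode the PR's better error message helps diagnose.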
Re: [PR] [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser [spark]
gene-db commented on code in PR #46400: URL: https://github.com/apache/spark/pull/46400#discussion_r1591776687 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala: ## @@ -67,16 +67,23 @@ case class PartialResultArrayException( extends Exception(cause) /** - * Exception thrown when the underlying parser meet a bad record and can't parse it. + * Exception thrown when the underlying parser meets a bad record and can't parse it. Used for + * control flow between wrapper and underlying parser without overhead of creating a full exception. * @param record a function to return the record that cause the parser to fail * @param partialResults a function that returns an row array, which is the partial results of * parsing this bad record. - * @param cause the actual exception about why the record is bad and can't be parsed. + * @param cause the actual exception about why the record is bad and can't be parsed. It's better + * to use lightweight exception here (e.g. without stacktrace), and throw an + * adequate (user-facing) exception only in case it's necessary after + * unwrapping `BadRecordException` */ case class BadRecordException( @transient record: () => UTF8String, @transient partialResults: () => Array[InternalRow] = () => Array.empty[InternalRow], -cause: Throwable) extends Exception(cause) { Review Comment: Yes, we can change `parse_json` exception behavior. What is the desired behavior? cc @chenhao-db
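The scaladoc above describes using a lightweight exception purely for control flow between the parser and its wrapper, unwrapping it to a user-facing error only when needed. A rough, hypothetical Python analogue of that pattern (on the JVM the "lightweight" part comes from suppressing stack-trace capture, e.g. overriding `fillInStackTrace`, which has no direct Python equivalent):

```python
class BadRecordException(Exception):
    """Control-flow exception carrying the bad record and the real cause.

    Hypothetical analogue of the Scala class; names are invented.
    """
    def __init__(self, record, cause):
        super().__init__(str(cause))
        self.record = record
        self.cause = cause


def failure_safe_parse(raw, parser, mode="PERMISSIVE"):
    # Mirrors FailureSafeParser's role: unwrap the control-flow exception and
    # either surface the user-facing cause (FAILFAST) or drop the bad record.
    try:
        return parser(raw)
    except BadRecordException as e:
        if mode == "FAILFAST":
            raise e.cause from None
        return None
```

The wrapper is the only place the exception is interpreted, so the hot parsing path never pays for building a full user-facing error.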
Re: [PR] [SPARK-48141][TEST] Update the Oracle docker image version used for test and integration to use Oracle Database 23ai Free [spark]
dongjoon-hyun closed pull request #46399: [SPARK-48141][TEST] Update the Oracle docker image version used for test and integration to use Oracle Database 23ai Free URL: https://github.com/apache/spark/pull/46399
Re: [PR] [SPARK-48083][SPARK-48084][ML][TESTS] Remove JIRAs for reenabling ML compatibility tests [spark]
HyukjinKwon commented on PR #46419: URL: https://github.com/apache/spark/pull/46419#issuecomment-2097387986 Would you mind describing why those skips are legitimate?
Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]
zhengruifeng commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1591773238 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -283,6 +283,16 @@ class Dataset[T] private[sql] ( def printSchema(level: Int): Unit = println(schema.treeString(level)) // scalastyle:on println + /** + * Returns an approximate size in bytes of the Dataset. + * + * @group basic + * @since 4.0.0 + */ + def sizeInBytesApproximation(): Long = { Review Comment: It seems the size may exceed `Long.MaxValue`: https://github.com/apache/spark/blob/111529e0dea68bf5343cc6aabba53b59e5d21830/core/src/main/scala/org/apache/spark/util/Utils.scala#L1147-L1159
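The concern above is that a `Long` return type can overflow when summing byte sizes. One way to guard such an accumulator (a sketch under that assumption, not Spark's actual code; the linked Scala `Utils` helper instead widens to `BigInt`):

```python
LONG_MAX = (1 << 63) - 1  # JVM Long.MaxValue

def total_size_clamped(partition_sizes):
    """Sum byte sizes, clamping at Long.MaxValue instead of overflowing.

    Hypothetical sketch: Python ints are arbitrary precision, so the sum
    itself is exact; the clamp models the Long-typed API boundary.
    """
    total = 0
    for size in partition_sizes:
        total += size
        if total >= LONG_MAX:
            return LONG_MAX
    return total
```

Returning a saturated value is one design option; widening the API's return type (as the reviewer's link suggests) is the other.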
Re: [PR] [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` [spark]
dongjoon-hyun commented on code in PR #46418: URL: https://github.com/apache/spark/pull/46418#discussion_r1591773149 ## python/pyspark/sql/tests/connect/test_parity_pandas_udf_grouped_agg.py: ## @@ -20,10 +20,11 @@ from pyspark.testing.connectutils import ReusedConnectTestCase -class PandasUDFGroupedAggParityTests(GroupedAggPandasUDFTestsMixin, ReusedConnectTestCase): -@unittest.skip("Spark Connect does not support convert UNPARSED to catalyst types.") Review Comment: Just a question. Which JIRA issue supports this feature?
[PR] [SPARK-48083] [SPARK-48084] Enable spark ml test [spark]
WeichenXu123 opened a new pull request, #46419: URL: https://github.com/apache/spark/pull/46419 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
[PR] [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` [spark]
zhengruifeng opened a new pull request, #46418: URL: https://github.com/apache/spark/pull/46418 ### What changes were proposed in this pull request? Enable `PandasUDFGroupedAggParityTests.test_manual` ### Why are the changes needed? for test coverage ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no
Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]
HeartSaVioR commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1591767311 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala: ## @@ -347,6 +347,28 @@ class IncrementalExecution( eventTimeWatermarkForEviction = inputWatermarkForEviction(m.stateInfo.get) ) + // UpdateEventTimeColumnExec is used to tag the eventTime column, and validate + // emitted rows adhere to watermark in the output of transformWithState. + // Hence, this node shares the same watermark value as TransformWithStateExec. + // However, given that UpdateEventTimeColumnExec does not store any state, it + // does not have any StateInfo. We simply use the StateInfo of transformWithStateExec + // to propagate watermark to both UpdateEventTimeColumnExec and transformWithStateExec. + case UpdateEventTimeColumnExec(eventTime, delay, None, +SerializeFromObjectExec(serializer, +t: TransformWithStateExec)) if t.stateInfo.isDefined => + +val stateInfo = t.stateInfo.get +val eventTimeWatermarkForLateEvents = inputWatermarkForLateEvents(stateInfo) +val eventTimeWatermarkForEviction = inputWatermarkForLateEvents(stateInfo) + +UpdateEventTimeColumnExec(eventTime, delay, eventTimeWatermarkForEviction, Review Comment: Note that this would break a test.
Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]
HeartSaVioR commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1591752313 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala: ## @@ -347,6 +347,28 @@ class IncrementalExecution( eventTimeWatermarkForEviction = inputWatermarkForEviction(m.stateInfo.get) ) + // UpdateEventTimeColumnExec is used to tag the eventTime column, and validate + // emitted rows adhere to watermark in the output of transformWithState. + // Hence, this node shares the same watermark value as TransformWithStateExec. + // However, given that UpdateEventTimeColumnExec does not store any state, it + // does not have any StateInfo. We simply use the StateInfo of transformWithStateExec + // to propagate watermark to both UpdateEventTimeColumnExec and transformWithStateExec. + case UpdateEventTimeColumnExec(eventTime, delay, None, +SerializeFromObjectExec(serializer, +t: TransformWithStateExec)) if t.stateInfo.isDefined => + +val stateInfo = t.stateInfo.get +val eventTimeWatermarkForLateEvents = inputWatermarkForLateEvents(stateInfo) +val eventTimeWatermarkForEviction = inputWatermarkForLateEvents(stateInfo) + +UpdateEventTimeColumnExec(eventTime, delay, eventTimeWatermarkForEviction, Review Comment: eventTimeWatermarkForLateEvents? ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala: ## @@ -442,6 +442,16 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { case EventTimeWatermark(columnName, delay, child) => EventTimeWatermarkExec(columnName, delay, planLater(child)) :: Nil + case UpdateEventTimeWatermarkColumn(columnName, delay, child) => +// we expect watermarkDelay to be resolved before physical planning. +if (delay.isEmpty) { + // This is a sanity check. We should not reach here as delay is updated during + // query plan resolution in [[ResolveUpdateEventTimeWatermarkColumn]] Analyzer rule. + throw SparkException.internalError( +"You hit a query analyzer bug. 
Please report your query to Spark user mailing list.") Review Comment: This error message is basically provided for all internal errors. Could we please add one-liner high-level context here if we have? I guess it'd be just OK with stack trace, so 2 cents. ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/EventTimeWatermarkExec.scala: ## @@ -107,25 +109,70 @@ case class EventTimeWatermarkExec( } // Update the metadata on the eventTime column to include the desired delay. - override val output: Seq[Attribute] = child.output.map { a => -if (a semanticEquals eventTime) { - val updatedMetadata = new MetadataBuilder() -.withMetadata(a.metadata) -.putLong(EventTimeWatermark.delayKey, delayMs) -.build() - a.withMetadata(updatedMetadata) -} else if (a.metadata.contains(EventTimeWatermark.delayKey)) { - // Remove existing watermark - val updatedMetadata = new MetadataBuilder() -.withMetadata(a.metadata) -.remove(EventTimeWatermark.delayKey) -.build() - a.withMetadata(updatedMetadata) -} else { - a -} + override val output: Seq[Attribute] = { +val delayMs = EventTimeWatermark.getDelayMs(delay) +updateEventTimeColumn(child.output, delayMs, eventTime) } override protected def withNewChildInternal(newChild: SparkPlan): EventTimeWatermarkExec = copy(child = newChild) } + +/** + * Updates the event time column to [[eventTime]] in the child output. + * Any watermark calculations performed after this node will use the + * updated eventTimeColumn. + * + * This node also ensures that output emitted by the child node adheres + * to watermark. If the child node emits rows which are older than global + * watermark, the node will throw an query execution error and fail the user + * query. 
+ */ +case class UpdateEventTimeColumnExec( +eventTime: Attribute, +delay: CalendarInterval, +eventTimeWatermarkForEviction: Option[Long], +child: SparkPlan) extends UnaryExecNode { + + override protected def doExecute(): RDD[InternalRow] = { +child.execute().mapPartitions[InternalRow] { dataIterator => + val watermarkExpression = WatermarkSupport.watermarkExpression( +Some(eventTime), eventTimeWatermarkForEviction) + + if (watermarkExpression.isEmpty) { +// watermark should always be defined in this node. +throw QueryExecutionErrors.cannotGetEventTimeWatermarkError() + } + + val predicate = Predicate.create(watermarkExpression.get, child.output) + new Iterator[InternalRow] { +override def hasNext: Boolean = dataIterator.hasNext + +override def next(): InternalRow = { + val row = dataIterator.next()
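The quoted `doExecute` wraps the child iterator in a predicate that fails the query when an emitted row's event time is older than the watermark. A rough Python sketch of that validation idea, with illustrative names (not Spark's API):

```python
# Sketch of the UpdateEventTimeColumnExec validation discussed above:
# pass rows through unchanged, but fail if any emitted row is older than
# the watermark used for eviction. Names and row shape are illustrative.

from typing import Iterator, Tuple

Row = Tuple[str, int]  # (key, event_time_ms)

def validate_watermark(rows: Iterator[Row], watermark_ms: int) -> Iterator[Row]:
    for key, event_time_ms in rows:
        if event_time_ms < watermark_ms:
            # mirrors the "emitted rows must adhere to watermark" check
            raise ValueError(
                f"emitted row ({key!r}) has event time {event_time_ms} ms, "
                f"older than watermark {watermark_ms} ms")
        yield key, event_time_ms
```

As in the Scala version, the check is a pass-through filter: the node stores no state, it only tags the event-time column and enforces the invariant.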
Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]
panbingkun commented on code in PR #46402: URL: https://github.com/apache/spark/pull/46402#discussion_r1591763188 ## connector/profiler/pom.xml: ## @@ -44,7 +44,7 @@ me.bechberger ap-loader-all - 3.0-8 + 3.0-9 Review Comment: Okay
Re: [PR] [SPARK-48150][SQL] try_parse_json output should be declared as nullable [spark]
dongjoon-hyun commented on PR #46409: URL: https://github.com/apache/spark/pull/46409#issuecomment-2097363087 Merged to master for Apache Spark 4.0.0-preview. Thank you, @JoshRosen and all.
Re: [PR] [SPARK-48150][SQL] try_parse_json output should be declared as nullable [spark]
dongjoon-hyun closed pull request #46409: [SPARK-48150][SQL] try_parse_json output should be declared as nullable URL: https://github.com/apache/spark/pull/46409
[PR] Make some corrections in the docstring of pyspark DataStreamReader methods [spark]
chloeh13q opened a new pull request, #46416: URL: https://github.com/apache/spark/pull/46416 ### What changes were proposed in this pull request? The docstrings of the pyspark DataStreamReader methods `csv()` and `text()` say that the `path` parameter can be a list, but an error is actually raised when a list is passed. ### Why are the changes needed? Documentation is wrong. ### Does this PR introduce _any_ user-facing change? Yes. Fixes documentation. ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No
[PR] [SPARK-48105][SS][3.5] Fix the race condition between state store unloading and snapshotting [spark]
huanliwang-db opened a new pull request, #46415: URL: https://github.com/apache/spark/pull/46415 * When we close the hdfs state store, we should only remove the entry from `loadedMaps` rather than doing the active data cleanup. JVM GC should be able to help us GC those objects. * we should wait for the maintenance thread to stop before unloading the providers. There are two race conditions between state store snapshotting and state store unloading which could result in query failure and potential data corruption. Case 1: 1. the maintenance thread pool encounters some issues and calls [stopMaintenanceTask,](https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L774) which further calls [threadPool.stop.](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L587) However, this function doesn't wait for the stop operation to complete and moves on to the state store [unload and clear.](https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L775-L778) 2. the provider unload will [close the state store](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L719-L721) which [clears the values of loadedMaps](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L353-L355) for the HDFS-backed state store. 3. the not-yet-stopped maintenance thread may still be running and trying to do the snapshot, but the data in the underlying `HDFSBackedStateStoreMap` has already been removed.
if this snapshot process completes successfully, then we will write corrupted data and the following batches will consume this corrupted data. Case 2: 1. In executor_1, the maintenance thread is going to do the snapshot for state_store_1, it retrieves the `HDFSBackedStateStoreMap` object from the loadedMaps, after this, the maintenance thread [releases the lock of the loadedMaps](https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L750-L751). 2. state_store_1 is loaded in another executor, e.g. executor_2. 3. another state store, state_store_2, is loaded on executor_1 and [reportActiveStoreInstance](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L854-L871) to driver. 4. executor_1 does the [unload](https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L713) for those no longer active state store which clears the data entries in the `HDFSBackedStateStoreMap` 5. the snapshotting thread is terminated and uploads the incomplete snapshot to cloud because the [iterator doesn't have next element](https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L634) after doing the clear. 6. future batches are consuming the corrupted data. No ``` [info] Run completed in 2 minutes, 55 seconds. [info] Total number of tests run: 153 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 153, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. 
[success] Total time: 271 s (04:31), completed May 2, 2024, 6:26:33 PM ``` before this change ``` [info] - state store unload/close happens during the maintenance *** FAILED *** (648 milliseconds) [info] Vector("a1", "a10", "a11", "a12", "a13", "a14", "a15", "a16", "a17", "a18", "a19", "a2", "a20", "a3", "a4", "a5", "a6", "a7", "a8", "a9") did not equal ArrayBuffer("a8") (StateStoreSuite.scala:414) [info] Analysis: [info] Vector1(0: "a1" -> "a8", 1: "a10" -> , 2: "a11" -> , 3: "a12" -> , 4: "a13" -> , 5: "a14" -> , 6: "a15" -> , 7: "a16" -> , 8: "a17" -> , 9: "a18" -> , 10: "a19" -> , 11: "a2" -> , 12: "a20" -> , 13: "a3" -> , 14: "a4" -> , 15: "a5" -> , 16: "a6" -> , 17: "a7" -> , 18: "a8" -> , 19: "a9" -> ) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at
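The ordering fix described in the PR — stop and wait for the maintenance thread pool before unloading providers, so a straggling snapshot never observes a cleared store — can be sketched in plain Python threading terms (illustrative names, not Spark's classes):

```python
# Sketch of the shutdown ordering only: the maintenance pool is shut down
# with wait=True BEFORE the provider map is cleared, so any in-flight
# snapshot task sees the data intact. Names are illustrative.

import threading
from concurrent.futures import ThreadPoolExecutor

loaded_maps = {"store-1": {"k": "v"}}   # stand-in for loadedMaps
lock = threading.Lock()

def snapshot(store_id):
    # stand-in for the maintenance task uploading a snapshot
    with lock:
        return dict(loaded_maps.get(store_id, {}))

pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(snapshot, "store-1")

# Correct ordering: stop and *wait* for the maintenance pool first...
pool.shutdown(wait=True)

# ...and only then unload the providers.
with lock:
    loaded_maps.clear()
```

Reversing the last two steps reproduces the race: the snapshot task could run against an already-cleared map and upload an empty (corrupted) snapshot.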
Re: [PR] [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser [spark]
cloud-fan commented on code in PR #46400: URL: https://github.com/apache/spark/pull/46400#discussion_r1591749162 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala: ## @@ -67,16 +67,23 @@ case class PartialResultArrayException( extends Exception(cause) /** - * Exception thrown when the underlying parser meet a bad record and can't parse it. + * Exception thrown when the underlying parser meets a bad record and can't parse it. Used for + * control flow between wrapper and underlying parser without overhead of creating a full exception. * @param record a function to return the record that cause the parser to fail * @param partialResults a function that returns an row array, which is the partial results of * parsing this bad record. - * @param cause the actual exception about why the record is bad and can't be parsed. + * @param cause the actual exception about why the record is bad and can't be parsed. It's better + * to use lightweight exception here (e.g. without stacktrace), and throw an + * adequate (user-facing) exception only in case it's necessary after + * unwrapping `BadRecordException` */ case class BadRecordException( @transient record: () => UTF8String, @transient partialResults: () => Array[InternalRow] = () => Array.empty[InternalRow], -cause: Throwable) extends Exception(cause) +cause: Throwable) extends Exception(cause) { Review Comment: `parse_json` is a new function and not release yet, we can still change it. It looks wrong to throw `BadRecordException` as a user-facing error. cc @gene-db ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala: ## @@ -67,16 +67,23 @@ case class PartialResultArrayException( extends Exception(cause) /** - * Exception thrown when the underlying parser meet a bad record and can't parse it. + * Exception thrown when the underlying parser meets a bad record and can't parse it. 
Used for + * control flow between wrapper and underlying parser without overhead of creating a full exception. * @param record a function to return the record that cause the parser to fail * @param partialResults a function that returns an row array, which is the partial results of * parsing this bad record. - * @param cause the actual exception about why the record is bad and can't be parsed. + * @param cause the actual exception about why the record is bad and can't be parsed. It's better + * to use lightweight exception here (e.g. without stacktrace), and throw an + * adequate (user-facing) exception only in case it's necessary after + * unwrapping `BadRecordException` */ case class BadRecordException( @transient record: () => UTF8String, @transient partialResults: () => Array[InternalRow] = () => Array.empty[InternalRow], -cause: Throwable) extends Exception(cause) +cause: Throwable) extends Exception(cause) { Review Comment: `parse_json` is a new function and not released yet, we can still change it. It looks wrong to throw `BadRecordException` as a user-facing error. cc @gene-db
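The lightweight-exception pattern discussed in this thread — a cheap internal exception used for control flow between the parser and its failure-safe wrapper, with a full user-facing error raised only when actually needed — can be sketched as follows (class and mode names are illustrative, not Spark's):

```python
# Sketch of the FailureSafeParser-style control flow: the low-level parser
# raises an internal BadRecordException; the wrapper decides whether to
# swallow it (PERMISSIVE) or surface a user-facing error (FAILFAST).
# On the JVM the "lightweight" part would be overriding fillInStackTrace;
# here we only model the control flow.

class BadRecordException(Exception):
    """Internal control-flow exception carrying the offending record."""
    def __init__(self, record):
        super().__init__()
        self.record = record

def parse(record: str) -> int:
    # toy "underlying parser": accepts only integer records
    if not record.isdigit():
        raise BadRecordException(record)
    return int(record)

def failure_safe_parse(record: str, mode: str = "PERMISSIVE"):
    try:
        return parse(record)
    except BadRecordException as e:
        if mode == "PERMISSIVE":
            return None  # keep the row as null instead of failing the query
        # only now build an adequate, user-facing error
        raise ValueError(f"Malformed record: {e.record!r}")
```

The key point from the review: `BadRecordException` should never escape the wrapper and reach the user directly.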
Re: [PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]
dongjoon-hyun commented on PR #46412: URL: https://github.com/apache/spark/pull/46412#issuecomment-2097332326 For the record, the very next commit (SPARK-48147) shows that `build` step (with 10 sub test pipelines) is skipped completely and successfully. https://github.com/apache/spark/assets/9700541/7574d9e6-29ab-498d-9bac-afbbeb361583
Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]
HyukjinKwon commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1591746209 ## python/pyspark/sql/connect/client/core.py: ## @@ -1157,6 +1163,20 @@ def _analyze_plan_request_with_metadata(self) -> pb2.AnalyzePlanRequest: req.user_context.user_id = self._user_id return req +def _sizeInBytes(self, relation: pb2.Relation) -> AnalyzeResult: Review Comment: `_size_in_bytes`
Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]
HyukjinKwon commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1591746323 ## python/pyspark/sql/dataframe.py: ## @@ -657,6 +657,19 @@ def printSchema(self, level: Optional[int] = None) -> None: """ ... +@dispatch_df_method +def sizeInBytesApproximation(self) -> int: +"""Return approximate size in bytes of a DataFrame. + +.. versionadded:: 4.0.0 + +Returns +--- +int +Size in bytes estimated from optimized plan Review Comment: can we add examples
Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]
HyukjinKwon commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1591745159 ## .gitignore: ## @@ -26,6 +26,7 @@ .scala_dependencies .settings .vscode +.dir-locals.el Review Comment: let's remove this
Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression [spark]
HyukjinKwon commented on PR #46404: URL: https://github.com/apache/spark/pull/46404#issuecomment-2097319430 Can you fill in the PR description, please?
Re: [PR] [SPARK-48147][SS][CONNECT] Remove client side listeners when local Spark session is deleted [spark]
HyukjinKwon closed pull request #46406: [SPARK-48147][SS][CONNECT] Remove client side listeners when local Spark session is deleted URL: https://github.com/apache/spark/pull/46406
Re: [PR] [SPARK-48147][SS][CONNECT] Remove client side listeners when local Spark session is deleted [spark]
HyukjinKwon commented on PR #46406: URL: https://github.com/apache/spark/pull/46406#issuecomment-2097317802 Merged to master.
Re: [PR] [SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply being semantic equal to add/multiply [spark]
db-scnakandala commented on PR #46414: URL: https://github.com/apache/spark/pull/46414#issuecomment-2097313695 cc: @cloud-fan
Re: [PR] SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply being semantic equal to add/multiply [spark]
db-scnakandala closed pull request #46413: SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply being semantic equal to add/multiply URL: https://github.com/apache/spark/pull/46413
Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]
panbingkun commented on code in PR #46402: URL: https://github.com/apache/spark/pull/46402#discussion_r1591736775 ## dev/test-dependencies.sh: ## @@ -31,7 +31,7 @@ export LC_ALL=C # NOTE: These should match those in the release publishing script, and be kept in sync with # dev/create-release/release-build.sh HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \ --Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud" +-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler" Review Comment: > ~Like `Kafka` module, we should not include this here.~ > > Do you know how we skip Kafka module's dependency here? I already know about it. It is implemented through `maven scope` overloading. Let me try.
Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]
dongjoon-hyun commented on code in PR #46402: URL: https://github.com/apache/spark/pull/46402#discussion_r1591736077 ## connector/profiler/pom.xml: ## @@ -44,7 +44,7 @@ me.bechberger ap-loader-all - 3.0-8 + 3.0-9 Review Comment: Could you proceed with this dependency update separately?
Re: [PR] [SPARK-48131][Core] Unify MDC key `mdc.taskName` and `task_name` [spark]
gengliangwang commented on PR #46386: URL: https://github.com/apache/spark/pull/46386#issuecomment-2097287397 > That change was not visible to end users, as there was no release made - right? Yes, you are right. Are you ok with the changes in this PR? If not, please let me know your preference and I can check with @cloud-fan to see if we still have time to make further changes.
Re: [PR] [SPARK-48035][SQL] Fix try_add/try_multiply being semantic equal to add/multiply [spark]
db-scnakandala commented on code in PR #46307: URL: https://github.com/apache/spark/pull/46307#discussion_r1591732236 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -452,13 +452,14 @@ case class Add( copy(left = newLeft, right = newRight) override lazy val canonicalized: Expression = { -// TODO: do not reorder consecutive `Add`s with different `evalMode` -val reorderResult = buildCanonicalizedPlan( +val evalModes = collectEvalModes(this, {case Add(_, _, evalMode) => Seq(evalMode)}) +lazy val reorderResult = buildCanonicalizedPlan( { case Add(l, r, _) => Seq(l, r) }, Review Comment: That is neat! I will create a follow-up PR.
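The point of the fix discussed in this thread is that canonicalization must take the evaluation mode into account: `try_add` and `add` differ on overflow, so two `Add` trees are semantically equal only if their `evalMode`s match. A toy sketch of a canonical form that includes the mode (purely illustrative, not Spark's `canonicalized`):

```python
# Toy canonicalization for a commutative operator: operands may be
# reordered, but the eval mode is part of the canonical key, so an
# ANSI add and a TRY add can never compare as semantically equal.

def canonicalize(op: str, operands: list, eval_mode: str) -> tuple:
    # sorting makes operand order irrelevant (commutativity);
    # including eval_mode keeps try_add distinct from add
    return (op, eval_mode, tuple(sorted(operands)))
```

With the mode omitted, `try_add(a, b)` and `a + b` would canonicalize identically, which is exactly the bug the PR addresses.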
Re: [PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]
dongjoon-hyun commented on PR #46412: URL: https://github.com/apache/spark/pull/46412#issuecomment-2097284326 Merged to master.
Re: [PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]
dongjoon-hyun closed pull request #46412: [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed URL: https://github.com/apache/spark/pull/46412
Re: [PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]
dongjoon-hyun commented on PR #46412: URL: https://github.com/apache/spark/pull/46412#issuecomment-2097282839 Thank you, @HyukjinKwon !
Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]
panbingkun commented on code in PR #46402: URL: https://github.com/apache/spark/pull/46402#discussion_r1591730444 ## dev/test-dependencies.sh: ## @@ -31,7 +31,7 @@ export LC_ALL=C # NOTE: These should match those in the release publishing script, and be kept in sync with # dev/create-release/release-build.sh HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \ --Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud" +-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler" Review Comment: Yeah, let me add a detailed guide in some document.
Re: [PR] [SPARK-48141][TEST] Update the Oracle docker image version used for test and integration to use Oracle Database 23ai Free [spark]
yaooqinn commented on code in PR #46399: URL: https://github.com/apache/spark/pull/46399#discussion_r1591728412 ## .github/workflows/build_and_test.yml: ## @@ -929,7 +929,7 @@ jobs: HIVE_PROFILE: hive2.3 GITHUB_PREV_SHA: ${{ github.event.before }} SPARK_LOCAL_IP: localhost - ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-free:23.3 + ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-free:23.4 Review Comment: +1 for removal.
Re: [PR] [SPARK-48112][CONNECT] Expose session in SparkConnectPlanner to plugins [spark]
dongjoon-hyun commented on PR #46363: URL: https://github.com/apache/spark/pull/46363#issuecomment-2097265334 Thank you. +1, LGTM.
Re: [PR] [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that requires JVM access [spark]
dongjoon-hyun commented on PR #46411: URL: https://github.com/apache/spark/pull/46411#issuecomment-2097263320 Merged to branch-3.5.
Re: [PR] [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that requires JVM access [spark]
dongjoon-hyun closed pull request #46411: [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that requires JVM access URL: https://github.com/apache/spark/pull/46411
Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]
dongjoon-hyun commented on code in PR #46402: URL: https://github.com/apache/spark/pull/46402#discussion_r1591726179 ## dev/test-dependencies.sh: ## @@ -31,7 +31,7 @@ export LC_ALL=C # NOTE: These should match those in the release publishing script, and be kept in sync with # dev/create-release/release-build.sh HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \ --Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud" +-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler" Review Comment: Yes, it's identical with Apache Spark's Kafka module and Apache Spark Hadoop Cloud module - https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10_2.13/3.5.1 - https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud_2.13/3.5.1
Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]
panbingkun commented on code in PR #46402: URL: https://github.com/apache/spark/pull/46402#discussion_r1591725780 ## dev/test-dependencies.sh: ## @@ -31,7 +31,7 @@ export LC_ALL=C # NOTE: These should match those in the release publishing script, and be kept in sync with # dev/create-release/release-build.sh HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \ --Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud" +-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler" Review Comment: > ~Like `Kafka` module, we should not include this here.~ > > Do you know how we skip Kafka module's dependency here? Let me investigate it.
Re: [PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]
dongjoon-hyun commented on PR #46412: URL: https://github.com/apache/spark/pull/46412#issuecomment-2097252456 Could you review this PR, @HyukjinKwon ?
[PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]
dongjoon-hyun opened a new pull request, #46412: URL: https://github.com/apache/spark/pull/46412 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-48152][BUILD] Make the module `spark-profiler` as a part of Spark release [spark]
panbingkun commented on code in PR #46402: URL: https://github.com/apache/spark/pull/46402#discussion_r1591720244 ## dev/test-dependencies.sh: ## @@ -31,7 +31,7 @@ export LC_ALL=C # NOTE: These should match those in the release publishing script, and be kept in sync with # dev/create-release/release-build.sh HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \ --Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud" +-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler" Review Comment: Okay, I will remove it. But I have a question. If the `end user` wants to use `spark-profiler` to analyze the `executor`, does he download 'me.bechberger:ap-loader-all:xxx' from the `maven central repository` and use it together with the `spark-profiler` module? If that's how it is meant to be used, I am fine with it. Maybe we need to add a detailed guide in some document?
Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]
HeartSaVioR commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1591717075 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUpdateEventTimeWatermarkColumn.scala: ## @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.plans.logical.{EventTimeWatermark, LogicalPlan, UpdateEventTimeWatermarkColumn} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.errors.QueryCompilationErrors + +/** + * Extracts the watermark delay and adds it to the UpdateEventTimeWatermarkColumn + * logical node (if such a node is present). [[UpdateEventTimeWatermarkColumn]] node updates + * the eventTimeColumn for upstream operators. + * + * If the logical plan contains a [[UpdateEventTimeWatermarkColumn]] node, but no watermark + * has been defined, the query will fail with a compilation error.
+ */ +object ResolveUpdateEventTimeWatermarkColumn extends Rule[LogicalPlan] { + + override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsDown { Review Comment: Please use the pruning version: add a pattern for this operator and check it with containsPattern. For example, here's the code snippet from TimeWindowing. ``` def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning( _.containsPattern(TIME_WINDOW), ruleId) { ```
Re: [PR] [SPARK-48035][SQL] Fix try_add/try_multiply being semantic equal to add/multiply [spark]
cloud-fan commented on code in PR #46307: URL: https://github.com/apache/spark/pull/46307#discussion_r1591716254 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -452,13 +452,14 @@ case class Add( copy(left = newLeft, right = newRight) override lazy val canonicalized: Expression = { -// TODO: do not reorder consecutive `Add`s with different `evalMode` -val reorderResult = buildCanonicalizedPlan( +val evalModes = collectEvalModes(this, {case Add(_, _, evalMode) => Seq(evalMode)}) +lazy val reorderResult = buildCanonicalizedPlan( { case Add(l, r, _) => Seq(l, r) }, Review Comment: shall we simply add a check here? `case Add(l, r, em) if em == evalMode`
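As a minimal, self-contained sketch of the guarded match cloud-fan suggests: flatten only consecutive `Add`s that share the root's `evalMode`, so a nested `try_add` is kept as an opaque child instead of being reordered together with regular adds. Note that `Expr`, `Leaf`, and `MyAdd` below are hypothetical stand-ins for Catalyst's `Expression`/`Add`/`EvalMode`, not the real Spark API.

```scala
// Hypothetical stand-ins for Catalyst's Expression / Add / EvalMode.
sealed trait Expr
case class Leaf(name: String) extends Expr
case class MyAdd(l: Expr, r: Expr, evalMode: String) extends Expr

// Flatten only children whose evalMode matches the root's, mirroring
// the suggested guard `case Add(l, r, em) if em == evalMode => Seq(l, r)`.
def flattenSameMode(root: MyAdd): Seq[Expr] = {
  def go(e: Expr): Seq[Expr] = e match {
    case MyAdd(l, r, em) if em == root.evalMode => go(l) ++ go(r)
    case other => Seq(other) // different evalMode (or a leaf): stop flattening here
  }
  go(root)
}

val mixed = MyAdd(Leaf("a"), MyAdd(Leaf("b"), Leaf("c"), "TRY"), "LEGACY")
// The TRY subtree stays intact rather than being merged into the reorder pool:
// flattenSameMode(mixed) == Seq(Leaf("a"), MyAdd(Leaf("b"), Leaf("c"), "TRY"))
```

With the guard in place, canonicalization can no longer reorder a `try_add` subtree into a plain `add`, which is the semantic-equality bug this PR addresses.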
Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]
HeartSaVioR commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1591716219 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUpdateEventTimeWatermarkColumn.scala: ## @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.plans.logical.{EventTimeWatermark, LogicalPlan, UpdateEventTimeWatermarkColumn} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.errors.QueryCompilationErrors + +/** + * Extracts the watermark delay and adds it to the UpdateEventTimeWatermarkColumn + * logical node (if such a node is present). [[UpdateEventTimeWatermarkColumn]] node updates + * the eventTimeColumn for upstream operators. + * + * If the logical plan contains a [[UpdateEventTimeWatermarkColumn]] node, but no watermark + * has been defined, the query will fail with a compilation error.
+ */ +object ResolveUpdateEventTimeWatermarkColumn extends Rule[LogicalPlan] { + + override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsDown { +case u: UpdateEventTimeWatermarkColumn => Review Comment: It is also probably good practice to make the rule match at most once, i.e. no longer match after it has been applied; e.g. if we set the delay in the rule, check the delay in the condition.
Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]
HeartSaVioR commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1591715505 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUpdateEventTimeWatermarkColumn.scala: ## @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.plans.logical.{EventTimeWatermark, LogicalPlan, UpdateEventTimeWatermarkColumn} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.errors.QueryCompilationErrors + +/** + * Extracts the watermark delay and adds it to the UpdateEventTimeWatermarkColumn + * logical node (if such a node is present). [[UpdateEventTimeWatermarkColumn]] node updates + * the eventTimeColumn for upstream operators. + * + * If the logical plan contains a [[UpdateEventTimeWatermarkColumn]] node, but no watermark + * has been defined, the query will fail with a compilation error.
+ */ +object ResolveUpdateEventTimeWatermarkColumn extends Rule[LogicalPlan] { + + override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsDown { +case u: UpdateEventTimeWatermarkColumn => Review Comment: If we are not clear about the child (or sub-tree) at this moment, we shouldn't apply this rule. Use `if u.childrenResolved` to apply the rule selectively.
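The guards suggested in this review thread (only fire when the child subtree is resolved, and only until the delay has been set) can be sketched as a toy model. `Node` below is a hypothetical stand-in for the `UpdateEventTimeWatermarkColumn` logical node, not the Catalyst API.

```scala
// Hypothetical stand-in for the UpdateEventTimeWatermarkColumn logical node.
case class Node(childrenResolved: Boolean, delay: Option[Long])

// The rule fires only when (a) the child subtree is resolved and
// (b) the delay has not been set yet, so a second pass is a no-op.
def applyRule(u: Node, watermarkDelayMs: Long): Node =
  if (u.childrenResolved && u.delay.isEmpty) u.copy(delay = Some(watermarkDelayMs))
  else u

val unresolved = Node(childrenResolved = false, delay = None)
val resolved   = Node(childrenResolved = true, delay = None)
// applyRule(unresolved, 5000L) leaves the node untouched;
// applyRule(resolved, 5000L) sets the delay, and applying it again changes nothing.
```

Making the rule a no-op once it has taken effect keeps it safe under fixed-point iteration, where the analyzer may run the same batch of rules several times.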
Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]
HeartSaVioR commented on code in PR #45376: URL: https://github.com/apache/spark/pull/45376#discussion_r1591714252 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -331,6 +331,7 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor Seq( ResolveWithCTE, ExtractDistributedSequenceID) ++ + Seq(ResolveUpdateEventTimeWatermarkColumn) ++ Review Comment: It's not guaranteed that all resolutions happen in one loop; fixedPoint means having iterations of application of the set of rules. That said, we still shouldn't assume that the child is resolved, and should only make the rule effective when the child is resolved. I'll leave a comment.
Re: [PR] [SPARK-48152][BUILD] Make the module `spark-profiler` as a part of Spark release [spark]
dongjoon-hyun commented on code in PR #46402: URL: https://github.com/apache/spark/pull/46402#discussion_r1591713237 ## dev/test-dependencies.sh: ## @@ -31,7 +31,7 @@ export LC_ALL=C # NOTE: These should match those in the release publishing script, and be kept in sync with # dev/create-release/release-build.sh HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \ --Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud" +-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler" Review Comment: ``` $ grep kafka dev/deps/spark-deps-hadoop-3-hive-2.3 | wc -l 0 ```
Re: [PR] [SPARK-48152][BUILD] Make the module `spark-profiler` as a part of Spark release [spark]
dongjoon-hyun commented on code in PR #46402: URL: https://github.com/apache/spark/pull/46402#discussion_r1591711449 ## dev/test-dependencies.sh: ## @@ -31,7 +31,7 @@ export LC_ALL=C # NOTE: These should match those in the release publishing script, and be kept in sync with # dev/create-release/release-build.sh HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \ --Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud" +-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler" Review Comment: ~Like `Kafka` module, we should not include this here.~ Do you know how we skip Kafka module's dependency here?
Re: [PR] [SPARK-48152][BUILD] Make the module `spark-profiler` as a part of Spark release [spark]
dongjoon-hyun commented on code in PR #46402: URL: https://github.com/apache/spark/pull/46402#discussion_r1591711449 ## dev/test-dependencies.sh: ## @@ -31,7 +31,7 @@ export LC_ALL=C # NOTE: These should match those in the release publishing script, and be kept in sync with # dev/create-release/release-build.sh HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \ --Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud" +-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler" Review Comment: Like `Kafka` module, we should not include this here.
Re: [PR] [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` [spark]
dongjoon-hyun commented on PR #46410: URL: https://github.com/apache/spark/pull/46410#issuecomment-2097195862 Merged to master.
Re: [PR] [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` [spark]
dongjoon-hyun closed pull request #46410: [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` URL: https://github.com/apache/spark/pull/46410
[PR] [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that requires JVM access [spark]
HyukjinKwon opened a new pull request, #46411: URL: https://github.com/apache/spark/pull/46411 ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/46334 that missed one more test case. ### Why are the changes needed? See https://github.com/apache/spark/pull/46334 ### Does this PR introduce _any_ user-facing change? See https://github.com/apache/spark/pull/46334 ### How was this patch tested? Manually ### Was this patch authored or co-authored using generative AI tooling? No.
Re: [PR] [SPARK-48150][SQL] try_parse_json output should be declared as nullable [spark]
dongjoon-hyun commented on PR #46409: URL: https://github.com/apache/spark/pull/46409#issuecomment-2097193827 Wow! Thank you for the fix.
Re: [PR] [SPARK-48150][SQL] try_parse_json output should be declared as nullable [spark]
JoshRosen commented on PR #46409: URL: https://github.com/apache/spark/pull/46409#issuecomment-2097192426 It's not flaky, it's a legitimate test failure: ``` [info] - function_try_parse_json *** FAILED *** (5 milliseconds) [info] Expected and actual plans do not match: [info] [info] === Expected Plan === [info] Project [staticinvoke(class org.apache.spark.sql.catalyst.expressions.variant.VariantExpressionEvalUtils$, VariantType, parseJson, g#0, false, StringType, BooleanType, true, false, true) AS try_parse_json(g)#0] [info] +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0] [info] [info] [info] === Actual Plan === [info] Project [staticinvoke(class org.apache.spark.sql.catalyst.expressions.variant.VariantExpressionEvalUtils$, VariantType, parseJson, g#0, false, StringType, BooleanType, true, true, true) AS try_parse_json(g)#0] [info] +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0] (ProtoToParsedPlanTestSuite.scala:205) ```
Re: [PR] [SPARK-47777][PYTHON][SS][TESTS] Add spark connect test for python streaming data source [spark]
HyukjinKwon commented on PR #45950: URL: https://github.com/apache/spark/pull/45950#issuecomment-2097190957 follow https://github.com/apache/spark/blob/master/.github/workflows/build_python_connect.yml#L80-L113 to reproduce the failure
Re: [PR] [SPARK-48142][PYTHON][CONNECT][TESTS] Enable `CogroupedApplyInPandasTests.test_wrong_args` [spark]
zhengruifeng commented on PR #46397: URL: https://github.com/apache/spark/pull/46397#issuecomment-2097185722 thanks @xinrong-meng merged to master
Re: [PR] [SPARK-48142][PYTHON][CONNECT][TESTS] Enable `CogroupedApplyInPandasTests.test_wrong_args` [spark]
zhengruifeng closed pull request #46397: [SPARK-48142][PYTHON][CONNECT][TESTS] Enable `CogroupedApplyInPandasTests.test_wrong_args` URL: https://github.com/apache/spark/pull/46397
Re: [PR] [SPARK-48134][CORE] Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]
panbingkun commented on PR #46390: URL: https://github.com/apache/spark/pull/46390#issuecomment-2097183311 @gengliangwang All done. Thanks.
Re: [PR] [SPARK-48131][Core] Unify MDC key `mdc.taskName` and `task_name` [spark]
mridulm commented on PR #46386: URL: https://github.com/apache/spark/pull/46386#issuecomment-2097182888 That change was not visible to end users, as no release was made - right?
Re: [PR] [SPARK-48035][SQL] Fix try_add/try_multiply being semantic equal to add/multiply [spark]
HyukjinKwon closed pull request #46307: [SPARK-48035][SQL] Fix try_add/try_multiply being semantic equal to add/multiply URL: https://github.com/apache/spark/pull/46307
Re: [PR] [SPARK-48035][SQL] Fix try_add/try_multiply being semantic equal to add/multiply [spark]
HyukjinKwon commented on PR #46307: URL: https://github.com/apache/spark/pull/46307#issuecomment-2097170436 Merged to master.
Re: [PR] [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` [spark]
dongjoon-hyun commented on PR #46410: URL: https://github.com/apache/spark/pull/46410#issuecomment-2097169044 Thank you, @HyukjinKwon !
Re: [PR] [SPARK-48150][SQL] try_parse_json output should be declared as nullable [spark]
dongjoon-hyun commented on PR #46409: URL: https://github.com/apache/spark/pull/46409#issuecomment-2097168838 According to the CI result, `ProtoToParsedPlanTestSuite` seems to have become flaky too.

```
[info] *** 1 TEST FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite
```
Re: [PR] [MINOR][SQL][DOCS] Correct comments for UnresolvedRelation [spark]
HyukjinKwon closed pull request #46319: [MINOR][SQL][DOCS] Correct comments for UnresolvedRelation URL: https://github.com/apache/spark/pull/46319
Re: [PR] Add collation support to `mode` [spark]
GideonPotok closed pull request #46403: Add collation support to `mode` URL: https://github.com/apache/spark/pull/46403
Re: [PR] [MINOR][SQL][DOCS] Correct comments for UnresolvedRelation [spark]
HyukjinKwon commented on PR #46319: URL: https://github.com/apache/spark/pull/46319#issuecomment-2097167163 Merged to master.
[PR] [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` [spark]
dongjoon-hyun opened a new pull request, #46410: URL: https://github.com/apache/spark/pull/46410 …

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-48105][SS] Fix the race condition between state store unloading and snapshotting [spark]
HeartSaVioR commented on PR #46351: URL: https://github.com/apache/spark/pull/46351#issuecomment-2097153000 3.5 has a conflict - probably 3.4 would also have a conflict. @huanliwang-db Could you please help submit backport PRs for these branches? Thanks in advance!
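When a merge to a release branch conflicts like this, the usual remedy is a manual backport: cherry-pick the master commit onto the release branch and open a PR against it. The sketch below demonstrates the cherry-pick step in a throwaway repository (in Spark the remote would be `apache` and the target `branch-3.5`; paths and messages here are illustrative):

```shell
# Demo of the backport cherry-pick: a fix lands on master after the release
# branch forked, then gets replayed onto the release branch.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b master
git -c user.email=a@b -c user.name=ci commit -q --allow-empty -m "base"
git branch branch-3.5                   # release branch forks at the base commit
echo "fix race condition" > fix.txt     # the fix is committed on master first
git add fix.txt
git -c user.email=a@b -c user.name=ci commit -q -m "[SPARK-48105][SS] Fix race"
sha=$(git rev-parse HEAD)
git checkout -q branch-3.5
# The backport step; on a real conflict, resolve and `git cherry-pick --continue`
git -c user.email=a@b -c user.name=ci cherry-pick "$sha" >/dev/null
cat fix.txt
```

The resulting branch is then pushed and proposed as a PR titled against the release line (Spark convention adds the branch to the title, e.g. `[SPARK-48105][SS][3.5] ...`).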
Re: [PR] [SPARK-48105][SS] Fix the race condition between state store unloading and snapshotting [spark]
HeartSaVioR closed pull request #46351: [SPARK-48105][SS] Fix the race condition between state store unloading and snapshotting URL: https://github.com/apache/spark/pull/46351
Re: [PR] [SPARK-48105][SS] Fix the race condition between state store unloading and snapshotting [spark]
HeartSaVioR commented on PR #46351: URL: https://github.com/apache/spark/pull/46351#issuecomment-2097151349 Thanks! Merging to master/3.5/3.4.
Re: [PR] [SPARK-47777][PYTHON][SS][TESTS] Add spark connect test for python streaming data source [spark]
HyukjinKwon commented on PR #45950: URL: https://github.com/apache/spark/pull/45950#issuecomment-2097139796 py4j shouldn't be referenced for the Connect test. Can we move them, and import only when actually used?
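The deferred-import pattern being asked for can be sketched as follows. The function and flag names below are illustrative, not Spark's actual API; the point is that `py4j` is only imported inside the code path that needs a JVM, so a pure Spark Connect test never triggers the import:

```python
# Sketch: move a module-level py4j import into the branch that uses it.
def handle_legacy_error(exc, *, is_remote: bool) -> str:
    if is_remote:
        # Connect path: no JVM gateway, so py4j must not be imported here.
        return str(exc)
    # Classic path: import py4j lazily, only when a JVM is actually involved.
    from py4j.protocol import Py4JJavaError  # deferred import
    if isinstance(exc, Py4JJavaError):
        return exc.java_exception.toString()
    return str(exc)
```

With this shape, calling the function with `is_remote=True` works even in an environment where `py4j` is not installed at all.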
Re: [PR] [SPARK-48112][CONNECT] Expose session in SparkConnectPlanner to plugins [spark]
HyukjinKwon closed pull request #46363: [SPARK-48112][CONNECT] Expose session in SparkConnectPlanner to plugins URL: https://github.com/apache/spark/pull/46363
Re: [PR] [SPARK-46848] XML: Enhance XML bad record handling with partial results support [spark]
github-actions[bot] closed pull request #44875: [SPARK-46848] XML: Enhance XML bad record handling with partial results support URL: https://github.com/apache/spark/pull/44875
Re: [PR] [WIP] [SPARK-46884] Spark Connect - ExecutePlanRequest new property - job description [spark]
github-actions[bot] commented on PR #44909: URL: https://github.com/apache/spark/pull/44909#issuecomment-2097136401 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-48112][CONNECT] Expose session in SparkConnectPlanner to plugins [spark]
HyukjinKwon commented on PR #46363: URL: https://github.com/apache/spark/pull/46363#issuecomment-2097136188 Merged to master.