Re: [PR] [SPARK-48134][CORE] Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-06 Thread via GitHub


gengliangwang commented on code in PR #46390:
URL: https://github.com/apache/spark/pull/46390#discussion_r1591851400


##
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java:
##
@@ -17,21 +17,22 @@
 
 package org.apache.spark.util.collection.unsafe.sort;
 
+import java.io.*;

Review Comment:
   let's revert the unnecessary changes in this file



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48049][BUILD] Upgrade Scala to 2.13.14 [spark]

2024-05-06 Thread via GitHub


LuciferYang commented on PR #46288:
URL: https://github.com/apache/spark/pull/46288#issuecomment-2097501153

   I think we need to wait for a new Ammonite release, as 3.0-M1 does not 
support Scala 2.13.14.
   
   
![image](https://github.com/apache/spark/assets/1475305/675a2da1-7c7b-476b-88e0-31d9086f6701)
   





[PR] [SPARK-48090][SS][PYTHON][TESTS] Shorten the traceback in the test checking error message in UDF [spark]

2024-05-06 Thread via GitHub


HyukjinKwon opened a new pull request, #46426:
URL: https://github.com/apache/spark/pull/46426

   ### What changes were proposed in this pull request?
   
   This PR reduces the traceback so that the actual error, `ZeroDivisionError`, 
can be tested in 
`pyspark.sql.tests.connect.streaming.test_parity_streaming.StreamingParityTests.test_stream_exception`.
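   The general idea, keeping only the tail of a deep traceback so the root error 
stays visible, can be sketched in plain Python (an illustrative analogue, not 
the PR's actual change; a negative `limit` makes the `traceback` module keep 
only the innermost frames):

   ```python
   import traceback

   def tail_of_traceback(exc, frames=1):
       # Format the exception keeping only the last `frames` stack entries;
       # a negative limit tells the traceback module to keep the tail.
       return "".join(
           traceback.format_exception(type(exc), exc, exc.__traceback__, limit=-frames)
       )

   def failing_udf():
       return 1 / 0  # the root error we actually want to see

   def deep_call():
       return failing_udf()

   try:
       deep_call()
   except ZeroDivisionError as e:
       short = tail_of_traceback(e)

   # The root error survives even though intermediate frames are dropped.
   assert "ZeroDivisionError" in short
   ```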
   
   ### Why are the changes needed?
   
   So that a long traceback doesn't affect the test case. It can fail as below:
   
   ```
   ==
   FAIL [1.883s]: test_stream_exception (pyspark.sql.tests.connect.streaming.test_parity_streaming.StreamingParityTests.test_stream_exception)
   --
   Traceback (most recent call last):
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 287, in test_stream_exception
       sq.processAllAvailable()
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py", line 129, in processAllAvailable
       self._execute_streaming_query_cmd(cmd)
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py", line 177, in _execute_streaming_query_cmd
       (_, properties) = self._session.client.execute_command(exec_cmd)
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 982, in execute_command
       data, _, _, _, properties = self._execute_and_fetch(req)
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1283, in _execute_and_fetch
       for response in self._execute_and_fetch_as_iterator(req):
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1264, in _execute_and_fetch_as_iterator
       self._handle_error(error)
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1503, in _handle_error
       self._handle_rpc_error(error)
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1539, in _handle_rpc_error
       raise convert_exception(info, status.message) from None
   pyspark.errors.exceptions.connect.StreamingQueryException: [STREAM_FAILED] Query [id = 1c0c440d-0b48-41b1-9a03-071e7e13de82, runId = 692ec338-963a-43b1-89cb-2a8b7cb1e21a] terminated with exception: Job aborted due to stage failure: Task 0 in stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 (TID 58) (fv-az714-234.22nzjvkrszmuhkvqy55p1tioig.phxx.internal.cloudapp.net executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1834, in main
       process()
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1826, in process
       serializer.dump_stream(out_iter, outfile)
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 224, in dump_stream
       self.serializer.dump_stream(self._batched(iterator), stream)
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 145, in dump_stream
       for obj in iterator:
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 213, in _batched
       for item in iterator:
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1734, in mapper
       result = tuple(f(*[a[o] for o in arg_offsets]) for arg_offsets, f in udfs)
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1734, in <genexpr>
       result = tuple(f(*[a[o] for o in arg_offsets]) for arg_offsets, f in udfs)
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 112, in <lambda>
       return args_kwargs_offsets, lambda *a: func(*a)
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 134, in wrapper
       return f(*args, **kwargs)
     File "/home/runner/work/spark/spark-3

   During handling of the above exception, another exception occurred:

   Traceback (most recent call last):
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 291, in test_stream_exception
       self._assert_exception_tree_contains_msg(e, "ZeroDivisionError")
     File 
   ```

Re: [PR] [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46273:
URL: https://github.com/apache/spark/pull/46273#issuecomment-2097474800

   > Looks like this missed the 3.4 release by a month @dongjoon-hyun ... might 
have been a nice addition to it !
   
   Ya, I agree and this is still a good addition to 4.0.0-preview.
   
   Cc @cloud-fan as the release manager of 4.0.0-preview





Re: [PR] [SPARK-48131][Core] Unify MDC key `mdc.taskName` and `task_name` [spark]

2024-05-06 Thread via GitHub


gengliangwang commented on PR #46386:
URL: https://github.com/apache/spark/pull/46386#issuecomment-2097446465

   @mridulm As the task name MDC shows up frequently in the logs, I would say 
this is necessary for the new logging framework. After the renaming, the MDC 
names are consistent and simpler to use. Otherwise, it would be 
`executor_id`/`worker_id`/`task_id`/etc. alongside an odd `mdc.taskName`.
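   For readers unfamiliar with MDC-style keys, an illustrative Python analogue 
(Spark's actual framework uses Java/Scala MDC with log4j; the class and key 
names below are hypothetical and only mirror the snake_case convention being 
discussed):

   ```python
   import logging

   class ContextFilter(logging.Filter):
       # Injects MDC-like key/value pairs into every log record, so a
       # structured layout can reference them uniformly, e.g. %(task_name)s.
       def __init__(self, context):
           super().__init__()
           self.context = context

       def filter(self, record):
           for key, value in self.context.items():
               setattr(record, key, value)
           return True

   logger = logging.getLogger("executor")
   handler = logging.StreamHandler()
   # Consistent snake_case keys: task_name instead of an odd mdc.taskName.
   handler.setFormatter(
       logging.Formatter("%(levelname)s task=%(task_name)s executor=%(executor_id)s %(message)s")
   )
   logger.addHandler(handler)
   logger.addFilter(ContextFilter({"task_name": "task 0.0 in stage 39.0", "executor_id": "driver"}))
   logger.warning("Lost task")
   ```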





Re: [PR] [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-06 Thread via GitHub


mridulm commented on PR #46273:
URL: https://github.com/apache/spark/pull/46273#issuecomment-2097438661

   Looks like this missed the 3.4 release @dongjoon-hyun ... might have been a 
nice addition to it !





Re: [PR] [SPARK-43861][CORE] Do not delete inprogress log [spark]

2024-05-06 Thread via GitHub


mridulm commented on PR #46025:
URL: https://github.com/apache/spark/pull/46025#issuecomment-2097435499

   `reader.completed` is checking for the `IN_PROGRESS` suffix - which will be 
the case here, @bluzy - and so with this PR, it will never clean up those files.
   (Some users/deployments have out-of-band workflows that do the `move` to 
remove the `IN_PROGRESS` suffix, but those are not part of Apache Spark itself.)
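   To make the concern concrete, a toy sketch of the cleanup predicate 
(illustrative only; the names are hypothetical, not Spark's actual 
history-server code):

   ```python
   IN_PROGRESS_SUFFIX = ".inprogress"

   def completed(path: str) -> bool:
       # Mirrors the reader.completed check discussed above: a log counts as
       # "completed" only once the in-progress suffix has been removed.
       return not path.endswith(IN_PROGRESS_SUFFIX)

   def should_clean(path: str, age_days: float, max_age_days: float) -> bool:
       # With the PR's guard, an in-progress log is never cleaned up,
       # even if it was abandoned long ago - the concern raised above.
       return completed(path) and age_days > max_age_days

   assert should_clean("eventlog/app-20240506", 30, 7) is True
   assert should_clean("eventlog/app-20240506.inprogress", 30, 7) is False
   ```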
   





Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46425:
URL: https://github.com/apache/spark/pull/46425#issuecomment-2097432716

   Merged to master for Apache Spark 4.0.0-preview.





Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun closed pull request #46425: [SPARK-48163][CONNECT][TESTS] Disable 
`SparkConnectServiceSuite.SPARK-43923: commands send events - 
get_resources_command`
URL: https://github.com/apache/spark/pull/46425





Re: [PR] [SPARK-48131][Core] Unify MDC key `mdc.taskName` and `task_name` [spark]

2024-05-06 Thread via GitHub


mridulm commented on PR #46386:
URL: https://github.com/apache/spark/pull/46386#issuecomment-2097430754

   The question is whether this change is strictly necessary ... if it is, we 
can evaluate it in that context.
   If it is not and is just a nice-to-have, it is better not to make breaking 
changes which can impact users.





Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46425:
URL: https://github.com/apache/spark/pull/46425#issuecomment-2097420985

   Thank you, @yaooqinn .





Re: [PR] [MINOR][CONNECT][TESTS] Improve test failure error message in StreamingParityTests [spark]

2024-05-06 Thread via GitHub


HyukjinKwon closed pull request #46420: [MINOR][CONNECT][TESTS] Improve test 
failure error message in StreamingParityTests
URL: https://github.com/apache/spark/pull/46420





Re: [PR] [MINOR][CONNECT][TESTS] Improve test failure error message in StreamingParityTests [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on PR #46420:
URL: https://github.com/apache/spark/pull/46420#issuecomment-2097420760

   Merged to master.





Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46425:
URL: https://github.com/apache/spark/pull/46425#issuecomment-2097417035

   Thank you, @HyukjinKwon . I believe this will give proper and better 
visibility to this test case issue before `Connect GA`.





[PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun opened a new pull request, #46425:
URL: https://github.com/apache/spark/pull/46425

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` [spark]

2024-05-06 Thread via GitHub


zhengruifeng commented on PR #46418:
URL: https://github.com/apache/spark/pull/46418#issuecomment-2097401736

   thanks @dongjoon-hyun and @HyukjinKwon 





Re: [PR] [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` [spark]

2024-05-06 Thread via GitHub


zhengruifeng commented on PR #46418:
URL: https://github.com/apache/spark/pull/46418#issuecomment-2097401503

   merged to master





Re: [PR] [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` [spark]

2024-05-06 Thread via GitHub


zhengruifeng closed pull request #46418: [SPARK-48154][PYTHON][CONNECT][TESTS] 
Enable `PandasUDFGroupedAggParityTests.test_manual`
URL: https://github.com/apache/spark/pull/46418





[PR] [WIP][SQL] Add collation support for variant expressions [spark]

2024-05-06 Thread via GitHub


uros-db opened a new pull request, #46424:
URL: https://github.com/apache/spark/pull/46424

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [WIP][SQL] Format expressions [spark]

2024-05-06 Thread via GitHub


uros-db opened a new pull request, #46423:
URL: https://github.com/apache/spark/pull/46423

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [WIP][SQL] Add collation support for hash expressions [spark]

2024-05-06 Thread via GitHub


uros-db opened a new pull request, #46422:
URL: https://github.com/apache/spark/pull/46422

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [WIP][SQL] Eliminate unnecessary COLLATE expressions in query analysis [spark]

2024-05-06 Thread via GitHub


uros-db opened a new pull request, #46421:
URL: https://github.com/apache/spark/pull/46421

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48083][SPARK-48084][ML][TESTS] Remove JIRA comments for reenabling ML compatibility tests [spark]

2024-05-06 Thread via GitHub


WeichenXu123 commented on code in PR #46419:
URL: https://github.com/apache/spark/pull/46419#discussion_r1591778164


##
python/pyspark/ml/tests/connect/test_connect_evaluation.py:
##
@@ -20,7 +20,6 @@
 from pyspark.sql import SparkSession
 from pyspark.ml.tests.connect.test_legacy_mode_evaluation import 
EvaluationTestsMixin
 
-# TODO(SPARK-48084): Reenable this test case

Review Comment:
   This test error, `pyspark.ml.connect.evaluation not working in 3.5 client <> 
4.0 server`, is caused by a `cloudpickle` forward incompatibility; it is not 
related to the ML code.






Re: [PR] [SPARK-48083][SPARK-48084][ML][TESTS] Remove JIRA comments for reenabling ML compatibility tests [spark]

2024-05-06 Thread via GitHub


WeichenXu123 commented on code in PR #46419:
URL: https://github.com/apache/spark/pull/46419#discussion_r1591777525


##
python/pyspark/ml/tests/connect/test_connect_classification.py:
##
@@ -21,7 +21,6 @@
 from pyspark.sql import SparkSession
 from pyspark.ml.tests.connect.test_legacy_mode_classification import 
ClassificationTestsMixin
 
-# TODO(SPARK-48083): Reenable this test case

Review Comment:
   `session.copyFromLocalToFs failure with 3.5 client <> 4.0 server`
   
   This is not an issue:
   `copyFromLocalToFs` requires the Spark server to set 
`spark.connect.copyFromLocalToFs.allowDestLocal` to False, because the test can 
only use the local fs.






[PR] [MINOR][CONNECT][TESTS] Improve test failure error message in StreamingParityTests [spark]

2024-05-06 Thread via GitHub


HyukjinKwon opened a new pull request, #46420:
URL: https://github.com/apache/spark/pull/46420

   ### What changes were proposed in this pull request?
   
   This PR improves the test failure error message in `StreamingParityTests`
   
   ### Why are the changes needed?
   
   To see the original exception, and to make debugging easier. It is currently 
difficult; see 
https://github.com/HyukjinKwon/spark/actions/runs/8978991632/job/24660689666#step:9:206
   
   ```
   Traceback (most recent call last):
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 291, in test_stream_exception
       self._assert_exception_tree_contains_msg(e, "ZeroDivisionError")
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 300, in _assert_exception_tree_contains_msg
       self._assert_exception_tree_contains_msg_connect(exception, msg)
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 305, in _assert_exception_tree_contains_msg_connect
       self.assertTrue(
   AssertionError: False is not true : Exception tree doesn't contain the 
expected message: ZeroDivisionError
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   Manually
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser [spark]

2024-05-06 Thread via GitHub


gene-db commented on code in PR #46400:
URL: https://github.com/apache/spark/pull/46400#discussion_r1591776687


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala:
##
@@ -67,16 +67,23 @@ case class PartialResultArrayException(
   extends Exception(cause)
 
 /**
- * Exception thrown when the underlying parser meet a bad record and can't parse it.
+ * Exception thrown when the underlying parser meets a bad record and can't parse it. Used for
+ * control flow between wrapper and underlying parser without overhead of creating a full exception.
  * @param record a function to return the record that cause the parser to fail
  * @param partialResults a function that returns an row array, which is the partial results of
  *  parsing this bad record.
- * @param cause the actual exception about why the record is bad and can't be parsed.
+ * @param cause the actual exception about why the record is bad and can't be parsed. It's better
+ *  to use lightweight exception here (e.g. without stacktrace), and throw an
+ *  adequate (user-facing) exception only in case it's necessary after
+ *  unwrapping `BadRecordException`
  */
 case class BadRecordException(
     @transient record: () => UTF8String,
     @transient partialResults: () => Array[InternalRow] = () => Array.empty[InternalRow],
-    cause: Throwable) extends Exception(cause)
+    cause: Throwable) extends Exception(cause) {

Review Comment:
   Yes, we can change `parse_json` exception behavior. What is the desired 
behavior?
   
   cc @chenhao-db 
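   The lightweight control-flow-exception pattern under discussion can be 
illustrated with a small Python analogue (in the actual Scala code the saving 
comes from suppressing the stacktrace; the point here is the lazy, 
callable-based payload and the wrapper/parser control flow; all names are 
hypothetical):

   ```python
   class BadRecordError(Exception):
       """Control-flow exception carrying the bad record lazily via a callable."""

       def __init__(self, record, cause):
           # Store a callable so the (possibly expensive) record string is only
           # materialized if someone actually inspects the failure.
           self.record = record
           self.cause = cause
           super().__init__()

   def parse(line):
       try:
           return int(line)
       except ValueError as e:
           # Raise the cheap wrapper; a user-facing error is built later,
           # only if the caller decides the failure must surface.
           raise BadRecordError(lambda: line, e) from None

   def failure_safe_parse(lines):
       out = []
       for line in lines:
           try:
               out.append(parse(line))
           except BadRecordError:
               out.append(None)  # PERMISSIVE-style handling: null out bad records
       return out

   result = failure_safe_parse(["1", "x", "3"])
   assert result == [1, None, 3]
   ```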






Re: [PR] [SPARK-48141][TEST] Update the Oracle docker image version used for test and integration to use Oracle Database 23ai Free [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun closed pull request #46399: [SPARK-48141][TEST] Update the Oracle 
docker image version used for test and integration to use Oracle Database 23ai 
Free
URL: https://github.com/apache/spark/pull/46399





Re: [PR] [SPARK-48083][SPARK-48084][ML][TESTS] Remove JIRAs for reenabling ML compatibility tests [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on PR #46419:
URL: https://github.com/apache/spark/pull/46419#issuecomment-2097387986

   Would you mind describing why those skips are legitimate?





Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-06 Thread via GitHub


zhengruifeng commented on code in PR #46368:
URL: https://github.com/apache/spark/pull/46368#discussion_r1591773238


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -283,6 +283,16 @@ class Dataset[T] private[sql] (
   def printSchema(level: Int): Unit = println(schema.treeString(level))
   // scalastyle:on println
 
+  /**
+   * Returns an approximate size in bytes of the Dataset.
+   *
+   * @group basic
+   * @since 4.0.0
+   */
+  def sizeInBytesApproximation(): Long = {

Review Comment:
   It seems the size may exceed `Long.MaxValue`:
   
   
https://github.com/apache/spark/blob/111529e0dea68bf5343cc6aabba53b59e5d21830/core/src/main/scala/org/apache/spark/util/Utils.scala#L1147-L1159
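   The concern can be illustrated in miniature (hypothetical helper names, not Spark code): the linked `Utils.scala` formatting logic works with arbitrary-precision sizes, while the proposed `sizeInBytesApproximation` returns a signed 64-bit long, so out-of-range values need an explicit policy:

   ```python
   # Hypothetical sketch: a plan-size statistic kept as an arbitrary-precision
   # integer can exceed a signed 64-bit long; one assumed policy is to clamp.
   JLONG_MAX = 2**63 - 1

   def to_jlong(size_in_bytes: int) -> int:
       # Clamp instead of silently overflowing the Java long return type.
       return min(size_in_bytes, JLONG_MAX)

   print(to_jlong(1024))        # small sizes pass through unchanged
   print(to_jlong(2**70))       # huge sizes are clamped to JLONG_MAX
   ```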






Re: [PR] [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on code in PR #46418:
URL: https://github.com/apache/spark/pull/46418#discussion_r1591773149


##
python/pyspark/sql/tests/connect/test_parity_pandas_udf_grouped_agg.py:
##
@@ -20,10 +20,11 @@
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
-class PandasUDFGroupedAggParityTests(GroupedAggPandasUDFTestsMixin, 
ReusedConnectTestCase):
-@unittest.skip("Spark Connect does not support convert UNPARSED to 
catalyst types.")

Review Comment:
   Just a question. Which JIRA issue supports this feature?






[PR] [SPARK-48083] [SPARK-48084] Enable spark ml test [spark]

2024-05-06 Thread via GitHub


WeichenXu123 opened a new pull request, #46419:
URL: https://github.com/apache/spark/pull/46419

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [SPARK-48154][PYTHON][CONNECT][TESTS] Enable `PandasUDFGroupedAggParityTests.test_manual` [spark]

2024-05-06 Thread via GitHub


zhengruifeng opened a new pull request, #46418:
URL: https://github.com/apache/spark/pull/46418

   ### What changes were proposed in this pull request?
   Enable `PandasUDFGroupedAggParityTests.test_manual`
   
   
   ### Why are the changes needed?
   for test coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   ### How was this patch tested?
   ci
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no





Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-05-06 Thread via GitHub


HeartSaVioR commented on code in PR #45376:
URL: https://github.com/apache/spark/pull/45376#discussion_r1591767311


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala:
##
@@ -347,6 +347,28 @@ class IncrementalExecution(
   eventTimeWatermarkForEviction = 
inputWatermarkForEviction(m.stateInfo.get)
 )
 
+  // UpdateEventTimeColumnExec is used to tag the eventTime column, and 
validate
+  // emitted rows adhere to watermark in the output of transformWithState.
+  // Hence, this node shares the same watermark value as 
TransformWithStateExec.
+  // However, given that UpdateEventTimeColumnExec does not store any 
state, it
+  // does not have any StateInfo. We simply use the StateInfo of 
transformWithStateExec
+  // to propagate watermark to both UpdateEventTimeColumnExec and 
transformWithStateExec.
+  case UpdateEventTimeColumnExec(eventTime, delay, None,
+SerializeFromObjectExec(serializer,
+t: TransformWithStateExec)) if t.stateInfo.isDefined =>
+
+val stateInfo = t.stateInfo.get
+val eventTimeWatermarkForLateEvents = 
inputWatermarkForLateEvents(stateInfo)
+val eventTimeWatermarkForEviction = 
inputWatermarkForLateEvents(stateInfo)
+
+UpdateEventTimeColumnExec(eventTime, delay, 
eventTimeWatermarkForEviction,

Review Comment:
   Note that this would break a test.






Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-05-06 Thread via GitHub


HeartSaVioR commented on code in PR #45376:
URL: https://github.com/apache/spark/pull/45376#discussion_r1591752313


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala:
##
@@ -347,6 +347,28 @@ class IncrementalExecution(
   eventTimeWatermarkForEviction = 
inputWatermarkForEviction(m.stateInfo.get)
 )
 
+  // UpdateEventTimeColumnExec is used to tag the eventTime column, and 
validate
+  // emitted rows adhere to watermark in the output of transformWithState.
+  // Hence, this node shares the same watermark value as 
TransformWithStateExec.
+  // However, given that UpdateEventTimeColumnExec does not store any 
state, it
+  // does not have any StateInfo. We simply use the StateInfo of 
transformWithStateExec
+  // to propagate watermark to both UpdateEventTimeColumnExec and 
transformWithStateExec.
+  case UpdateEventTimeColumnExec(eventTime, delay, None,
+SerializeFromObjectExec(serializer,
+t: TransformWithStateExec)) if t.stateInfo.isDefined =>
+
+val stateInfo = t.stateInfo.get
+val eventTimeWatermarkForLateEvents = 
inputWatermarkForLateEvents(stateInfo)
+val eventTimeWatermarkForEviction = 
inputWatermarkForLateEvents(stateInfo)
+
+UpdateEventTimeColumnExec(eventTime, delay, 
eventTimeWatermarkForEviction,

Review Comment:
   eventTimeWatermarkForLateEvents?



##
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala:
##
@@ -442,6 +442,16 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
   case EventTimeWatermark(columnName, delay, child) =>
 EventTimeWatermarkExec(columnName, delay, planLater(child)) :: Nil
 
+  case UpdateEventTimeWatermarkColumn(columnName, delay, child) =>
+// we expect watermarkDelay to be resolved before physical planning.
+if (delay.isEmpty) {
+  // This is a sanity check. We should not reach here as delay is 
updated during
+  // query plan resolution in 
[[ResolveUpdateEventTimeWatermarkColumn]] Analyzer rule.
+  throw SparkException.internalError(
+"You hit a query analyzer bug. Please report your query to Spark 
user mailing list.")

Review Comment:
   This error message is basically provided for all internal errors. Could we 
please add one-liner high-level context here if we have? I guess it'd be just 
OK with stack trace, so 2 cents.



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/EventTimeWatermarkExec.scala:
##
@@ -107,25 +109,70 @@ case class EventTimeWatermarkExec(
   }
 
   // Update the metadata on the eventTime column to include the desired delay.
-  override val output: Seq[Attribute] = child.output.map { a =>
-if (a semanticEquals eventTime) {
-  val updatedMetadata = new MetadataBuilder()
-.withMetadata(a.metadata)
-.putLong(EventTimeWatermark.delayKey, delayMs)
-.build()
-  a.withMetadata(updatedMetadata)
-} else if (a.metadata.contains(EventTimeWatermark.delayKey)) {
-  // Remove existing watermark
-  val updatedMetadata = new MetadataBuilder()
-.withMetadata(a.metadata)
-.remove(EventTimeWatermark.delayKey)
-.build()
-  a.withMetadata(updatedMetadata)
-} else {
-  a
-}
+  override val output: Seq[Attribute] = {
+val delayMs = EventTimeWatermark.getDelayMs(delay)
+updateEventTimeColumn(child.output, delayMs, eventTime)
   }
 
   override protected def withNewChildInternal(newChild: SparkPlan): 
EventTimeWatermarkExec =
 copy(child = newChild)
 }
+
+/**
+ * Updates the event time column to [[eventTime]] in the child output.
+ * Any watermark calculations performed after this node will use the
+ * updated eventTimeColumn.
+ *
+ * This node also ensures that output emitted by the child node adheres
+ * to watermark. If the child node emits rows which are older than global
+ * watermark, the node will throw an query execution error and fail the user
+ * query.
+ */
+case class UpdateEventTimeColumnExec(
+eventTime: Attribute,
+delay: CalendarInterval,
+eventTimeWatermarkForEviction: Option[Long],
+child: SparkPlan) extends UnaryExecNode {
+
+  override protected def doExecute(): RDD[InternalRow] = {
+child.execute().mapPartitions[InternalRow] { dataIterator =>
+  val watermarkExpression = WatermarkSupport.watermarkExpression(
+Some(eventTime), eventTimeWatermarkForEviction)
+
+  if (watermarkExpression.isEmpty) {
+// watermark should always be defined in this node.
+throw QueryExecutionErrors.cannotGetEventTimeWatermarkError()
+  }
+
+  val predicate = Predicate.create(watermarkExpression.get, child.output)
+  new Iterator[InternalRow] {
+override def hasNext: Boolean = dataIterator.hasNext
+
+override def next(): InternalRow = {
+  val row = dataIterator.next()

Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]

2024-05-06 Thread via GitHub


panbingkun commented on code in PR #46402:
URL: https://github.com/apache/spark/pull/46402#discussion_r1591763188


##
connector/profiler/pom.xml:
##
@@ -44,7 +44,7 @@
 
   me.bechberger
   ap-loader-all
-  3.0-8
+  3.0-9

Review Comment:
   Okay






Re: [PR] [SPARK-48150][SQL] try_parse_json output should be declared as nullable [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46409:
URL: https://github.com/apache/spark/pull/46409#issuecomment-2097363087

   Merged to master for Apache Spark 4.0.0-preview. Thank you, @JoshRosen and 
all.





Re: [PR] [SPARK-48150][SQL] try_parse_json output should be declared as nullable [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun closed pull request #46409: [SPARK-48150][SQL] try_parse_json 
output should be declared as nullable
URL: https://github.com/apache/spark/pull/46409





[PR] Make some corrections in the docstring of pyspark DataStreamReader methods [spark]

2024-05-06 Thread via GitHub


chloeh13q opened a new pull request, #46416:
URL: https://github.com/apache/spark/pull/46416

   
   
   
   
   ### What changes were proposed in this pull request?
   
   The docstrings of the pyspark DataStreamReader methods `csv()` and `text()` 
say that the `path` parameter can be a list, but an error is actually raised 
when a list is passed.
   
   ### Why are the changes needed?
   
   Documentation is wrong.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Fixes documentation.
   
   ### How was this patch tested?
   
   N/A
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   





[PR] [SPARK-48105][SS][3.5] Fix the race condition between state store unloading and snapshotting [spark]

2024-05-06 Thread via GitHub


huanliwang-db opened a new pull request, #46415:
URL: https://github.com/apache/spark/pull/46415

   
   
   * When we close the hdfs state store, we should only remove the entry from 
`loadedMaps` rather than doing the active data cleanup. JVM GC should be able 
to help us GC those objects.
   * we should wait for the maintenance thread to stop before unloading the 
providers.
   
   There are two race conditions between state store snapshotting and state 
store unloading which could result in query failure and potential data 
corruption.
   
   Case 1:
   1. the maintenance thread pool encounters some issues and call the 
[stopMaintenanceTask,](https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L774)
 this function further calls 
[threadPool.stop.](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L587)
 However, this function doesn't wait for the stop operation to be completed and 
move to do the state store [unload and 
clear.](https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L775-L778)
   2. the provider unload will [close the state 
store](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L719-L721)
 which [clear the values of 
loadedMaps](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L353-L355)
 for HDFS backed state store.
   3. if the not-yet-stop maintenance thread is still running and trying to do 
the snapshot, but the data in the underlying `HDFSBackedStateStoreMap` has been 
removed. if this snapshot process completes successfully, then we will write 
corrupted data and the following batches will consume this corrupted data.
   
   Case 2:
   
   1. In executor_1, the maintenance thread is going to do the snapshot for 
state_store_1, it retrieves the `HDFSBackedStateStoreMap` object from the 
loadedMaps, after this, the maintenance thread [releases the lock of the 
loadedMaps](https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L750-L751).
   2. state_store_1 is loaded in another executor, e.g. executor_2.
   3. another state store, state_store_2, is loaded on executor_1 and 
[reportActiveStoreInstance](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L854-L871)
 to driver.
   4. executor_1 does the 
[unload](https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L713)
 for those no longer active state store which clears the data entries in the 
`HDFSBackedStateStoreMap`
   5. the snapshotting thread is terminated and uploads the incomplete snapshot 
to cloud because the [iterator doesn't have next 
element](https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L634)
 after doing the clear.
   6. future batches are consuming the corrupted data.
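   Both cases reduce to the same ordering problem: if a maintenance (snapshot) task is still running when the loaded map is cleared, it snapshots an empty map. A minimal sketch of the fix's idea — wait for the pool before unloading (hypothetical names, not Spark code):

   ```python
   from concurrent.futures import ThreadPoolExecutor

   # Shared state, standing in for the provider's loadedMaps.
   loaded_map = {"a1": 1, "a2": 2}

   def snapshot():
       # A snapshot taken after the map is cleared would silently be empty.
       return dict(loaded_map)

   pool = ThreadPoolExecutor(max_workers=1)
   future = pool.submit(snapshot)
   pool.shutdown(wait=True)  # the fix's idea: wait for maintenance to finish...
   loaded_map.clear()        # ...before unloading the provider's data

   print(future.result())    # the snapshot saw the complete map
   ```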
   
   No
   
   ```
   [info] Run completed in 2 minutes, 55 seconds.
   [info] Total number of tests run: 153
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 153, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   [success] Total time: 271 s (04:31), completed May 2, 2024, 6:26:33 PM
   ```
   before this change
   
   ```
   [info] - state store unload/close happens during the maintenance *** FAILED 
*** (648 milliseconds)
   [info]   Vector("a1", "a10", "a11", "a12", "a13", "a14", "a15", "a16", 
"a17", "a18", "a19", "a2", "a20", "a3", "a4", "a5", "a6", "a7", "a8", "a9") did 
not equal ArrayBuffer("a8") (StateStoreSuite.scala:414)
   [info]   Analysis:
   [info]   Vector1(0: "a1" -> "a8", 1: "a10" -> , 2: "a11" -> , 3: "a12" -> , 
4: "a13" -> , 5: "a14" -> , 6: "a15" -> , 7: "a16" -> , 8: "a17" -> , 9: "a18" 
-> , 10: "a19" -> , 11: "a2" -> , 12: "a20" -> , 13: "a3" -> , 14: "a4" -> , 
15: "a5" -> , 16: "a6" -> , 17: "a7" -> , 18: "a8" -> , 19: "a9" -> )
   [info]   org.scalatest.exceptions.TestFailedException:
   [info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
   [info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
   [info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
   [info]   at 

Re: [PR] [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser [spark]

2024-05-06 Thread via GitHub


cloud-fan commented on code in PR #46400:
URL: https://github.com/apache/spark/pull/46400#discussion_r1591749162


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala:
##
@@ -67,16 +67,23 @@ case class PartialResultArrayException(
   extends Exception(cause)
 
 /**
- * Exception thrown when the underlying parser meet a bad record and can't 
parse it.
+ * Exception thrown when the underlying parser meets a bad record and can't 
parse it. Used for
+ * control flow between wrapper and underlying parser without overhead of 
creating a full exception.
  * @param record a function to return the record that cause the parser to fail
  * @param partialResults a function that returns an row array, which is the 
partial results of
  *  parsing this bad record.
- * @param cause the actual exception about why the record is bad and can't be 
parsed.
+ * @param cause the actual exception about why the record is bad and can't be 
parsed. It's better
+ *  to use lightweight exception here (e.g. without 
stacktrace), and throw an
+ *  adequate (user-facing) exception only in case it's 
necessary after
+ *  unwrapping `BadRecordException`
  */
 case class BadRecordException(
 @transient record: () => UTF8String,
 @transient partialResults: () => Array[InternalRow] = () => 
Array.empty[InternalRow],
-cause: Throwable) extends Exception(cause)
+cause: Throwable) extends Exception(cause) {

Review Comment:
   `parse_json` is a new function and not released yet, we can still change it. 
It looks wrong to throw `BadRecordException` as a user-facing error. cc 
@gene-db 
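   
   The control-flow pattern under discussion — the low-level parser throws a cheap internal exception, and only the wrapper decides whether a full user-facing error is needed — can be sketched like this (hypothetical names; not the actual UnivocityParser/FailureSafeParser code):
   
   ```python
   class BadRecord(Exception):
       """Internal control-flow signal; callers unwrap it, users never see it."""
       def __init__(self, record, cause):
           super().__init__(str(cause))
           self.record = record
           self.cause = cause

   def underlying_parse(line):
       # The underlying parser signals failure cheaply instead of building a
       # full user-facing error for every bad record.
       if not line.isdigit():
           raise BadRecord(line, ValueError(f"cannot parse {line!r} as int"))
       return int(line)

   def failure_safe_parse(line, mode="PERMISSIVE"):
       try:
           return underlying_parse(line)
       except BadRecord as e:
           if mode == "PERMISSIVE":
               return None    # tolerate the bad record
           raise e.cause      # FAILFAST: surface the adequate, user-facing error

   print(failure_safe_parse("42"))    # parses normally
   print(failure_safe_parse("oops"))  # PERMISSIVE mode tolerates the record
   ```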






Re: [PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46412:
URL: https://github.com/apache/spark/pull/46412#issuecomment-2097332326

   For the record, the very next commit (SPARK-48147) shows that `build` step 
(with 10 sub test pipelines) is skipped completely and successfully.
   
   https://github.com/apache/spark/assets/9700541/7574d9e6-29ab-498d-9bac-afbbeb361583
   





Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on code in PR #46368:
URL: https://github.com/apache/spark/pull/46368#discussion_r1591746209


##
python/pyspark/sql/connect/client/core.py:
##
@@ -1157,6 +1163,20 @@ def _analyze_plan_request_with_metadata(self) -> 
pb2.AnalyzePlanRequest:
 req.user_context.user_id = self._user_id
 return req
 
+def _sizeInBytes(self, relation: pb2.Relation) -> AnalyzeResult:

Review Comment:
   `_size_in_bytes`






Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on code in PR #46368:
URL: https://github.com/apache/spark/pull/46368#discussion_r1591746323


##
python/pyspark/sql/dataframe.py:
##
@@ -657,6 +657,19 @@ def printSchema(self, level: Optional[int] = None) -> None:
 """
 ...
 
+@dispatch_df_method
+def sizeInBytesApproximation(self) -> int:
+"""Return approximate size in bytes of a DataFrame.
+
+.. versionadded:: 4.0.0
+
+Returns
+---
+int
+Size in bytes estimated from optimized plan

Review Comment:
   can we add examples






Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on code in PR #46368:
URL: https://github.com/apache/spark/pull/46368#discussion_r1591745159


##
.gitignore:
##
@@ -26,6 +26,7 @@
 .scala_dependencies
 .settings
 .vscode
+.dir-locals.el

Review Comment:
   let's remove this






Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on PR #46404:
URL: https://github.com/apache/spark/pull/46404#issuecomment-2097319430

   Can you fill the PR description please





Re: [PR] [SPARK-48147][SS][CONNECT] Remove client side listeners when local Spark session is deleted [spark]

2024-05-06 Thread via GitHub


HyukjinKwon closed pull request #46406: [SPARK-48147][SS][CONNECT] Remove 
client side listeners when local Spark session is deleted
URL: https://github.com/apache/spark/pull/46406





Re: [PR] [SPARK-48147][SS][CONNECT] Remove client side listeners when local Spark session is deleted [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on PR #46406:
URL: https://github.com/apache/spark/pull/46406#issuecomment-2097317802

   Merged to master.





Re: [PR] [SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply being semantic equal to add/multiply [spark]

2024-05-06 Thread via GitHub


db-scnakandala commented on PR #46414:
URL: https://github.com/apache/spark/pull/46414#issuecomment-2097313695

   cc: @cloud-fan 





Re: [PR] SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply being semantic equal to add/multiply [spark]

2024-05-06 Thread via GitHub


db-scnakandala closed pull request #46413: SPARK-48035][SQL][FOLLOWUP] Fix 
try_add/try_multiply being semantic equal to add/multiply
URL: https://github.com/apache/spark/pull/46413





Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]

2024-05-06 Thread via GitHub


panbingkun commented on code in PR #46402:
URL: https://github.com/apache/spark/pull/46402#discussion_r1591736775


##
dev/test-dependencies.sh:
##
@@ -31,7 +31,7 @@ export LC_ALL=C
 # NOTE: These should match those in the release publishing script, and be kept 
in sync with
 #   dev/create-release/release-build.sh
 HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \
--Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud"
+-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler"

Review Comment:
   > ~Like `Kafka` module, we should not include this here.~
   > 
   > Do you know how we skip Kafka module's dependency here?
   
   I already know about it. It is implemented through `maven scope` 
overloading. Let me try.






Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on code in PR #46402:
URL: https://github.com/apache/spark/pull/46402#discussion_r1591736077


##
connector/profiler/pom.xml:
##
@@ -44,7 +44,7 @@
 
   me.bechberger
   ap-loader-all
-  3.0-8
+  3.0-9

Review Comment:
   Could you proceed this dependency update separately?






Re: [PR] [SPARK-48131][Core] Unify MDC key `mdc.taskName` and `task_name` [spark]

2024-05-06 Thread via GitHub


gengliangwang commented on PR #46386:
URL: https://github.com/apache/spark/pull/46386#issuecomment-2097287397

   > That change was not visible to end users, as there was not release made - 
right ?
   
   Yes you are right. 
   Are you ok with the changes in this PR? If not, please let me know your 
preference and I can check with @cloud-fan to see if we still have time to make 
further changes.
   





Re: [PR] [SPARK-48035][SQL] Fix try_add/try_multiply being semantic equal to add/multiply [spark]

2024-05-06 Thread via GitHub


db-scnakandala commented on code in PR #46307:
URL: https://github.com/apache/spark/pull/46307#discussion_r1591732236


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -452,13 +452,14 @@ case class Add(
 copy(left = newLeft, right = newRight)
 
   override lazy val canonicalized: Expression = {
-// TODO: do not reorder consecutive `Add`s with different `evalMode`
-val reorderResult = buildCanonicalizedPlan(
+val evalModes = collectEvalModes(this, {case Add(_, _, evalMode) => 
Seq(evalMode)})
+lazy val reorderResult = buildCanonicalizedPlan(
   { case Add(l, r, _) => Seq(l, r) },

Review Comment:
   That is neat! I will create a follow-up PR.






Re: [PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46412:
URL: https://github.com/apache/spark/pull/46412#issuecomment-2097284326

   Merged to master.





Re: [PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun closed pull request #46412: [SPARK-48153][INFRA] Run `build` job 
of `build_and_test.yml` only if needed
URL: https://github.com/apache/spark/pull/46412





Re: [PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46412:
URL: https://github.com/apache/spark/pull/46412#issuecomment-2097282839

   Thank you, @HyukjinKwon !





Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]

2024-05-06 Thread via GitHub


panbingkun commented on code in PR #46402:
URL: https://github.com/apache/spark/pull/46402#discussion_r1591730444


##
dev/test-dependencies.sh:
##
@@ -31,7 +31,7 @@ export LC_ALL=C
 # NOTE: These should match those in the release publishing script, and be kept 
in sync with
 #   dev/create-release/release-build.sh
 HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \
--Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud"
+-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler"

Review Comment:
   Yeah, let me add a detailed guide in the documentation.






Re: [PR] [SPARK-48141][TEST] Update the Oracle docker image version used for test and integration to use Oracle Database 23ai Free [spark]

2024-05-06 Thread via GitHub


yaooqinn commented on code in PR #46399:
URL: https://github.com/apache/spark/pull/46399#discussion_r1591728412


##
.github/workflows/build_and_test.yml:
##
@@ -929,7 +929,7 @@ jobs:
   HIVE_PROFILE: hive2.3
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
-  ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-free:23.3
+  ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-free:23.4

Review Comment:
   +1 for removal.






Re: [PR] [SPARK-48112][CONNECT] Expose session in SparkConnectPlanner to plugins [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46363:
URL: https://github.com/apache/spark/pull/46363#issuecomment-2097265334

   Thank you. +1, LGTM.





Re: [PR] [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that requires JVM access [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46411:
URL: https://github.com/apache/spark/pull/46411#issuecomment-2097263320

   Merged to branch-3.5.





Re: [PR] [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that requires JVM access [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun closed pull request #46411: 
[SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that 
requires JVM access
URL: https://github.com/apache/spark/pull/46411





Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on code in PR #46402:
URL: https://github.com/apache/spark/pull/46402#discussion_r1591726179


##
dev/test-dependencies.sh:
##
@@ -31,7 +31,7 @@ export LC_ALL=C
 # NOTE: These should match those in the release publishing script, and be kept 
in sync with
 #   dev/create-release/release-build.sh
 HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \
--Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud"
+-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler"

Review Comment:
   Yes, it's identical to Apache Spark's Kafka module and Hadoop Cloud module:
   
   - 
https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10_2.13/3.5.1
   - 
https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud_2.13/3.5.1






Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]

2024-05-06 Thread via GitHub


panbingkun commented on code in PR #46402:
URL: https://github.com/apache/spark/pull/46402#discussion_r1591725780


##
dev/test-dependencies.sh:
##
@@ -31,7 +31,7 @@ export LC_ALL=C
 # NOTE: These should match those in the release publishing script, and be kept 
in sync with
 #   dev/create-release/release-build.sh
 HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \
--Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud"
+-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler"

Review Comment:
   > ~Like `Kafka` module, we should not include this here.~
   > 
   > Do you know how we skip Kafka module's dependency here?
   
   Let me investigate it.






Re: [PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46412:
URL: https://github.com/apache/spark/pull/46412#issuecomment-2097252456

   Could you review this PR, @HyukjinKwon ?





[PR] [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun opened a new pull request, #46412:
URL: https://github.com/apache/spark/pull/46412

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48152][BUILD] Make the module `spark-profiler` as a part of Spark release [spark]

2024-05-06 Thread via GitHub


panbingkun commented on code in PR #46402:
URL: https://github.com/apache/spark/pull/46402#discussion_r1591720244


##
dev/test-dependencies.sh:
##
@@ -31,7 +31,7 @@ export LC_ALL=C
 # NOTE: These should match those in the release publishing script, and be kept 
in sync with
 #   dev/create-release/release-build.sh
 HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \
--Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud"
+-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler"

Review Comment:
   Okay, I will remove it.
   But I have a question: if the end user wants to use `spark-profiler` to 
analyze the executor, do they download 'me.bechberger:ap-loader-all:xxx' from 
the Maven central repository and use it together with the `spark-profiler` 
module? If that's how it's intended to be used, I'm fine with it.
   Maybe we need to add a detailed guide in some document?






Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-05-06 Thread via GitHub


HeartSaVioR commented on code in PR #45376:
URL: https://github.com/apache/spark/pull/45376#discussion_r1591717075


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUpdateEventTimeWatermarkColumn.scala:
##
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.plans.logical.{EventTimeWatermark, 
LogicalPlan, UpdateEventTimeWatermarkColumn}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.errors.QueryCompilationErrors
+
+/**
+ * Extracts the watermark delay and adds it to the 
UpdateEventTimeWatermarkColumn
+ * logical node (if such a node is present). 
[[UpdateEventTimeWatermarkColumn]] node updates
+ * the eventTimeColumn for upstream operators.
+ *
+ * If the logical plan contains a [[UpdateEventTimeWatermarkColumn]] node, but 
no watermark
+ * has been defined, the query will fail with a compilation error.
+ */
+object ResolveUpdateEventTimeWatermarkColumn extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsDown {

Review Comment:
   Please use the pruning variant: add a tree pattern for this operator and 
check for it with containsPattern.
   
   For example, here's the code snippet from TimeWindowing.
   
   ```
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsUpWithPruning(
   _.containsPattern(TIME_WINDOW), ruleId) {
   ```
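As a hedged illustration of why the pruning variant helps, here is a self-contained toy model in plain Java (not Spark's actual `TreeNode`/`TreePattern` classes; every name below is invented for the sketch): each node reports which patterns occur in its subtree, and the transform skips any subtree that cannot contain the target pattern, analogous to `resolveOperatorsUpWithPruning(_.containsPattern(TIME_WINDOW), ruleId)`.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

// Toy tree with pattern lookup. In Catalyst the pattern bits are cached
// per node; here we recompute them for simplicity.
enum Pattern { TIME_WINDOW, FILTER, PROJECT }

record Node(String name, Pattern pattern, List<Node> children) {
    // All patterns occurring in this node or any descendant.
    Set<Pattern> subtreePatterns() {
        EnumSet<Pattern> all = EnumSet.of(pattern);
        for (Node c : children) {
            all.addAll(c.subtreePatterns());
        }
        return all;
    }
}

class Pruning {
    static int visits = 0;  // instrumentation: how many nodes the transform touched

    // Bottom-up transform that descends only into subtrees containing `target`.
    static Node transformUpWithPruning(Node n, Pattern target, UnaryOperator<Node> rule) {
        if (!n.subtreePatterns().contains(target)) {
            return n;  // prune: nothing in this subtree can match
        }
        visits++;
        List<Node> newChildren = n.children().stream()
                .map(c -> transformUpWithPruning(c, target, rule))
                .toList();
        return rule.apply(new Node(n.name(), n.pattern(), newChildren));
    }
}
```

On a tree where one branch lacks the target pattern, that branch is never visited at all, which is exactly the saving the pruning variant buys on large plans.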






Re: [PR] [SPARK-48035][SQL] Fix try_add/try_multiply being semantic equal to add/multiply [spark]

2024-05-06 Thread via GitHub


cloud-fan commented on code in PR #46307:
URL: https://github.com/apache/spark/pull/46307#discussion_r1591716254


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -452,13 +452,14 @@ case class Add(
 copy(left = newLeft, right = newRight)
 
   override lazy val canonicalized: Expression = {
-// TODO: do not reorder consecutive `Add`s with different `evalMode`
-val reorderResult = buildCanonicalizedPlan(
+val evalModes = collectEvalModes(this, {case Add(_, _, evalMode) => 
Seq(evalMode)})
+lazy val reorderResult = buildCanonicalizedPlan(
   { case Add(l, r, _) => Seq(l, r) },

Review Comment:
   shall we simply add a check here? `case Add(l, r, em) if em == evalMode`
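A minimal sketch of the suggested guard, using toy Java types rather than Catalyst (`evalMode` is modeled as a plain string standing in for LEGACY/ANSI/TRY): operand collection recurses only into `Add` nodes whose eval mode matches the root's, so a `try_add` nested under an ANSI `add` is treated as one opaque operand instead of being flattened into the chain.

```java
import java.util.ArrayList;
import java.util.List;

// Toy expression tree; names invented for the sketch.
sealed interface Expr permits Add, Leaf {}
record Leaf(String name) implements Expr {}
record Add(Expr left, Expr right, String evalMode) implements Expr {}

class Canonicalize {
    // Collect the reorderable operands of an Add chain. The guard
    // `a.evalMode().equals(evalMode)` mirrors the suggested
    // `case Add(l, r, em) if em == evalMode`: an Add with a different
    // eval mode is kept whole instead of being flattened.
    static List<Expr> operands(Expr e, String evalMode) {
        List<Expr> out = new ArrayList<>();
        if (e instanceof Add a && a.evalMode().equals(evalMode)) {
            out.addAll(operands(a.left(), evalMode));
            out.addAll(operands(a.right(), evalMode));
        } else {
            out.add(e);
        }
        return out;
    }
}
```

With this guard, two expressions that mix eval modes differently collect different operand sets, so they no longer canonicalize to the same form.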






Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-05-06 Thread via GitHub


HeartSaVioR commented on code in PR #45376:
URL: https://github.com/apache/spark/pull/45376#discussion_r1591716219


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUpdateEventTimeWatermarkColumn.scala:
##
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.plans.logical.{EventTimeWatermark, 
LogicalPlan, UpdateEventTimeWatermarkColumn}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.errors.QueryCompilationErrors
+
+/**
+ * Extracts the watermark delay and adds it to the 
UpdateEventTimeWatermarkColumn
+ * logical node (if such a node is present). 
[[UpdateEventTimeWatermarkColumn]] node updates
+ * the eventTimeColumn for upstream operators.
+ *
+ * If the logical plan contains a [[UpdateEventTimeWatermarkColumn]] node, but 
no watermark
+ * has been defined, the query will fail with a compilation error.
+ */
+object ResolveUpdateEventTimeWatermarkColumn extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsDown {
+case u: UpdateEventTimeWatermarkColumn =>

Review Comment:
   It is also probably good practice to make the rule match at most once, so it 
no longer matches after it has been applied. E.g. if we set the delay in the 
rule, check for the delay in the match condition.






Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-05-06 Thread via GitHub


HeartSaVioR commented on code in PR #45376:
URL: https://github.com/apache/spark/pull/45376#discussion_r1591715505


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUpdateEventTimeWatermarkColumn.scala:
##
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.plans.logical.{EventTimeWatermark, 
LogicalPlan, UpdateEventTimeWatermarkColumn}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.errors.QueryCompilationErrors
+
+/**
+ * Extracts the watermark delay and adds it to the 
UpdateEventTimeWatermarkColumn
+ * logical node (if such a node is present). 
[[UpdateEventTimeWatermarkColumn]] node updates
+ * the eventTimeColumn for upstream operators.
+ *
+ * If the logical plan contains a [[UpdateEventTimeWatermarkColumn]] node, but 
no watermark
+ * has been defined, the query will fail with a compilation error.
+ */
+object ResolveUpdateEventTimeWatermarkColumn extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsDown {
+case u: UpdateEventTimeWatermarkColumn =>

Review Comment:
   If we are not yet sure about the child (or sub-tree) at this point, we 
shouldn't apply this rule. Use `if u.childrenResolved` to apply the rule 
selectively.
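The `childrenResolved` guard can be sketched with toy types (plain Java, not Spark's analyzer; all names are invented): the rule returns the node unchanged while any descendant is still unresolved, and a later pass picks it up once resolution completes.

```java
import java.util.List;

// Toy plan node: `resolvedSelf` marks this node as resolved; the whole
// subtree is resolved only when every descendant is too.
record Plan(String name, boolean resolvedSelf, List<Plan> children) {
    boolean childrenResolved() {
        return children.stream()
                .allMatch(c -> c.resolvedSelf() && c.childrenResolved());
    }
}

class RewriteRule {
    static Plan apply(Plan p) {
        if (!p.childrenResolved()) {
            return p;  // sub-tree not settled yet; a later pass will retry
        }
        return new Plan(p.name() + "#rewritten", p.resolvedSelf(), p.children());
    }
}
```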






Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-05-06 Thread via GitHub


HeartSaVioR commented on code in PR #45376:
URL: https://github.com/apache/spark/pull/45376#discussion_r1591714252


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -331,6 +331,7 @@ class Analyzer(override val catalogManager: CatalogManager) 
extends RuleExecutor
   Seq(
 ResolveWithCTE,
 ExtractDistributedSequenceID) ++
+  Seq(ResolveUpdateEventTimeWatermarkColumn) ++

Review Comment:
   It's not guaranteed that all resolutions happen in one pass; fixedPoint 
means applying the set of rules iteratively until the plan stops changing. That 
said, we still shouldn't assume that the child is resolved, and should only 
make the rule effective when the child is resolved. I'll leave a comment.
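A self-contained sketch of the fixedPoint behavior described here (plain Java strings standing in for plans, not Spark's `RuleExecutor`; rule names are invented): the batch re-runs its rules until a full pass changes nothing, so a rule that misses in one iteration can still fire in a later one, after another rule has done its part.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Minimal fixed-point executor over string "plans".
class FixedPoint {
    static int iterations = 0;

    static String run(String plan, List<UnaryOperator<String>> rules, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            iterations++;
            String before = plan;
            for (UnaryOperator<String> rule : rules) {
                plan = rule.apply(plan);
            }
            if (plan.equals(before)) {
                break;  // fixed point: a full pass changed nothing
            }
        }
        return plan;
    }
}
```

In the usage below, the parent-resolving rule does nothing on the first pass because its child is still unresolved; it succeeds on the second pass, after the child rule has run, which is why a rule must not assume its children are resolved within any single iteration.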






Re: [PR] [SPARK-48152][BUILD] Make the module `spark-profiler` as a part of Spark release [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on code in PR #46402:
URL: https://github.com/apache/spark/pull/46402#discussion_r1591713237


##
dev/test-dependencies.sh:
##
@@ -31,7 +31,7 @@ export LC_ALL=C
 # NOTE: These should match those in the release publishing script, and be kept 
in sync with
 #   dev/create-release/release-build.sh
 HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \
--Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud"
+-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler"

Review Comment:
   ```
   $ grep kafka dev/deps/spark-deps-hadoop-3-hive-2.3 | wc -l
  0
   ```






Re: [PR] [SPARK-48152][BUILD] Make the module `spark-profiler` as a part of Spark release [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on code in PR #46402:
URL: https://github.com/apache/spark/pull/46402#discussion_r1591711449


##
dev/test-dependencies.sh:
##
@@ -31,7 +31,7 @@ export LC_ALL=C
 # NOTE: These should match those in the release publishing script, and be kept 
in sync with
 #   dev/create-release/release-build.sh
 HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \
--Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud"
+-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler"

Review Comment:
   ~Like `Kafka` module, we should not include this here.~ 
   
   Do you know how we skip Kafka module's dependency here?






Re: [PR] [SPARK-48152][BUILD] Make the module `spark-profiler` as a part of Spark release [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on code in PR #46402:
URL: https://github.com/apache/spark/pull/46402#discussion_r1591711449


##
dev/test-dependencies.sh:
##
@@ -31,7 +31,7 @@ export LC_ALL=C
 # NOTE: These should match those in the release publishing script, and be kept 
in sync with
 #   dev/create-release/release-build.sh
 HADOOP_MODULE_PROFILES="-Phive-thriftserver -Pkubernetes -Pyarn -Phive \
--Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud"
+-Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud -Pjvm-profiler"

Review Comment:
   Like `Kafka` module, we should not include this here.






Re: [PR] [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46410:
URL: https://github.com/apache/spark/pull/46410#issuecomment-2097195862

   Merged to master.





Re: [PR] [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun closed pull request #46410: [SPARK-48151][INFRA] 
`build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5`
URL: https://github.com/apache/spark/pull/46410





[PR] [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that requires JVM access [spark]

2024-05-06 Thread via GitHub


HyukjinKwon opened a new pull request, #46411:
URL: https://github.com/apache/spark/pull/46411

   ### What changes were proposed in this pull request?
   
   This PR is a followup of https://github.com/apache/spark/pull/46334 that 
missed one more test case.
   
   ### Why are the changes needed?
   
   See https://github.com/apache/spark/pull/46334
   
   ### Does this PR introduce _any_ user-facing change?
   
   See https://github.com/apache/spark/pull/46334
   
   ### How was this patch tested?
   
   Manually
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-48150][SQL] try_parse_json output should be declared as nullable [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46409:
URL: https://github.com/apache/spark/pull/46409#issuecomment-2097193827

   Wow! Thank you for the fix.





Re: [PR] [SPARK-48150][SQL] try_parse_json output should be declared as nullable [spark]

2024-05-06 Thread via GitHub


JoshRosen commented on PR #46409:
URL: https://github.com/apache/spark/pull/46409#issuecomment-2097192426

   It's not flaky, it's a legitimate test failure:
   
   ```
   [info] - function_try_parse_json *** FAILED *** (5 milliseconds)
   [info]   Expected and actual plans do not match:
   [info]   
   [info]   === Expected Plan ===
   [info]   Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.variant.VariantExpressionEvalUtils$, 
VariantType, parseJson, g#0, false, StringType, BooleanType, true, false, true) 
AS try_parse_json(g)#0]
   [info]   +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
   [info]   
   [info]   
   [info]   === Actual Plan ===
   [info]   Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.variant.VariantExpressionEvalUtils$, 
VariantType, parseJson, g#0, false, StringType, BooleanType, true, true, true) 
AS try_parse_json(g)#0]
   [info]   +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0] 
(ProtoToParsedPlanTestSuite.scala:205)
   ```
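The expected and actual plans differ only in one boolean flag of the `staticinvoke` call, which is the nullability of the result. As a hedged illustration (a Python sketch of the semantics, not Spark's implementation), `try_parse_json` must declare a nullable output because invalid input maps to NULL instead of raising:

```python
import json

def try_parse_json(s):
    # Hypothetical sketch of try_parse_json semantics (not Spark's code):
    # return the parsed value, or None (SQL NULL) when parsing fails.
    if s is None:
        return None
    try:
        return json.loads(s)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return None  # invalid JSON maps to NULL, hence the nullable output

assert try_parse_json('{"a": 1}') == {"a": 1}
assert try_parse_json("not json") is None  # parse_json would error here
```

Because some inputs legally produce NULL, an optimizer that marks the expression non-nullable would mis-plan downstream operators.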





Re: [PR] [SPARK-47777][PYTHON][SS][TESTS] Add spark connect test for python streaming data source [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on PR #45950:
URL: https://github.com/apache/spark/pull/45950#issuecomment-2097190957

   follow 
https://github.com/apache/spark/blob/master/.github/workflows/build_python_connect.yml#L80-L113
 to reproduce the failure





Re: [PR] [SPARK-48142][PYTHON][CONNECT][TESTS] Enable `CogroupedApplyInPandasTests.test_wrong_args` [spark]

2024-05-06 Thread via GitHub


zhengruifeng commented on PR #46397:
URL: https://github.com/apache/spark/pull/46397#issuecomment-2097185722

   thanks @xinrong-meng 
   
   merged to master





Re: [PR] [SPARK-48142][PYTHON][CONNECT][TESTS] Enable `CogroupedApplyInPandasTests.test_wrong_args` [spark]

2024-05-06 Thread via GitHub


zhengruifeng closed pull request #46397: [SPARK-48142][PYTHON][CONNECT][TESTS] 
Enable `CogroupedApplyInPandasTests.test_wrong_args`
URL: https://github.com/apache/spark/pull/46397





Re: [PR] [SPARK-48134][CORE] Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-06 Thread via GitHub


panbingkun commented on PR #46390:
URL: https://github.com/apache/spark/pull/46390#issuecomment-2097183311

   @gengliangwang 
   All done.
   Thanks.





Re: [PR] [SPARK-48131][Core] Unify MDC key `mdc.taskName` and `task_name` [spark]

2024-05-06 Thread via GitHub


mridulm commented on PR #46386:
URL: https://github.com/apache/spark/pull/46386#issuecomment-2097182888

   That change was not visible to end users, as there was no release made, right?





Re: [PR] [SPARK-48035][SQL] Fix try_add/try_multiply being semantic equal to add/multiply [spark]

2024-05-06 Thread via GitHub


HyukjinKwon closed pull request #46307: [SPARK-48035][SQL] Fix 
try_add/try_multiply being semantic equal to add/multiply
URL: https://github.com/apache/spark/pull/46307





Re: [PR] [SPARK-48035][SQL] Fix try_add/try_multiply being semantic equal to add/multiply [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on PR #46307:
URL: https://github.com/apache/spark/pull/46307#issuecomment-2097170436

   Merged to master.
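   For context, `try_add` and `add` differ in overflow behavior, so treating them as semantically equal (e.g. for expression deduplication) can silently change query results. A hedged Python sketch of the distinction (illustrative only, not Spark's implementation):

```python
# Sketch contrasting ANSI `add` and `try_add` semantics for 32-bit
# integers: `add` raises on overflow, `try_add` returns None (SQL NULL).
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def add(a, b):
    r = a + b
    if not (INT_MIN <= r <= INT_MAX):
        raise OverflowError("integer overflow")
    return r

def try_add(a, b):
    try:
        return add(a, b)
    except OverflowError:
        return None  # NULL instead of failing the query

# Same inputs, different outcomes on overflow: the two expressions must
# not be considered semantically equal.
assert add(1, 2) == try_add(1, 2) == 3
assert try_add(INT_MAX, 1) is None
```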





Re: [PR] [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46410:
URL: https://github.com/apache/spark/pull/46410#issuecomment-2097169044

   Thank you, @HyukjinKwon !





Re: [PR] [SPARK-48150][SQL] try_parse_json output should be declared as nullable [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun commented on PR #46409:
URL: https://github.com/apache/spark/pull/46409#issuecomment-2097168838

   According to the CI result, `ProtoToParsedPlanTestSuite` seems to have become flaky too.
   ```
   [info] *** 1 TEST FAILED ***
   [error] Failed tests:
   [error]  org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite
   ```





Re: [PR] [MINOR][SQL][DOCS] Correct comments for UnresolvedRelation [spark]

2024-05-06 Thread via GitHub


HyukjinKwon closed pull request #46319: [MINOR][SQL][DOCS] Correct comments for 
UnresolvedRelation
URL: https://github.com/apache/spark/pull/46319





Re: [PR] Add collation support to `mode` [spark]

2024-05-06 Thread via GitHub


GideonPotok closed pull request #46403: Add collation support to `mode`
URL: https://github.com/apache/spark/pull/46403





Re: [PR] [MINOR][SQL][DOCS] Correct comments for UnresolvedRelation [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on PR #46319:
URL: https://github.com/apache/spark/pull/46319#issuecomment-2097167163

   Merged to master.





[PR] [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` [spark]

2024-05-06 Thread via GitHub


dongjoon-hyun opened a new pull request, #46410:
URL: https://github.com/apache/spark/pull/46410

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48105][SS] Fix the race condition between state store unloading and snapshotting [spark]

2024-05-06 Thread via GitHub


HeartSaVioR commented on PR #46351:
URL: https://github.com/apache/spark/pull/46351#issuecomment-2097153000

   3.5 has a conflict; 3.4 probably has one as well.
   @huanliwang-db Could you please help submit backport PRs for these branches? Thanks in advance!





Re: [PR] [SPARK-48105][SS] Fix the race condition between state store unloading and snapshotting [spark]

2024-05-06 Thread via GitHub


HeartSaVioR closed pull request #46351: [SPARK-48105][SS] Fix the race 
condition between state store unloading and snapshotting
URL: https://github.com/apache/spark/pull/46351





Re: [PR] [SPARK-48105][SS] Fix the race condition between state store unloading and snapshotting [spark]

2024-05-06 Thread via GitHub


HeartSaVioR commented on PR #46351:
URL: https://github.com/apache/spark/pull/46351#issuecomment-2097151349

   Thanks! Merging to master/3.5/3.4.





Re: [PR] [SPARK-47777][PYTHON][SS][TESTS] Add spark connect test for python streaming data source [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on PR #45950:
URL: https://github.com/apache/spark/pull/45950#issuecomment-2097139796

   py4j shouldn't be referenced in the Connect tests. Can we move those references, and import py4j only where it's actually used?





Re: [PR] [SPARK-48112][CONNECT] Expose session in SparkConnectPlanner to plugins [spark]

2024-05-06 Thread via GitHub


HyukjinKwon closed pull request #46363: [SPARK-48112][CONNECT] Expose session 
in SparkConnectPlanner to plugins
URL: https://github.com/apache/spark/pull/46363





Re: [PR] [SPARK-46848] XML: Enhance XML bad record handling with partial results support [spark]

2024-05-06 Thread via GitHub


github-actions[bot] closed pull request #44875: [SPARK-46848] XML: Enhance XML 
bad record handling with partial results support
URL: https://github.com/apache/spark/pull/44875





Re: [PR] [WIP] [SPARK-46884] Spark Connect - ExecutePlanRequest new property - job description [spark]

2024-05-06 Thread via GitHub


github-actions[bot] commented on PR #44909:
URL: https://github.com/apache/spark/pull/44909#issuecomment-2097136401

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-48112][CONNECT] Expose session in SparkConnectPlanner to plugins [spark]

2024-05-06 Thread via GitHub


HyukjinKwon commented on PR #46363:
URL: https://github.com/apache/spark/pull/46363#issuecomment-2097136188

   Merged to master.




