Re: [PR] [SPARK-48180][SQL] Improve error when UDTF call with TABLE arg forgets parentheses around multiple PARTITION/ORDER BY exprs [spark]

2024-05-08 Thread via GitHub
dtenedor commented on PR #46451: URL: https://github.com/apache/spark/pull/46451#issuecomment-2101043782 @HyukjinKwon I fixed the test failures, it should work now :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [PR] [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46477: URL: https://github.com/apache/spark/pull/46477#issuecomment-2101263605 For the record, the result looks clean and good to me. - https://github.com/apache/spark/actions/workflows/build_python_3.10.yml -

Re: [PR] [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once. [spark]

2024-05-08 Thread via GitHub
allisonwang-db commented on code in PR #46481: URL: https://github.com/apache/spark/pull/46481#discussion_r1594584246 ## sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala: ## @@ -326,8 +326,11 @@ class

Re: [PR] [SPARK-48197][SQL] Avoid assert error for invalid lambda function [spark]

2024-05-08 Thread via GitHub
allisonwang-db commented on code in PR #46475: URL: https://github.com/apache/spark/pull/46475#discussion_r1594717536 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala: ## @@ -955,7 +955,14 @@ object FunctionRegistry { since:

Re: [PR] [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46477: URL: https://github.com/apache/spark/pull/46477#issuecomment-2100966088 Could you review this PR, @viirya ?

Re: [PR] [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46479: URL: https://github.com/apache/spark/pull/46479#issuecomment-2101060304 Could you review this PR when you have some time, @huaxingao ?

Re: [PR] [SPARK-48105][SS][3.5] Fix the race condition between state store unloading and snapshotting [spark]

2024-05-08 Thread via GitHub
huanliwang-db commented on PR #46415: URL: https://github.com/apache/spark/pull/46415#issuecomment-2101059726 @HeartSaVioR the test fails consistently in `[Run / Linters, licenses, dependencies and documentation

Re: [PR] [SPARK-48186][SQL] Add support for AbstractMapType [spark]

2024-05-08 Thread via GitHub
uros-db commented on PR #46458: URL: https://github.com/apache/spark/pull/46458#issuecomment-210717 @dongjoon-hyun resolved, please re-review

Re: [PR] [SPARK-48204][INFRA] Fix release script for Spark 4.0+ [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46484: URL: https://github.com/apache/spark/pull/46484#issuecomment-2101381127 Ack!

Re: [PR] [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and Daily Python CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun closed pull request #46482: [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and Daily Python CIs URL: https://github.com/apache/spark/pull/46482

Re: [PR] [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and Daily Python CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46482: URL: https://github.com/apache/spark/pull/46482#issuecomment-2101407969 Merged to branch-3.5.

Re: [PR] [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46388: URL: https://github.com/apache/spark/pull/46388#issuecomment-2101515455 Let me cherry-pick this to old release branches to apply the same policy there.

Re: [PR] [SPARK-48180][SQL] Improve error when UDTF call with TABLE arg forgets parentheses around multiple PARTITION/ORDER BY exprs [spark]

2024-05-08 Thread via GitHub
allisonwang-db commented on code in PR #46451: URL: https://github.com/apache/spark/pull/46451#discussion_r1594719424 ## sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonUDTFSuite.scala: ## @@ -363,4 +364,29 @@ class PythonUDTFSuite extends QueryTest with

Re: [PR] [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs [spark]

2024-05-08 Thread via GitHub
viirya commented on PR #46477: URL: https://github.com/apache/spark/pull/46477#issuecomment-2100974388 Looks good to me.

Re: [PR] [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46477: URL: https://github.com/apache/spark/pull/46477#issuecomment-2100976472 Thank you so much, @viirya !

Re: [PR] [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun closed pull request #46477: [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs URL: https://github.com/apache/spark/pull/46477

Re: [PR] [SPARK-48169][SPARK-48143][SQL] Revert BadRecordException optimizations [spark]

2024-05-08 Thread via GitHub
vladimirg-db commented on PR #46478: URL: https://github.com/apache/spark/pull/46478#issuecomment-2101109487 Sure @dongjoon-hyun! Updated the description

Re: [PR] [SPARK-48201][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods [spark]

2024-05-08 Thread via GitHub
xinrong-meng commented on PR #46416: URL: https://github.com/apache/spark/pull/46416#issuecomment-2101107224 Let's add `[DOCS]` to the pr title please

Re: [PR] [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once. [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46481: URL: https://github.com/apache/spark/pull/46481#issuecomment-2101258433 cc @HeartSaVioR and @allisonwang-db from #45977

Re: [PR] [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once. [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on code in PR #46481: URL: https://github.com/apache/spark/pull/46481#discussion_r1594531248 ## sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala: ## @@ -326,8 +326,11 @@ class

Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub
hvanhovell commented on code in PR #46245: URL: https://github.com/apache/spark/pull/46245#discussion_r1594250192 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on code in PR #46464: URL: https://github.com/apache/spark/pull/46464#discussion_r1594250748 ## .github/workflows/build_and_test.yml: ## @@ -644,6 +644,7 @@ jobs: python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme

Re: [PR] [SPARK-47421][SQL] Add collation support for URL expressions [spark]

2024-05-08 Thread via GitHub
uros-db commented on PR #46460: URL: https://github.com/apache/spark/pull/46460#issuecomment-2100872209 note: collation awareness for these pass-through Spark expressions required modifying query plans in `query-tests/explain-results/…` in order to accommodate using

Re: [PR] [SPARK-48198][BUILD] Upgrade jackson to 2.17.1 [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun closed pull request #46476: [SPARK-48198][BUILD] Upgrade jackson to 2.17.1 URL: https://github.com/apache/spark/pull/46476

Re: [PR] [SPARK-48198][BUILD] Upgrade jackson to 2.17.1 [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46476: URL: https://github.com/apache/spark/pull/46476#issuecomment-2100889456 Merged to master for Apache Spark 4.0.0-preview.

Re: [PR] [SPARK-48168][SQL] Add bitwise shifting operators support [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46440: URL: https://github.com/apache/spark/pull/46440#issuecomment-2100995752 It seems that TPCDS golden files are affected still. ``` [info] *** 23 TESTS FAILED *** [error] Failed: Total 3499, Failed 23, Errors 0, Passed 3476, Ignored 4 [error]

Re: [PR] [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on code in PR #46479: URL: https://github.com/apache/spark/pull/46479#discussion_r1594377340 ## .github/workflows/build_branch35_python.yml: ## @@ -0,0 +1,45 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-05-08 Thread via GitHub
chaoqin-li1123 commented on PR #45977: URL: https://github.com/apache/spark/pull/45977#issuecomment-2101230022 Yes, I notice that, will send out a fix PR today.

Re: [PR] [SPARK-47672][SQL] Avoid double eval from filter pushDown [spark]

2024-05-08 Thread via GitHub
holdenk commented on PR #45802: URL: https://github.com/apache/spark/pull/45802#issuecomment-2101227694 CC @cloud-fan do you have thoughts / cycles?

Re: [PR] [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun closed pull request #46480: [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI URL: https://github.com/apache/spark/pull/46480

Re: [PR] [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and Daily Python CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46482: URL: https://github.com/apache/spark/pull/46482#issuecomment-2101405841 Thank you so much, @viirya !

Re: [PR] [SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` module change [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46356: URL: https://github.com/apache/spark/pull/46356#issuecomment-2101506803 Let me cherry-pick to old release branches~

Re: [PR] [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods [spark]

2024-05-08 Thread via GitHub
allisonwang-db commented on code in PR #46416: URL: https://github.com/apache/spark/pull/46416#discussion_r1594722118 ## python/pyspark/sql/streaming/readwriter.py: ## @@ -641,8 +641,8 @@ def csv( Parameters -- -path : str or list Review

Re: [PR] [SPARK-47672][SQL] Avoid double eval from filter pushDown w/ projection pushdown [spark]

2024-05-08 Thread via GitHub
holdenk commented on PR #46143: URL: https://github.com/apache/spark/pull/46143#issuecomment-2101527557 CC @cloud-fan do you have thoughts / cycles?

Re: [PR] [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46477: URL: https://github.com/apache/spark/pull/46477#issuecomment-2100955572 I realized that new pipeline is an error-prone style. This will choose [the alternative simplest one](https://github.com/apache/spark/pull/46407#discussion_r1591586209), cc

Re: [PR] [SPARK-41794][SQL] Add `try_remainder` function and re-enable column tests [spark]

2024-05-08 Thread via GitHub
gengliangwang commented on code in PR #46434: URL: https://github.com/apache/spark/pull/46434#discussion_r1594306390 ## sql/core/src/test/scala/org/apache/spark/sql/MathFunctionsSuite.scala: ## @@ -707,6 +707,11 @@ class MathFunctionsSuite extends QueryTest with

Re: [PR] [SPARK-41794][SQL] Add `try_remainder` function and re-enable column tests [spark]

2024-05-08 Thread via GitHub
gengliangwang commented on PR #46434: URL: https://github.com/apache/spark/pull/46434#issuecomment-2100956904 @grundprinzip thanks for the work! Let's also mention the new function in https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#useful-functions-for-ansi-mode

Re: [PR] Make some corrections in the docstring of pyspark DataStreamReader methods [spark]

2024-05-08 Thread via GitHub
chloeh13q commented on PR #46416: URL: https://github.com/apache/spark/pull/46416#issuecomment-2101040390 @HyukjinKwon Yep! Here it is: https://issues.apache.org/jira/browse/SPARK-48201

Re: [PR] [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun closed pull request #46479: [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI URL: https://github.com/apache/spark/pull/46479

Re: [PR] [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46479: URL: https://github.com/apache/spark/pull/46479#issuecomment-2101095953 Merged to master.

[PR] [MINOR][TEST] fix flaky test for Python data source exactly once. [spark]

2024-05-08 Thread via GitHub
chaoqin-li1123 opened a new pull request, #46481: URL: https://github.com/apache/spark/pull/46481 ### What changes were proposed in this pull request? Fix the flakiness in python streaming source exactly once test. The last executed batch may not be recorded in query progress,

Re: [PR] [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs [spark]

2024-05-08 Thread via GitHub
viirya commented on PR #46477: URL: https://github.com/apache/spark/pull/46477#issuecomment-2101276089 Thank you @dongjoon-hyun

[PR] [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun opened a new pull request, #46483: URL: https://github.com/apache/spark/pull/46483 ### What changes were proposed in this pull request? This PR aims to run `pyspark-pandas*` of `branch-3.4` only in PR builder and Daily Python CIs. In other words, only the commit builder

Re: [PR] [SPARK-48148][CORE] JSON objects should not be modified when read as STRING [spark]

2024-05-08 Thread via GitHub
eric-maynard commented on code in PR #46408: URL: https://github.com/apache/spark/pull/46408#discussion_r1594623260 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala: ## @@ -280,13 +280,32 @@ class JacksonParser( case VALUE_STRING =>

Re: [PR] [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun closed pull request #46483: [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs URL: https://github.com/apache/spark/pull/46483

Re: [PR] [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46483: URL: https://github.com/apache/spark/pull/46483#issuecomment-2101413299 Merged to branch-3.4.

[PR] [SPARK-48205][MINOR] Remove the private[sql] modifier for Python data sources [spark]

2024-05-08 Thread via GitHub
allisonwang-db opened a new pull request, #46487: URL: https://github.com/apache/spark/pull/46487 ### What changes were proposed in this pull request? This PR removes the `private[sql]` modifier for Python data sources to make it consistent with UDFs and UDTFs. ###

Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub
hvanhovell commented on code in PR #46245: URL: https://github.com/apache/spark/pull/46245#discussion_r1594229607 ## connector/connect/common/src/main/protobuf/spark/connect/expressions.proto: ## @@ -379,6 +380,15 @@ message ScalarScalaUDF { bool nullable = 4; } +message

Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub
hvanhovell commented on code in PR #46245: URL: https://github.com/apache/spark/pull/46245#discussion_r1594227631 ## connector/connect/common/src/main/protobuf/spark/connect/expressions.proto: ## @@ -379,6 +380,15 @@ message ScalarScalaUDF { bool nullable = 4; } +message

Re: [PR] [SPARK-48197][SQL] Avoid assert error for invalid lambda function [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46475: URL: https://github.com/apache/spark/pull/46475#issuecomment-2100882656 Is the UT failure relevant, @cloud-fan ? ``` [info] *** 1 TEST FAILED *** [error] Failed: Total 10583, Failed 1, Errors 0, Passed 10582, Ignored 29 [error] Failed tests:

[PR] [SPARK-48169][SPARK-48143][SQL] Revert BadRecordException optimizations [spark]

2024-05-08 Thread via GitHub
vladimirg-db opened a new pull request, #46478: URL: https://github.com/apache/spark/pull/46478 ### What changes were proposed in this pull request? Revert BadRecordException optimizations ### Why are the changes needed? To reduce the blast radius - this will be implemented

Re: [PR] [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46479: URL: https://github.com/apache/spark/pull/46479#issuecomment-2101093865 Thank you so much always for your time, @huaxingao !

[PR] [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun opened a new pull request, #46480: URL: https://github.com/apache/spark/pull/46480 ### What changes were proposed in this pull request? This PR aims to create `build_branch34_python.yml` in order to spin off `pyspark` tests from `build_branch34.yml` Daily CI. ###

Re: [PR] [SPARK-47545][CONNECT] Dataset `observe` support for the Scala client [spark]

2024-05-08 Thread via GitHub
hvanhovell closed pull request #45701: [SPARK-47545][CONNECT] Dataset `observe` support for the Scala client URL: https://github.com/apache/spark/pull/45701

Re: [PR] [SPARK-47545][CONNECT] Dataset `observe` support for the Scala client [spark]

2024-05-08 Thread via GitHub
hvanhovell commented on PR #45701: URL: https://github.com/apache/spark/pull/45701#issuecomment-2101304531 Merging!

Re: [PR] [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46483: URL: https://github.com/apache/spark/pull/46483#issuecomment-2101331718 Could you review this backporting PR, @viirya ? Since this is applied to `master` branch successfully, I'm trying to backport this to `branch-3.4`.

Re: [PR] [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and Daily Python CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46482: URL: https://github.com/apache/spark/pull/46482#issuecomment-2101331441 Could you review this backporting PR, @viirya ? Since this is applied to `master` branch successfully, I'm trying to backport this to `branch-3.5`.

Re: [PR] [SPARK-47803][FOLLOWUP] Check nulls when casting nested type to variant. [spark]

2024-05-08 Thread via GitHub
chenhao-db commented on PR #46486: URL: https://github.com/apache/spark/pull/46486#issuecomment-2101502265 @cloud-fan could you help review? Thanks a lot!

Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub
hvanhovell commented on code in PR #46245: URL: https://github.com/apache/spark/pull/46245#discussion_r1594223201 ## sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala: ## @@ -49,6 +49,7 @@ import

Re: [PR] [SPARK-48162][SQL] Add collation support for MISC expressions [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46461: URL: https://github.com/apache/spark/pull/46461#issuecomment-2100919314 Could you take a look at the UT failures? ``` [info] *** 14 TESTS FAILED *** [error] Failed tests: [error]

Re: [PR] [SPARK-48186][SQL] Add support for AbstractMapType [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46458: URL: https://github.com/apache/spark/pull/46458#issuecomment-2100920787 Could you resolve the conflict?

Re: [PR] [SPARK-48201][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods [spark]

2024-05-08 Thread via GitHub
xinrong-meng commented on PR #46416: URL: https://github.com/apache/spark/pull/46416#issuecomment-2101105812 LGTM once CI passes thanks!

Re: [PR] [SPARK-47421][SQL] Add collation support for URL expressions [spark]

2024-05-08 Thread via GitHub
uros-db commented on PR #46460: URL: https://github.com/apache/spark/pull/46460#issuecomment-2101119735 @dongjoon-hyun please review this one

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-05-08 Thread via GitHub
chaoqin-li1123 commented on PR #45977: URL: https://github.com/apache/spark/pull/45977#issuecomment-2101253872 This is the fix https://github.com/apache/spark/pull/46481 @dongjoon-hyun

Re: [PR] [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on code in PR #46480: URL: https://github.com/apache/spark/pull/46480#discussion_r1594514741 ## .github/workflows/build_branch34_python.yml: ## @@ -0,0 +1,45 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor

Re: [PR] [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46480: URL: https://github.com/apache/spark/pull/46480#issuecomment-2101254900 Could you review this PR too, please, @huaxingao ?

[PR] [SPARK-48204][INFRA] Fix release script for Spark 4.0+ [spark]

2024-05-08 Thread via GitHub
cloud-fan opened a new pull request, #46484: URL: https://github.com/apache/spark/pull/46484 ### What changes were proposed in this pull request? Before Spark 4.0, Scala 2.12 was primary and Scala 2.13 was secondary. The release scripts build more packages (hadoop3,

Re: [PR] [SPARK-48148][CORE] JSON objects should not be modified when read as STRING [spark]

2024-05-08 Thread via GitHub
sadikovi commented on PR #46408: URL: https://github.com/apache/spark/pull/46408#issuecomment-2101372332 cc @dongjoon-hyun @HyukjinKwon

Re: [PR] [SPARK-48148][CORE] JSON objects should not be modified when read as STRING [spark]

2024-05-08 Thread via GitHub
sadikovi commented on code in PR #46408: URL: https://github.com/apache/spark/pull/46408#discussion_r1594580639 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala: ## @@ -280,13 +280,32 @@ class JacksonParser( case VALUE_STRING =>

Re: [PR] [SPARK-48204][INFRA] Fix release script for Spark 4.0+ [spark]

2024-05-08 Thread via GitHub
cloud-fan commented on PR #46484: URL: https://github.com/apache/spark/pull/46484#issuecomment-2101374437 Note: it does not fix all the issues. The next issue I'm debugging is the pyspark version number mismatch. The script produces pyspark packages with version `4.0.0.dev0` but at the end

Re: [PR] [SPARK-41794][SQL] Add `try_remainder` function and re-enable column tests [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46434: URL: https://github.com/apache/spark/pull/46434#issuecomment-2100893973 Thank you. BTW, the recent test failure is relevant? For me, it looks irrelevant. Could you take a look at? ``` [info] *** 1 TEST FAILED *** [error] Failed:

[PR] [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun opened a new pull request, #46479: URL: https://github.com/apache/spark/pull/46479 ### What changes were proposed in this pull request? This PR aims to create `build_branch35_python.yml` in order to spin off `pyspark` tests from `build_branch35.yml` Daily CI. ###

Re: [PR] [SPARK-48169][SPARK-48143][SQL] Revert BadRecordException optimizations [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46478: URL: https://github.com/apache/spark/pull/46478#issuecomment-2101213015 Thank you for adding the link, @vladimirg-db . cc @cloud-fan and @HyukjinKwon , too

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #45977: URL: https://github.com/apache/spark/pull/45977#issuecomment-2101232136 Thank you so much, @chaoqin-li1123 .

Re: [PR] [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46480: URL: https://github.com/apache/spark/pull/46480#issuecomment-2101258836 Thank you so much, @huaxingao !

[PR] [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and Daily Python CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun opened a new pull request, #46482: URL: https://github.com/apache/spark/pull/46482 ### What changes were proposed in this pull request? This PR aims to run `pyspark-pandas*` of `branch-3.5` only in PR builder and Daily Python CIs. In other words, only the commit builder

Re: [PR] [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once. [spark]

2024-05-08 Thread via GitHub
chaoqin-li1123 commented on code in PR #46481: URL: https://github.com/apache/spark/pull/46481#discussion_r1594551803 ## sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala: ## @@ -326,8 +326,11 @@ class

Re: [PR] [SPARK-48204][INFRA] Fix release script for Spark 4.0+ [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46484: URL: https://github.com/apache/spark/pull/46484#issuecomment-2101386315 Feel free to merge and proceed to the remaining release work, @cloud-fan .

Re: [PR] [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46483: URL: https://github.com/apache/spark/pull/46483#issuecomment-2101411144 Thank you so much, @viirya !

[PR] [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun opened a new pull request, #46477: URL: https://github.com/apache/spark/pull/46477 ### What changes were proposed in this pull request? This PR aims to split `build_python.yml` into per-version cron jobs. Technically, this is a revert of SPARK-48149. ###

Re: [PR] [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46480: URL: https://github.com/apache/spark/pull/46480#issuecomment-2101283450 Merged to master.

Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub
hvanhovell commented on code in PR #46245: URL: https://github.com/apache/spark/pull/46245#discussion_r1594544696 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software

[PR] Fix previous reader checks in Vectorized DELTA_BYTE_ARRAY decoder [spark]

2024-05-08 Thread via GitHub
yutsareva opened a new pull request, #46485: URL: https://github.com/apache/spark/pull/46485 ### What changes were proposed in this pull request? Fixed a check in the vectorized DELTA_BYTE_ARRAY Parquet decoder to validate that the current page reader requires a previous page

[PR] [SPARK-47803][FOLLOWUP] Check nulls when casting nested type to variant. [spark]

2024-05-08 Thread via GitHub
chenhao-db opened a new pull request, #46486: URL: https://github.com/apache/spark/pull/46486 ### What changes were proposed in this pull request? It adds null checks when accessing a nested element when casting a nested type to variant. It is necessary because the `get` API doesn't

Re: [PR] [SPARK-48205][PYTHON] Remove the private[sql] modifier for Python data sources [spark]

2024-05-08 Thread via GitHub
allisonwang-db commented on PR #46487: URL: https://github.com/apache/spark/pull/46487#issuecomment-2101538253 cc @HyukjinKwon

Re: [PR] [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods [spark]

2024-05-08 Thread via GitHub
flaviaouyang commented on code in PR #46416: URL: https://github.com/apache/spark/pull/46416#discussion_r1594729261 ## python/pyspark/sql/streaming/readwriter.py: ## @@ -641,8 +641,8 @@ def csv( Parameters -- -path : str or list Review

Re: [PR] [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository [spark]

2024-05-08 Thread via GitHub
dongjoon-hyun commented on PR #46470: URL: https://github.com/apache/spark/pull/46470#issuecomment-2101577512 This is backported to `branch-3.5` via https://github.com/apache/spark/commit/82779217b1fa1dea2b18772795969c04c1f34532

Re: [PR] [SPARK-48100][SQL] Fix issues in skipping nested structure fields not selected in schema [spark]

2024-05-08 Thread via GitHub
HyukjinKwon commented on PR #46348: URL: https://github.com/apache/spark/pull/46348#issuecomment-2101671441 ``` - select with string xml object *** FAILED *** (14 milliseconds) Failed to analyze query: org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION]

Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-08 Thread via GitHub
mkaravel commented on PR #46180: URL: https://github.com/apache/spark/pull/46180#issuecomment-2101717141 How do we name a trailing-space-insensitive collation?

Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression [spark]

2024-05-08 Thread via GitHub
GideonPotok commented on code in PR #46404: URL: https://github.com/apache/spark/pull/46404#discussion_r1594756311 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -70,20 +78,46 @@ case class Mode( buffer } -

Re: [PR] [SPARK-48205][PYTHON] Remove the private[sql] modifier for Python data sources [spark]

2024-05-08 Thread via GitHub
HyukjinKwon commented on PR #46487: URL: https://github.com/apache/spark/pull/46487#issuecomment-2101653716 Merged to master.

Re: [PR] [SPARK-48205][PYTHON] Remove the private[sql] modifier for Python data sources [spark]

2024-05-08 Thread via GitHub
HyukjinKwon closed pull request #46487: [SPARK-48205][PYTHON] Remove the private[sql] modifier for Python data sources URL: https://github.com/apache/spark/pull/46487

Re: [PR] [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects [spark]

2024-05-08 Thread via GitHub
HyukjinKwon commented on code in PR #46437: URL: https://github.com/apache/spark/pull/46437#discussion_r1594810430 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java: ## @@ -169,7 +171,16 @@ yield visitBinaryArithmetic( }

Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-08 Thread via GitHub
mkaravel commented on code in PR #46180: URL: https://github.com/apache/spark/pull/46180#discussion_r1594834488 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -117,76 +119,445 @@ public Collation( } /** - *

Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-08 Thread via GitHub
mkaravel commented on code in PR #46180: URL: https://github.com/apache/spark/pull/46180#discussion_r1594835020 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -117,76 +119,438 @@ public Collation( } /** - *

Re: [PR] [SPARK-48208][SS] Skip providing memory usage metrics from RocksDB if bounded memory usage is enabled [spark]

2024-05-08 Thread via GitHub
HeartSaVioR commented on code in PR #46491: URL: https://github.com/apache/spark/pull/46491#discussion_r1594865560 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala: ## @@ -777,10 +777,19 @@ class RocksDB(

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

2024-05-08 Thread via GitHub
ianmcook commented on code in PR #45481: URL: https://github.com/apache/spark/pull/45481#discussion_r1594880292 ## python/pyspark/sql/connect/dataframe.py: ## @@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]: assert table is not

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

2024-05-08 Thread via GitHub
ianmcook commented on code in PR #45481: URL: https://github.com/apache/spark/pull/45481#discussion_r1594881200 ## python/pyspark/sql/dataframe.py: ## @@ -6213,6 +6214,31 @@ def mapInArrow( """ ... Review Comment: Do I need `@dispatch_df_method` here?

Re: [PR] [SPARK-48182][SQL] SQL (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-08 Thread via GitHub
panbingkun commented on code in PR #46450: URL: https://github.com/apache/spark/pull/46450#discussion_r1594903597 ## sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HiveAuthFactory.java: ## @@ -285,9 +288,10 @@ public String verifyDelegationToken(String
