[GitHub] [spark] ukby1234 commented on pull request #42155: [SPARK-44547][CORE] Ignore fallback storage for cached RDD migration

2023-08-03 Thread via GitHub
ukby1234 commented on PR #42155: URL: https://github.com/apache/spark/pull/42155#issuecomment-1665042187 It's been a while since I opened this pull request. Can I get someone to review this PR? cc @mridulm -- This is an automated message from the Apache Git Service. To respond to the

[GitHub] [spark] HyukjinKwon commented on pull request #42339: [SPARK-44670][PYTHON][TESTS][PS][3.4] Fix 'test_dataframe_conversion.DataFrameConversionTest.get_excel_dfs' test to work with Python 3.7

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42339: URL: https://github.com/apache/spark/pull/42339#issuecomment-1665033273 cc @xinrong-meng too

[GitHub] [spark] HyukjinKwon closed pull request #42206: [SPARK-44582][SQL] Skip iterator on SMJ if it was cleaned up

2023-08-03 Thread via GitHub
HyukjinKwon closed pull request #42206: [SPARK-44582][SQL] Skip iterator on SMJ if it was cleaned up URL: https://github.com/apache/spark/pull/42206

[GitHub] [spark] HyukjinKwon commented on pull request #42206: [SPARK-44582][SQL] Skip iterator on SMJ if it was cleaned up

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42206: URL: https://github.com/apache/spark/pull/42206#issuecomment-1665029179 Merged to master.

[GitHub] [spark] Madhukar98 opened a new pull request, #42339: [SPARK-44670][PYTHON] Fix the tests for python3.7

2023-08-03 Thread via GitHub
Madhukar98 opened a new pull request, #42339: URL: https://github.com/apache/spark/pull/42339 ### What changes were proposed in this pull request? The fix is to use openpyxl by default instead of xlrd. ### Why are the changes needed? test_to_excel test case was

[GitHub] [spark] bersprockets commented on pull request #42206: [SPARK-44582][SQL] Skip iterator on SMJ if it was cleaned up

2023-08-03 Thread via GitHub
bersprockets commented on PR #42206: URL: https://github.com/apache/spark/pull/42206#issuecomment-1664996034 Thanks. Looks good.

[GitHub] [spark] HyukjinKwon commented on pull request #42338: [SPARK-44671][PYTHON][CONNECT] Retry ExecutePlan in case initial request didn't reach server in Python client

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42338: URL: https://github.com/apache/spark/pull/42338#issuecomment-1664963753 cc @juliuszsompolski @zhengruifeng @ueshin Please take a look

[GitHub] [spark] HyukjinKwon opened a new pull request, #42338: [SPARK-44671][PYTHON][CONNECT] Retry ExecutePlan in case initial request didn't reach server in Python client

2023-08-03 Thread via GitHub
HyukjinKwon opened a new pull request, #42338: URL: https://github.com/apache/spark/pull/42338 ### What changes were proposed in this pull request? This is the Python-client counterpart of https://github.com/apache/spark/pull/42282, added for symmetry. ### Why are the changes needed? See also

[GitHub] [spark] hvanhovell commented on pull request #42331: [SPARK-44656][CONNECT] Make all iterators CloseableIterators

2023-08-03 Thread via GitHub
hvanhovell commented on PR #42331: URL: https://github.com/apache/spark/pull/42331#issuecomment-1664950906 A bit of a monkey wrench. I am fine with the current approach. I am just wondering if at this point using the GRPC iterators is the easiest? Would it be easier to use a stream

[GitHub] [spark] pan3793 commented on pull request #42336: [SPARK-44669][SQL][HIVE] Parquet/ORC files written using Hive Serde should has file extension

2023-08-03 Thread via GitHub
pan3793 commented on PR #42336: URL: https://github.com/apache/spark/pull/42336#issuecomment-1664948047 cc @wangyum @ulysses-you @yaooqinn

[GitHub] [spark] allisonwang-db commented on pull request #42302: [SPARK-44640][PYTHON] Improve error messages for Python UDTF returning non Iterable

2023-08-03 Thread via GitHub
allisonwang-db commented on PR #42302: URL: https://github.com/apache/spark/pull/42302#issuecomment-1664945741 Yup we need this in branch-3.5. Created https://github.com/apache/spark/pull/42337

[GitHub] [spark] allisonwang-db opened a new pull request, #42337: [SPARK-44640][PYTHON][3.5] Improve error messages for Python UDTF returning non Iterable

2023-08-03 Thread via GitHub
allisonwang-db opened a new pull request, #42337: URL: https://github.com/apache/spark/pull/42337 … This PR improves error messages when the result of a Python UDTF is not an Iterable. It also improves the error messages when a UDTF encounters an exception when executing `eval`.
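
As a rough illustration of the kind of check this PR description refers to, the sketch below validates that the result of a UDTF-style `eval` is iterable before its rows are consumed. It is not PySpark's implementation: `check_udtf_result`, the error text, and the choice to treat `None` as "no rows" are all assumptions made for this example.

```python
from collections.abc import Iterable


def check_udtf_result(result):
    """Hypothetical helper: collect the rows produced by a UDTF-style eval.

    Assumption: None means "no rows"; anything else must be iterable.
    """
    if result is None:
        return []
    if not isinstance(result, Iterable):
        # A clear message names the offending type instead of failing later
        # with an unrelated "object is not iterable" error.
        raise TypeError(
            f"expected an iterable of rows from eval, got {type(result).__name__}"
        )
    return list(result)


def good_eval(x):
    # A generator is iterable, so this passes the check.
    yield (x, x + 1)


def bad_eval(x):
    # Returns a bare int, which is not iterable.
    return x + 1
```

Calling `check_udtf_result(good_eval(1))` returns the rows as a list, while `check_udtf_result(bad_eval(1))` raises a `TypeError` that names the unexpected type.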

[GitHub] [spark] pan3793 opened a new pull request, #42336: [SPARK-44669][SQL][HIVE] Parquet/ORC files written using Hive Serde should has file extension

2023-08-03 Thread via GitHub
pan3793 opened a new pull request, #42336: URL: https://github.com/apache/spark/pull/42336 ### What changes were proposed in this pull request? Add file extensions for Parquet/ORC files written using Hive Serde, to keep behavior consistent with Spark DataSource

[GitHub] [spark] liangyu-1 commented on pull request #42295: [SPARK-44581][YARN]Fix the bug that ShutdownHookManager get wrong hadoop user group information

2023-08-03 Thread via GitHub
liangyu-1 commented on PR #42295: URL: https://github.com/apache/spark/pull/42295#issuecomment-1664932235 > The staging directory is cleaned automatically by Spark, why do you even need this hook? @yaooqinn Spark cleans the staging directory in this Hook, in spark 2.4

[GitHub] [spark] HyukjinKwon commented on pull request #42118: [SPARK-44264][PYTHON]E2E Testing for Deepspeed

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42118: URL: https://github.com/apache/spark/pull/42118#issuecomment-1664932454 @mathewjacob1002 and @maddiedawson can you follow up ^ please?

[GitHub] [spark] HyukjinKwon commented on pull request #42332: [SPARK-44665][PYTHON] Add support for pandas DataFrame assertDataFrameEqual

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42332: URL: https://github.com/apache/spark/pull/42332#issuecomment-1664931905 cc @allisonwang-db @xinrong-meng @itholic

[GitHub] [spark] asl3 commented on a diff in pull request #42284: [SPARK-44629] Publish PySpark Test Guidelines webpage

2023-08-03 Thread via GitHub
asl3 commented on code in PR #42284: URL: https://github.com/apache/spark/pull/42284#discussion_r1283948313 ## python/docs/source/getting_started/index.rst: ## @@ -40,3 +40,4 @@ The list below is the contents of this quickstart page: quickstart_df quickstart_connect

[GitHub] [spark] 7mming7 opened a new pull request, #42335: [SPARK-44654][SQL]Optimize InSubquery Partition pruning

2023-08-03 Thread via GitHub
7mming7 opened a new pull request, #42335: URL: https://github.com/apache/spark/pull/42335 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change?

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42332: [SPARK-44665][PYTHON] Add support for pandas DataFrame assertDataFrameEqual

2023-08-03 Thread via GitHub
HyukjinKwon commented on code in PR #42332: URL: https://github.com/apache/spark/pull/42332#discussion_r1283947706 ## python/pyspark/testing/pandasutils.py: ## @@ -159,13 +160,26 @@ def _assert_pandas_almost_equal( This function checks if given pandas objects approximately

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42332: [SPARK-44665][PYTHON] Add support for pandas DataFrame assertDataFrameEqual

2023-08-03 Thread via GitHub
HyukjinKwon commented on code in PR #42332: URL: https://github.com/apache/spark/pull/42332#discussion_r1283947475 ## python/pyspark/testing/pandasutils.py: ## @@ -159,13 +160,26 @@ def _assert_pandas_almost_equal( This function checks if given pandas objects approximately

[GitHub] [spark] zhengruifeng opened a new pull request, #42334: [SPARK-44667][INFRA] Uninstall large ML libraries for non-ML jobs

2023-08-03 Thread via GitHub
zhengruifeng opened a new pull request, #42334: URL: https://github.com/apache/spark/pull/42334 ### What changes were proposed in this pull request? Uninstall large ML libraries for non-ML jobs ### Why are the changes needed? ML is integrating external frameworks: torch,

[GitHub] [spark] zhengruifeng opened a new pull request, #42333: [SPARK-44618][INFRA] Uninstall CodeQL/Go/Node in non-container jobs

2023-08-03 Thread via GitHub
zhengruifeng opened a new pull request, #42333: URL: https://github.com/apache/spark/pull/42333 ### What changes were proposed in this pull request? Uninstall CodeQL/Go/Node in non-container jobs ### Why are the changes needed? it can save 10G disk space before this

[GitHub] [spark] yaooqinn commented on pull request #42287: [SPARK-44632][CORE] DiskBlockManager should check and be able to handle stale directories

2023-08-03 Thread via GitHub
yaooqinn commented on PR #42287: URL: https://github.com/apache/spark/pull/42287#issuecomment-1664924280 cc @tgravescs @cloud-fan @HyukjinKwon, thanks

[GitHub] [spark] cloud-fan closed pull request #42315: [SPARK-44653][SQL] Non-trivial DataFrame unions should not break caching

2023-08-03 Thread via GitHub
cloud-fan closed pull request #42315: [SPARK-44653][SQL] Non-trivial DataFrame unions should not break caching URL: https://github.com/apache/spark/pull/42315

[GitHub] [spark] cloud-fan commented on pull request #42315: [SPARK-44653][SQL] Non-trivial DataFrame unions should not break caching

2023-08-03 Thread via GitHub
cloud-fan commented on PR #42315: URL: https://github.com/apache/spark/pull/42315#issuecomment-1664918205 thanks for the review, merging to master/3.5/3.4!

[GitHub] [spark] wangyum commented on a diff in pull request #42315: [SPARK-44653][SQL] Non-trivial DataFrame unions should not break caching

2023-08-03 Thread via GitHub
wangyum commented on code in PR #42315: URL: https://github.com/apache/spark/pull/42315#discussion_r1283936632 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2272,9 +2316,7 @@ class Dataset[T] private[sql]( * @since 2.0.0 */ def union(other:

[GitHub] [spark] zhengruifeng commented on pull request #42118: [SPARK-44264][PYTHON]E2E Testing for Deepspeed

2023-08-03 Thread via GitHub
zhengruifeng commented on PR #42118: URL: https://github.com/apache/spark/pull/42118#issuecomment-1664917362 following tests are actually skipped: ``` Skipped tests in pyspark.ml.deepspeed.tests.test_deepspeed_distributor with python3.9: test_pytorch_file_e2e

[GitHub] [spark] cloud-fan commented on pull request #42223: [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates

2023-08-03 Thread via GitHub
cloud-fan commented on PR #42223: URL: https://github.com/apache/spark/pull/42223#issuecomment-1664916455 For merging `func1(...) ... WHERE cond1` and `func2(...) ... WHERE cond2`, we got ``` func1(...) FILTER cond1, func2(...) FILTER cond2 ... WHERE cond1 OR cond2 ```
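
The rewrite described in this comment can be illustrated outside Spark. The sketch below is plain Python, not the optimizer rule under review: the rows, `cond1`, `cond2`, and the choice of a sum and a count as the two aggregates are all hypothetical. It checks that two separately filtered aggregates give the same answer as one scan that keeps rows matching `cond1 OR cond2` and applies each aggregate's own filter.

```python
# Hypothetical input rows; "k" and "v" are made-up column names.
rows = [
    {"k": 1, "v": 10},
    {"k": 1, "v": 20},
    {"k": 2, "v": 30},
    {"k": 2, "v": 5},
]


def cond1(row):
    return row["v"] > 15


def cond2(row):
    return row["k"] == 1


# Before: two independent scans, each with its own WHERE clause.
sum_where_cond1 = sum(r["v"] for r in rows if cond1(r))
count_where_cond2 = sum(1 for r in rows if cond2(r))


def merged_aggregates(data):
    """One scan: WHERE cond1 OR cond2, with a per-aggregate FILTER."""
    total, count = 0, 0
    for r in data:
        if not (cond1(r) or cond2(r)):
            continue  # dropped by the combined WHERE
        if cond1(r):  # FILTER for the first aggregate
            total += r["v"]
        if cond2(r):  # FILTER for the second aggregate
            count += 1
    return total, count
```

The per-aggregate filters are what keep the results unchanged: the combined `WHERE` only discards rows that neither aggregate would have used.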

[GitHub] [spark] HyukjinKwon closed pull request #42282: [SPARK-44624][CONNECT] Retry ExecutePlan in case initial request didn't reach server

2023-08-03 Thread via GitHub
HyukjinKwon closed pull request #42282: [SPARK-44624][CONNECT] Retry ExecutePlan in case initial request didn't reach server URL: https://github.com/apache/spark/pull/42282

[GitHub] [spark] cloud-fan commented on a diff in pull request #42315: [SPARK-44653][SQL] Non-trivial DataFrame unions should not break caching

2023-08-03 Thread via GitHub
cloud-fan commented on code in PR #42315: URL: https://github.com/apache/spark/pull/42315#discussion_r1283926955 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2272,9 +2316,7 @@ class Dataset[T] private[sql]( * @since 2.0.0 */ def union(other:

[GitHub] [spark] HyukjinKwon closed pull request #42330: [SPARK-44664][PYTHON][CONNECT] Release the execute when closing the iterator in Python client

2023-08-03 Thread via GitHub
HyukjinKwon closed pull request #42330: [SPARK-44664][PYTHON][CONNECT] Release the execute when closing the iterator in Python client URL: https://github.com/apache/spark/pull/42330

[GitHub] [spark] yaooqinn commented on pull request #42295: [SPARK-44581][YARN]Fix the bug that ShutdownHookManager get wrong hadoop user group information

2023-08-03 Thread via GitHub
yaooqinn commented on PR #42295: URL: https://github.com/apache/spark/pull/42295#issuecomment-1664906674 The staging directory is cleaned automatically by Spark, why do you even need this hook?

[GitHub] [spark] HyukjinKwon commented on pull request #42282: [SPARK-44624][CONNECT] Retry ExecutePlan in case initial request didn't reach server

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42282: URL: https://github.com/apache/spark/pull/42282#issuecomment-1664906517 Merged to master and branch-3.5.

[GitHub] [spark] HyukjinKwon commented on pull request #42330: [SPARK-44664][PYTHON][CONNECT] Release the execute when closing the iterator in Python client

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42330: URL: https://github.com/apache/spark/pull/42330#issuecomment-1664906302 Merged to master and branch-3.5.

[GitHub] [spark] wangyum commented on a diff in pull request #42315: [SPARK-44653][SQL] Non-trivial DataFrame unions should not break caching

2023-08-03 Thread via GitHub
wangyum commented on code in PR #42315: URL: https://github.com/apache/spark/pull/42315#discussion_r1283918802 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2272,9 +2316,7 @@ class Dataset[T] private[sql]( * @since 2.0.0 */ def union(other:

[GitHub] [spark] beliefer commented on a diff in pull request #42223: [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates

2023-08-03 Thread via GitHub
beliefer commented on code in PR #42223: URL: https://github.com/apache/spark/pull/42223#discussion_r1283905000 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CombineJoinedAggregates.scala: ## @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] beliefer commented on a diff in pull request #42223: [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates

2023-08-03 Thread via GitHub
beliefer commented on code in PR #42223: URL: https://github.com/apache/spark/pull/42223#discussion_r1283903546 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CombineJoinedAggregates.scala: ## @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] asl3 commented on a diff in pull request #42332: [SPARK-44665] Add support for pandas DataFrame assertDataFrameEqual

2023-08-03 Thread via GitHub
asl3 commented on code in PR #42332: URL: https://github.com/apache/spark/pull/42332#discussion_r1283898250 ## python/pyspark/sql/tests/test_utils.py: ## @@ -746,28 +748,123 @@ def test_assert_unequal_null_expected(self): ) def

[GitHub] [spark] beliefer commented on a diff in pull request #42223: [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates

2023-08-03 Thread via GitHub
beliefer commented on code in PR #42223: URL: https://github.com/apache/spark/pull/42223#discussion_r1283895467 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateJoinByCombineAggregate.scala: ## @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache

[GitHub] [spark] zhengruifeng commented on pull request #42253: [SPARK-44619][INFRA] Free up disk space for container jobs

2023-08-03 Thread via GitHub
zhengruifeng commented on PR #42253: URL: https://github.com/apache/spark/pull/42253#issuecomment-1664861475 thanks, merged to master

[GitHub] [spark] zhengruifeng closed pull request #42253: [SPARK-44619][INFRA] Free up disk space for container jobs

2023-08-03 Thread via GitHub
zhengruifeng closed pull request #42253: [SPARK-44619][INFRA] Free up disk space for container jobs URL: https://github.com/apache/spark/pull/42253

[GitHub] [spark] ulysses-you commented on a diff in pull request #42318: [SPARK-44655][SQL] Make the code cleaner about static and dynamic data/partition filters

2023-08-03 Thread via GitHub
ulysses-you commented on code in PR #42318: URL: https://github.com/apache/spark/pull/42318#discussion_r1283886691 ## sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala: ## @@ -371,49 +373,47 @@ trait FileSourceScanLike extends DataSourceScanExec {

[GitHub] [spark] HyukjinKwon closed pull request #42316: [SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch

2023-08-03 Thread via GitHub
HyukjinKwon closed pull request #42316: [SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch URL: https://github.com/apache/spark/pull/42316

[GitHub] [spark] zhengruifeng commented on a diff in pull request #42332: [SPARK-44665] Add support for pandas DataFrame assertDataFrameEqual

2023-08-03 Thread via GitHub
zhengruifeng commented on code in PR #42332: URL: https://github.com/apache/spark/pull/42332#discussion_r1283884122 ## python/pyspark/sql/tests/test_utils.py: ## @@ -746,28 +748,123 @@ def test_assert_unequal_null_expected(self): ) def

[GitHub] [spark] HyukjinKwon commented on pull request #42316: [SPARK-40770][PYTHON][FOLLOW-UP][3.5] Improved error messages for mapInPandas for schema mismatch

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42316: URL: https://github.com/apache/spark/pull/42316#issuecomment-1664851259 Merged to branch-3.5.

[GitHub] [spark] HyukjinKwon closed pull request #42268: [SPARK-43562][SPARK-43870][PS] Remove APIs from `DataFrame` and `Series`

2023-08-03 Thread via GitHub
HyukjinKwon closed pull request #42268: [SPARK-43562][SPARK-43870][PS] Remove APIs from `DataFrame` and `Series` URL: https://github.com/apache/spark/pull/42268

[GitHub] [spark] HyukjinKwon commented on pull request #42268: [SPARK-43562][SPARK-43870][PS] Remove APIs from `DataFrame` and `Series`

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42268: URL: https://github.com/apache/spark/pull/42268#issuecomment-1664850264 Merged to master.

[GitHub] [spark] HyukjinKwon closed pull request #42319: [SPARK-43873][PS] Enabling `FrameDescribeTests`

2023-08-03 Thread via GitHub
HyukjinKwon closed pull request #42319: [SPARK-43873][PS] Enabling `FrameDescribeTests` URL: https://github.com/apache/spark/pull/42319

[GitHub] [spark] HyukjinKwon commented on pull request #42319: [SPARK-43873][PS] Enabling `FrameDescribeTests`

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42319: URL: https://github.com/apache/spark/pull/42319#issuecomment-1664849527 Merged to master and branch-3.5.

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42284: [SPARK-44629] Publish PySpark Test Guidelines webpage

2023-08-03 Thread via GitHub
HyukjinKwon commented on code in PR #42284: URL: https://github.com/apache/spark/pull/42284#discussion_r1283881548 ## python/docs/source/getting_started/index.rst: ## @@ -40,3 +40,4 @@ The list below is the contents of this quickstart page: quickstart_df

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42284: [SPARK-44629] Publish PySpark Test Guidelines webpage

2023-08-03 Thread via GitHub
HyukjinKwon commented on code in PR #42284: URL: https://github.com/apache/spark/pull/42284#discussion_r1283881332 ## python/docs/source/getting_started/testing_pyspark.ipynb: ## @@ -0,0 +1,525 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id":

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42284: [SPARK-44629] Publish PySpark Test Guidelines webpage

2023-08-03 Thread via GitHub
HyukjinKwon commented on code in PR #42284: URL: https://github.com/apache/spark/pull/42284#discussion_r1283881012 ## python/docs/source/getting_started/testing_pyspark.ipynb: ## @@ -0,0 +1,525 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id":

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42284: [SPARK-44629] Publish PySpark Test Guidelines webpage

2023-08-03 Thread via GitHub
HyukjinKwon commented on code in PR #42284: URL: https://github.com/apache/spark/pull/42284#discussion_r1283880817 ## python/docs/source/getting_started/testing_pyspark.ipynb: ## @@ -0,0 +1,525 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id":

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42284: [SPARK-44629] Publish PySpark Test Guidelines webpage

2023-08-03 Thread via GitHub
HyukjinKwon commented on code in PR #42284: URL: https://github.com/apache/spark/pull/42284#discussion_r1283880695 ## python/docs/source/getting_started/testing_pyspark.ipynb: ## @@ -0,0 +1,525 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id":

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42284: [SPARK-44629] Publish PySpark Test Guidelines webpage

2023-08-03 Thread via GitHub
HyukjinKwon commented on code in PR #42284: URL: https://github.com/apache/spark/pull/42284#discussion_r1283880513 ## python/docs/source/getting_started/testing_pyspark.ipynb: ## @@ -0,0 +1,525 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id":

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42284: [SPARK-44629] Publish PySpark Test Guidelines webpage

2023-08-03 Thread via GitHub
HyukjinKwon commented on code in PR #42284: URL: https://github.com/apache/spark/pull/42284#discussion_r1283880415 ## python/docs/source/getting_started/testing_pyspark.ipynb: ## @@ -0,0 +1,525 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id":

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42284: [SPARK-44629] Publish PySpark Test Guidelines webpage

2023-08-03 Thread via GitHub
HyukjinKwon commented on code in PR #42284: URL: https://github.com/apache/spark/pull/42284#discussion_r1283880241 ## python/docs/source/getting_started/testing_pyspark.ipynb: ## @@ -0,0 +1,525 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id":

[GitHub] [spark] HyukjinKwon commented on pull request #42302: [SPARK-44640][PYTHON] Improve error messages for Python UDTF returning non Iterable

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42302: URL: https://github.com/apache/spark/pull/42302#issuecomment-1664841895 It has a conflict with branch-3.5. Should we create a PR for it?

[GitHub] [spark] HyukjinKwon closed pull request #42302: [SPARK-44640][PYTHON] Improve error messages for Python UDTF returning non Iterable

2023-08-03 Thread via GitHub
HyukjinKwon closed pull request #42302: [SPARK-44640][PYTHON] Improve error messages for Python UDTF returning non Iterable URL: https://github.com/apache/spark/pull/42302

[GitHub] [spark] HyukjinKwon commented on pull request #42302: [SPARK-44640][PYTHON] Improve error messages for Python UDTF returning non Iterable

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42302: URL: https://github.com/apache/spark/pull/42302#issuecomment-1664841189 Merged to master and branch-3.5.

[GitHub] [spark] cloud-fan commented on a diff in pull request #42315: [SPARK-44653][SQL] Non-trivial DataFrame unions should not break caching

2023-08-03 Thread via GitHub
cloud-fan commented on code in PR #42315: URL: https://github.com/apache/spark/pull/42315#discussion_r1283875186 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -157,7 +157,7 @@ abstract class Optimizer(catalogManager:

[GitHub] [spark] LuciferYang commented on pull request #42236: [SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use `spark-proto` uber jar to test the `connect` module

2023-08-03 Thread via GitHub
LuciferYang commented on PR #42236: URL: https://github.com/apache/spark/pull/42236#issuecomment-1664828251 While I'm not certain if it's reasonable, I still want to point out that relocating the content of the `spark-protobuf` module may result in a poorer user experience: In order to use

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42309: [SPARK-44644][PYTHON] Improve error messages for Python UDTFs with pickling errors

2023-08-03 Thread via GitHub
allisonwang-db commented on code in PR #42309: URL: https://github.com/apache/spark/pull/42309#discussion_r1283860273 ## python/pyspark/cloudpickle/cloudpickle_fast.py: ## @@ -631,7 +631,7 @@ def dump(self, obj): try: return Pickler.dump(self, obj)

[GitHub] [spark] LuciferYang commented on pull request #42236: [SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use `spark-proto` uber jar to test the `connect` module

2023-08-03 Thread via GitHub
LuciferYang commented on PR #42236: URL: https://github.com/apache/spark/pull/42236#issuecomment-1664798658 > Would it be easier if we change maven to use the unshaded jar?

[GitHub] [spark] LuciferYang opened a new pull request, #41466: [SPARK-43646][PROTOBUF][BUILD] Split `protobuf-assembly` module from `protobuf` module

2023-08-03 Thread via GitHub
LuciferYang opened a new pull request, #41466: URL: https://github.com/apache/spark/pull/41466 ### What changes were proposed in this pull request? Before this PR, the Maven tests of the connect server module fail. Run: ``` build/mvn clean install -DskipTests

[GitHub] [spark] LuciferYang commented on pull request #41466: [SPARK-43646][PROTOBUF][BUILD] Split `protobuf-assembly` module from `protobuf` module

2023-08-03 Thread via GitHub
LuciferYang commented on PR #41466: URL: https://github.com/apache/spark/pull/41466#issuecomment-1664798514 Reopen.

[GitHub] [spark] asl3 opened a new pull request, #42332: [SPARK-44665] Add support for pandas DataFrame assertDataFrameEqual

2023-08-03 Thread via GitHub
asl3 opened a new pull request, #42332: URL: https://github.com/apache/spark/pull/42332 ### What changes were proposed in this pull request? This PR adds support for pandas DataFrame in `assertDataFrameEqual`, while delaying all pandas imports until pandas environment dependency is

[GitHub] [spark] LuciferYang commented on a diff in pull request #42236: [SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use `spark-proto` uber jar to test the `connect` module

2023-08-03 Thread via GitHub
LuciferYang commented on code in PR #42236: URL: https://github.com/apache/spark/pull/42236#discussion_r1283844577 ## connector/connect/server/src/test/resources/test.proto: ## @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + *

[GitHub] [spark] github-actions[bot] closed pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2023-08-03 Thread via GitHub
github-actions[bot] closed pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni) URL: https://github.com/apache/spark/pull/38171

[GitHub] [spark] github-actions[bot] commented on pull request #40629: [SPARK-42980][CORE] Implement a lightweight SmallBroadcast

2023-08-03 Thread via GitHub
github-actions[bot] commented on PR #40629: URL: https://github.com/apache/spark/pull/40629#issuecomment-1664792440 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] commented on pull request #40665: [SPARK-42621][PS] Add inclusive parameter for pd.date_range

2023-08-03 Thread via GitHub
github-actions[bot] commented on PR #40665: URL: https://github.com/apache/spark/pull/40665#issuecomment-1664792424 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] commented on pull request #40918: [WIP][CORE] Add shuffle sort merge joins to RDD API

2023-08-03 Thread via GitHub
github-actions[bot] commented on PR #40918: URL: https://github.com/apache/spark/pull/40918#issuecomment-1664792404 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] commented on pull request #40929: [SPARK-43264][SQL] Avoid allocation of unwritten ColumnVector in Spark Vectorized Reader

2023-08-03 Thread via GitHub
github-actions[bot] commented on PR #40929: URL: https://github.com/apache/spark/pull/40929#issuecomment-1664792377 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] closed pull request #40930: [DO NOT MERGE] File constant metadata extractors split

2023-08-03 Thread via GitHub
github-actions[bot] closed pull request #40930: [DO NOT MERGE] File constant metadata extractors split URL: https://github.com/apache/spark/pull/40930 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[GitHub] [spark] HyukjinKwon commented on pull request #42283: [SPARK-44433][PYTHON][CONNECT][SS][FOLLOWUP] Terminate listener process with `removeListener` and improvements

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42283: URL: https://github.com/apache/spark/pull/42283#issuecomment-1664786685 @WweiL it has a conflict with branch-3.5. Mind resolving it and creating a PR, please? -- This is an automated message from the Apache Git Service. To respond to the message,

[GitHub] [spark] juliuszsompolski commented on pull request #42320: [SPARK-44656][CONNECT][FOLLOWUP] Close Iterators in SparkResult as well.

2023-08-03 Thread via GitHub
juliuszsompolski commented on PR #42320: URL: https://github.com/apache/spark/pull/42320#issuecomment-1664786638 Thank you @cdkrot. I continued working on it and incorporated it in https://github.com/apache/spark/pull/42331. That should supersede this. -- This is an automated message

[GitHub] [spark] HyukjinKwon closed pull request #42283: [SPARK-44433][PYTHON][CONNECT][SS][FOLLOWUP] Terminate listener process with `removeListener` and improvements

2023-08-03 Thread via GitHub
HyukjinKwon closed pull request #42283: [SPARK-44433][PYTHON][CONNECT][SS][FOLLOWUP] Terminate listener process with `removeListener` and improvements URL: https://github.com/apache/spark/pull/42283 -- This is an automated message from the Apache Git Service. To respond to the message,

[GitHub] [spark] HyukjinKwon commented on pull request #42283: [SPARK-44433][PYTHON][CONNECT][SS][FOLLOWUP] Terminate listener process with `removeListener` and improvements

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42283: URL: https://github.com/apache/spark/pull/42283#issuecomment-1664785828 Merged to master and branch-3.5. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] juliuszsompolski opened a new pull request, #42331: [SPARK-44656][CONNECT] Make all iterators CloseableIterators

2023-08-03 Thread via GitHub
juliuszsompolski opened a new pull request, #42331: URL: https://github.com/apache/spark/pull/42331 ### What changes were proposed in this pull request? This makes sure that all iterators used in Spark Connect scala client are `CloseableIterator`. 1. Makes

[GitHub] [spark] HyukjinKwon commented on pull request #42330: [SPARK-44664][PYTHON][CONNECT] Release the execute when closing the iterator in Python client

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42330: URL: https://github.com/apache/spark/pull/42330#issuecomment-1664782620 cc @juliuszsompolski, @cdkrot, @zhengruifeng and @ueshin FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

[GitHub] [spark] HyukjinKwon opened a new pull request, #42330: [SPARK-44664][PYTHON][CONNECT] Release the execute when closing the iterator in Python client

2023-08-03 Thread via GitHub
HyukjinKwon opened a new pull request, #42330: URL: https://github.com/apache/spark/pull/42330 ### What changes were proposed in this pull request? This PR implements the symmetry of https://github.com/apache/spark/pull/42304 and https://github.com/apache/spark/pull/42320.

[GitHub] [spark] srowen commented on pull request #42322: [MINOR][DOC] Fix a typo in ResolveReferencesInUpdate scaladoc

2023-08-03 Thread via GitHub
srowen commented on PR #42322: URL: https://github.com/apache/spark/pull/42322#issuecomment-1664773317 Merged to master/3.5 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] srowen closed pull request #42322: [MINOR][DOC] Fix a typo in ResolveReferencesInUpdate scaladoc

2023-08-03 Thread via GitHub
srowen closed pull request #42322: [MINOR][DOC] Fix a typo in ResolveReferencesInUpdate scaladoc URL: https://github.com/apache/spark/pull/42322 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] sdruzkin commented on pull request #42322: [MINOR][DOC] Fix a typo in ResolveReferencesInUpdate scaladoc

2023-08-03 Thread via GitHub
sdruzkin commented on PR #42322: URL: https://github.com/apache/spark/pull/42322#issuecomment-1664771686 Tests are green. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] HyukjinKwon commented on pull request #42320: [SPARK-44656][CONNECT][FOLLOWUP] Close Iterators in SparkResult as well.

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42320: URL: https://github.com/apache/spark/pull/42320#issuecomment-1664769082 Merged https://github.com/apache/spark/pull/42304 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

[GitHub] [spark] HyukjinKwon closed pull request #42304: [SPARK-44642][CONNECT] ReleaseExecute in ExecutePlanResponseReattachableIterator after it gets error from server

2023-08-03 Thread via GitHub
HyukjinKwon closed pull request #42304: [SPARK-44642][CONNECT] ReleaseExecute in ExecutePlanResponseReattachableIterator after it gets error from server URL: https://github.com/apache/spark/pull/42304 -- This is an automated message from the Apache Git Service. To respond to the message,

[GitHub] [spark] HyukjinKwon commented on pull request #42304: [SPARK-44642][CONNECT] ReleaseExecute in ExecutePlanResponseReattachableIterator after it gets error from server

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42304: URL: https://github.com/apache/spark/pull/42304#issuecomment-1664768521 Merged to master and branch-3.5. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] HyukjinKwon commented on pull request #42314: [SPARK-44652] Raise error when only one df is None

2023-08-03 Thread via GitHub
HyukjinKwon commented on PR #42314: URL: https://github.com/apache/spark/pull/42314#issuecomment-1664766927 Merged to master and branch-3.5. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] HyukjinKwon closed pull request #42314: [SPARK-44652] Raise error when only one df is None

2023-08-03 Thread via GitHub
HyukjinKwon closed pull request #42314: [SPARK-44652] Raise error when only one df is None URL: https://github.com/apache/spark/pull/42314 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] ueshin commented on a diff in pull request #42283: [SPARK-44433][PYTHON][CONNECT][SS][FOLLOWUP] Terminate listener process with `removeListener` and improvements

2023-08-03 Thread via GitHub
ueshin commented on code in PR #42283: URL: https://github.com/apache/spark/pull/42283#discussion_r1283804782 ## core/src/main/scala/org/apache/spark/api/python/StreamingPythonRunner.scala: ## @@ -60,9 +69,9 @@ private[spark] class StreamingPythonRunner(func: PythonFunction,

[GitHub] [spark] dtenedor commented on a diff in pull request #42272: [SPARK-44508][PYTHON][DOCS] Add user guide for Python user-defined table functions

2023-08-03 Thread via GitHub
dtenedor commented on code in PR #42272: URL: https://github.com/apache/spark/pull/42272#discussion_r1283804978 ## python/docs/source/user_guide/sql/python_udtf.rst: ## @@ -0,0 +1,140 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42272: [SPARK-44508][PYTHON][DOCS] Add user guide for Python user-defined table functions

2023-08-03 Thread via GitHub
allisonwang-db commented on code in PR #42272: URL: https://github.com/apache/spark/pull/42272#discussion_r1283803283 ## python/docs/source/user_guide/sql/python_udtf.rst: ## @@ -0,0 +1,140 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42272: [SPARK-44508][PYTHON][DOCS] Add user guide for Python user-defined table functions

2023-08-03 Thread via GitHub
allisonwang-db commented on code in PR #42272: URL: https://github.com/apache/spark/pull/42272#discussion_r1283802304 ## python/docs/source/user_guide/sql/python_udtf.rst: ## @@ -0,0 +1,140 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more

[GitHub] [spark] ueshin commented on a diff in pull request #42302: [SPARK-44640][PYTHON] Improve error messages for Python UDTF returning non Iterable

2023-08-03 Thread via GitHub
ueshin commented on code in PR #42302: URL: https://github.com/apache/spark/pull/42302#discussion_r1283569195 ## python/pyspark/worker.py: ## @@ -599,7 +600,7 @@ def verify_result(result): raise PySparkTypeError(

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42272: [SPARK-44508][PYTHON][DOCS] Add user guide for Python user-defined table functions

2023-08-03 Thread via GitHub
allisonwang-db commented on code in PR #42272: URL: https://github.com/apache/spark/pull/42272#discussion_r1283791586 ## examples/src/main/python/sql/udtf.py: ## @@ -0,0 +1,169 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42272: [SPARK-44508][PYTHON][DOCS] Add user guide for Python user-defined table functions

2023-08-03 Thread via GitHub
allisonwang-db commented on code in PR #42272: URL: https://github.com/apache/spark/pull/42272#discussion_r1283790876 ## examples/src/main/python/sql/udtf.py: ## @@ -0,0 +1,169 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license

[GitHub] [spark] ueshin commented on pull request #42328: [SPARK-43967][SQL][PYTHON] Add memory limits for Python UDTF analyzer

2023-08-03 Thread via GitHub
ueshin commented on PR #42328: URL: https://github.com/apache/spark/pull/42328#issuecomment-1664724038 cc @allisonwang-db @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] allisonwang-db opened a new pull request, #42329: [SPARK-44663][PYTHON] Disable arrow optimization by default for Python UDTFs

2023-08-03 Thread via GitHub
allisonwang-db opened a new pull request, #42329: URL: https://github.com/apache/spark/pull/42329 ### What changes were proposed in this pull request? This PR disables arrow optimization by default for Python UDTFs. ### Why are the changes needed? To make Python
