Re: [PR] [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup [spark]

2023-11-06 Thread via GitHub
EnricoMi commented on code in PR #38624: URL: https://github.com/apache/spark/pull/38624#discussion_r1384512107 ## python/pyspark/worker.py: ## @@ -306,6 +308,33 @@ def verify_element(elem): ) +def wrap_cogrouped_map_arrow_udf(f, return_type, argspec, runner_conf):

Re: [PR] [SPARK-45798][CONNECT] Assert server-side session ID [spark]

2023-11-06 Thread via GitHub
grundprinzip commented on code in PR #43664: URL: https://github.com/apache/spark/pull/43664#discussion_r1384511928 ## python/pyspark/sql/connect/client/core.py: ## @@ -1620,6 +1593,42 @@ def cache_artifact(self, blob: bytes) -> str: return

Re: [PR] [SPARK-45013][TEST] Flaky Test with NPE: track allocated resources by taskId [spark]

2023-11-06 Thread via GitHub
beliefer commented on PR #43693: URL: https://github.com/apache/spark/pull/43693#issuecomment-1797975709 @yaooqinn Thank you for the fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [PR] [SPARK-45013][TEST] Flaky Test with NPE: track allocated resources by taskId [spark]

2023-11-06 Thread via GitHub
yaooqinn commented on PR #43693: URL: https://github.com/apache/spark/pull/43693#issuecomment-1797943499 Thanks @dongjoon-hyun

Re: [PR] [SPARK-45808][CONNECT][PYTHON] Better error handling for SQL Exceptions [spark]

2023-11-06 Thread via GitHub
grundprinzip commented on code in PR #43667: URL: https://github.com/apache/spark/pull/43667#discussion_r1384460826 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectFetchErrorDetailsHandler.scala: ## @@ -46,9 +44,7 @@ class

Re: [PR] [SPARK-45816][SQL] Return `NULL` when overflowing during casting from timestamp to integers [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun commented on code in PR #43694: URL: https://github.com/apache/spark/pull/43694#discussion_r1384458242 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -785,17 +786,19 @@ case class Cast( buildCast[Boolean](_, b =>

Re: [PR] [SPARK-45013][TEST] Flaky Test with NPE: track allocated resources by taskId [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun commented on PR #43693: URL: https://github.com/apache/spark/pull/43693#issuecomment-1797916547 According to the `Affected Version` of JIRA, I landed to master branch only. Please feel free to backport this if you need.

Re: [PR] [SPARK-45013][TEST] Flaky Test with NPE: track allocated resources by taskId [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun closed pull request #43693: [SPARK-45013][TEST] Flaky Test with NPE: track allocated resources by taskId URL: https://github.com/apache/spark/pull/43693

Re: [PR] [SPARK-45816][SQL] Return `NULL` when overflowing during casting from timestamp to integers [spark]

2023-11-06 Thread via GitHub
viirya commented on code in PR #43694: URL: https://github.com/apache/spark/pull/43694#discussion_r1384455469 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -785,17 +786,19 @@ case class Cast( buildCast[Boolean](_, b => if (b)

Re: [PR] [SPARK-45013][TEST] Flaky Test with NPE: track allocated resources by taskId [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun commented on PR #43693: URL: https://github.com/apache/spark/pull/43693#issuecomment-1797914973 I verified manually. ``` [info] CoarseGrainedExecutorBackendSuite: [info] - parsing no resources (468 milliseconds) [info] - parsing one resource (27 milliseconds)

[PR] [WIP][SQL] Add a SQL config for extra traces in `Origin` [spark]

2023-11-06 Thread via GitHub
MaxGekk opened a new pull request, #43695: URL: https://github.com/apache/spark/pull/43695 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

Re: [PR] [SPARK-45816][SQL] Return null when overflowing during casting from timestamp to integers [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun commented on code in PR #43694: URL: https://github.com/apache/spark/pull/43694#discussion_r1384449877 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -785,17 +786,19 @@ case class Cast( buildCast[Boolean](_, b =>

Re: [PR] [SPARK-45431][DOCS] Document new SSL RPC feature [spark]

2023-11-06 Thread via GitHub
mridulm commented on code in PR #43240: URL: https://github.com/apache/spark/pull/43240#discussion_r1384439623 ## docs/security.md: ## @@ -563,7 +604,52 @@ replaced with one of the above namespaces. ${ns}.trustStoreType JKS -The type of the trust store. +

Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-06 Thread via GitHub
mridulm commented on PR #43627: URL: https://github.com/apache/spark/pull/43627#issuecomment-1797888003 @tgravescs The SparkEnv related change is what gave me pause ... I am less concerned about the Executor side of things

Re: [PR] [SPARK-45816][SQL] Return null when overflowing during casting from timestamp to integers [spark]

2023-11-06 Thread via GitHub
viirya commented on code in PR #43694: URL: https://github.com/apache/spark/pull/43694#discussion_r1384416801 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -785,17 +786,19 @@ case class Cast( buildCast[Boolean](_, b => if (b)

Re: [PR] [SPARK-45816][SQL] Return null when overflowing during casting from timestamp to integers [spark]

2023-11-06 Thread via GitHub
viirya commented on code in PR #43694: URL: https://github.com/apache/spark/pull/43694#discussion_r1384424672 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -785,17 +786,19 @@ case class Cast( buildCast[Boolean](_, b => if (b)

Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]

2023-11-06 Thread via GitHub
panbingkun commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1384419375 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala: ## @@ -1090,6 +1090,26 @@ class SessionCatalog( dbViews ++

Re: [PR] [SPARK-45816][SQL] Return null when overflowing during casting from timestamp to integers [spark]

2023-11-06 Thread via GitHub
viirya commented on code in PR #43694: URL: https://github.com/apache/spark/pull/43694#discussion_r1384417670 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala: ## @@ -514,9 +514,9 @@ class CastWithAnsiOffSuite extends

[PR] [SPARK-45816][SQL] Return null when overflowing during casting from timestamp to integers [spark]

2023-11-06 Thread via GitHub
viirya opened a new pull request, #43694: URL: https://github.com/apache/spark/pull/43694 ### What changes were proposed in this pull request? Spark cast works in two modes: ansi and non-ansi. When overflowing during casting, the common behavior under non-ansi mode is to

Re: [PR] [SPARK-45223][PYTHON][DOCS] Refine docstring of `Column.when` [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun commented on PR #43688: URL: https://github.com/apache/spark/pull/43688#issuecomment-1797863219 Could you re-trigger the failed pipeline?

[PR] [SPARK-45013][TEST] Flaky Test with NPE: track allocated resources by taskId [spark]

2023-11-06 Thread via GitHub
yaooqinn opened a new pull request, #43693: URL: https://github.com/apache/spark/pull/43693 ### What changes were proposed in this pull request? This PR ensures that runningTasks is updated before subsequent tasks can cause an NPE ### Why are the changes

Re: [PR] [SPARK-45511][SS] State Data Source - Reader [spark]

2023-11-06 Thread via GitHub
HeartSaVioR commented on PR #43425: URL: https://github.com/apache/spark/pull/43425#issuecomment-1797817269 cc. @zsxwing @brkyvz @viirya @xuanyuanking Would you mind having a look? Thanks in advance!

Re: [PR] [SPARK-45804][UI] Add spark.ui.threadDump.flamegraphEnabled config to switch flame graph on/off [spark]

2023-11-06 Thread via GitHub
yaooqinn commented on PR #43674: URL: https://github.com/apache/spark/pull/43674#issuecomment-1797731478 Thank you, as always, @dongjoon-hyun and @HyukjinKwon.

Re: [PR] [SPARK-45812][BUILD][PYTHON][PS] Upgrade Pandas to 2.1.2 [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun commented on PR #43689: URL: https://github.com/apache/spark/pull/43689#issuecomment-1797721159 Merged to master. Thank you, @itholic .

Re: [PR] [SPARK-45812][BUILD][PYTHON][PS] Upgrade Pandas to 2.1.2 [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun commented on PR #43689: URL: https://github.com/apache/spark/pull/43689#issuecomment-1797720993 All Python related tests passed.

Re: [PR] [SPARK-45812][BUILD][PYTHON][PS] Upgrade Pandas to 2.1.2 [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun closed pull request #43689: [SPARK-45812][BUILD][PYTHON][PS] Upgrade Pandas to 2.1.2 URL: https://github.com/apache/spark/pull/43689

Re: [PR] [SPARK-45814][CONNECT][SQL] Make ArrowConverters.createEmptyArrowBatch call hasNext to avoid memory leak [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun commented on PR #43691: URL: https://github.com/apache/spark/pull/43691#issuecomment-1797718785 cc @sunchao , too

Re: [PR] [SPARK-45804][UI] Add spark.ui.threadDump.flamegraphEnabled config to switch flame graph on/off [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun commented on PR #43674: URL: https://github.com/apache/spark/pull/43674#issuecomment-1797718287 Merged to master.

Re: [PR] [SPARK-45804][UI] Add spark.ui.threadDump.flamegraphEnabled config to switch flame graph on/off [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun closed pull request #43674: [SPARK-45804][UI] Add spark.ui.threadDump.flamegraphEnabled config to switch flame graph on/off URL: https://github.com/apache/spark/pull/43674

Re: [PR] [SPARK-45814][CONNECT][SQL] Make ArrowConverters.createEmptyArrowBatch call hasNext to avoid memory leak [spark]

2023-11-06 Thread via GitHub
xieshuaihu commented on PR #43691: URL: https://github.com/apache/spark/pull/43691#issuecomment-1797666888 cc @HyukjinKwon @dongjoon-hyun

[PR] [SPARK-45815][SQL][Streaming] Provide an interface for other Streaming sources to add `_metadata` columns [spark]

2023-11-06 Thread via GitHub
Yaohua628 opened a new pull request, #43692: URL: https://github.com/apache/spark/pull/43692 ### What changes were proposed in this pull request? Currently, only the native V1 file-based streaming source can read the `_metadata` column:

Re: [PR] [SPARK-45686][INFRA][CORE][SQL][SS][CONNECT][MLLIB][DSTREAM][AVRO][ML][K8S][YARN][PYTHON][R][UI][GRAPHX][PROTOBUF][TESTS][EXAMPLES] Explicitly convert `Array` to `Seq` when function input is

2023-11-06 Thread via GitHub
LuciferYang commented on PR #43670: URL: https://github.com/apache/spark/pull/43670#issuecomment-1797423492 [bb5f3d4](https://github.com/apache/spark/pull/43670/commits/bb5f3d4f96c7315d98fc9c75cbad26890dfc) fix examples part

Re: [PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]

2023-11-06 Thread via GitHub
cloud-fan commented on code in PR #43630: URL: https://github.com/apache/spark/pull/43630#discussion_r1384308595 ## sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: ## @@ -208,10 +209,45 @@ class DataFrameReader private[sql](sparkSession: SparkSession)

Re: [PR] [SPARK-45813][CONNECT][PYTHON] Return the observed metrics from commands [spark]

2023-11-06 Thread via GitHub
beliefer commented on code in PR #43690: URL: https://github.com/apache/spark/pull/43690#discussion_r1384289821 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala: ## @@ -162,6 +162,18 @@ private[connect] class

Re: [PR] [SPARK-45798][CONNECT] Assert server-side session ID [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43664: URL: https://github.com/apache/spark/pull/43664#discussion_r1384277531 ## python/pyspark/sql/connect/client/core.py: ## @@ -1620,6 +1593,42 @@ def cache_artifact(self, blob: bytes) -> str: return

Re: [PR] [SPARK-45808][CONNECT][PYTHON] Better error handling for SQL Exceptions [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43667: URL: https://github.com/apache/spark/pull/43667#discussion_r1384270902 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectFetchErrorDetailsHandler.scala: ## @@ -46,9 +44,7 @@ class

Re: [PR] [SPARK-45811][PYTHON][DOCS] Refine docstring of `from_xml` [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43680: URL: https://github.com/apache/spark/pull/43680#discussion_r1384266732 ## python/pyspark/sql/functions.py: ## @@ -13549,6 +13549,8 @@ def json_object_keys(col: "ColumnOrName") -> Column: return

Re: [PR] [SPARK-45258][PYTHON][DOCS] Refine docstring of `sum` [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43684: URL: https://github.com/apache/spark/pull/43684#discussion_r1384265181 ## python/pyspark/sql/functions.py: ## @@ -1197,13 +1197,27 @@ def sum(col: "ColumnOrName") -> Column: Examples +Example 1: Calculating

Re: [PR] [SPARK-45259][PYTHON][DOCS] Refine docstring of `count` [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43685: URL: https://github.com/apache/spark/pull/43685#discussion_r1384263186 ## python/pyspark/sql/functions.py: ## @@ -1162,15 +1162,48 @@ def count(col: "ColumnOrName") -> Column: Examples -Count by all columns

[PR] [SPARK-45814][CONNECT][CORE] Make ArrowConverters.createEmptyArrowBatch call hasNext to avoid memory leak [spark]

2023-11-06 Thread via GitHub
xieshuaihu opened a new pull request, #43691: URL: https://github.com/apache/spark/pull/43691 ### What changes were proposed in this pull request? Make ArrowConverters.createEmptyArrowBatch call hasNext to avoid memory leak. ### Why are the changes needed?

Re: [PR] [SPARK-45260][PYTHON][DOCS] Refine docstring of `count_distinct` [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43686: URL: https://github.com/apache/spark/pull/43686#discussion_r1384260727 ## python/pyspark/sql/functions.py: ## @@ -4626,26 +4626,38 @@ def count_distinct(col: "ColumnOrName", *cols: "ColumnOrName") -> Column: Examples

Re: [PR] [SPARK-45804][UI] Add spark.ui.threadDump.flamegraphEnabled config to switch flame graph on/off [spark]

2023-11-06 Thread via GitHub
yaooqinn commented on code in PR #43674: URL: https://github.com/apache/spark/pull/43674#discussion_r1384260744 ## core/src/main/scala/org/apache/spark/internal/config/UI.scala: ## @@ -97,6 +97,12 @@ private[spark] object UI { .booleanConf .createWithDefault(true) +

Re: [PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43682: URL: https://github.com/apache/spark/pull/43682#discussion_r1384255664 ## python/pyspark/worker.py: ## @@ -1057,6 +1059,9 @@ def mapper(_, it): yield from eval(*[a[o] for o in args_kwargs_offsets])

Re: [PR] [SPARK-45796][SQL] Support MODE() WITHIN GROUP (ORDER BY col) [spark]

2023-11-06 Thread via GitHub
beliefer commented on PR #43663: URL: https://github.com/apache/spark/pull/43663#issuecomment-1797146203 > We can write order by in windowNameOrSpecification, why do we need ORDER BY sortSpecification after WITHIN GROUP The scope of `ORDER BY sortSpecification` is different from

Re: [PR] [SPARK-45222][PYTHON][DOCS] Refine docstring of `DataFrameReader.json` [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43687: URL: https://github.com/apache/spark/pull/43687#discussion_r1384245409 ## python/pyspark/sql/readwriter.py: ## @@ -380,22 +380,72 @@ def json( Examples -Write a DataFrame into a JSON file and

Re: [PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43682: URL: https://github.com/apache/spark/pull/43682#discussion_r1384244315 ## python/pyspark/worker.py: ## @@ -1057,6 +1059,9 @@ def mapper(_, it): yield from eval(*[a[o] for o in args_kwargs_offsets])

Re: [PR] [SPARK-45813][CONNECT][PYTHON] Return the observed metrics from commands [spark]

2023-11-06 Thread via GitHub
ueshin commented on PR #43690: URL: https://github.com/apache/spark/pull/43690#issuecomment-1797143420 cc @HyukjinKwon @zhengruifeng @beliefer

[PR] [SPARK-45813][CONNECT][PYTHON] Return the observed metrics from commands [spark]

2023-11-06 Thread via GitHub
ueshin opened a new pull request, #43690: URL: https://github.com/apache/spark/pull/43690 ### What changes were proposed in this pull request? Returns the observed metrics from commands. ### Why are the changes needed? Currently the observed metrics on commands are not

Re: [PR] [SPARK-43402][SQL] FileSourceScanExec supports push down data filter with scalar subquery [spark]

2023-11-06 Thread via GitHub
ulysses-you commented on PR #41088: URL: https://github.com/apache/spark/pull/41088#issuecomment-1797129478 @epa095 yes, updated that jira status

Re: [PR] [SPARK-45708][BUILD] Retry mvn deploy [spark]

2023-11-06 Thread via GitHub
LuciferYang commented on PR #43559: URL: https://github.com/apache/spark/pull/43559#issuecomment-1797103474 Will adding `-Dmaven.resolver.transport=wagon` to `MAVEN_OPTS` have any effect? The default resolver transport in Maven 3.9 has changed from wagon to httpclient

Re: [PR] [SPARK-45222][PYTHON][DOCS] Refine docstring of `DataFrameReader.json` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon commented on code in PR #43687: URL: https://github.com/apache/spark/pull/43687#discussion_r1384213317 ## python/pyspark/sql/readwriter.py: ## @@ -380,22 +380,72 @@ def json( Examples -Write a DataFrame into a JSON file and read

Re: [PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]

2023-11-06 Thread via GitHub
dtenedor commented on code in PR #43682: URL: https://github.com/apache/spark/pull/43682#discussion_r1384198085 ## python/pyspark/worker.py: ## @@ -995,6 +995,8 @@ def verify_result(result): def func(*a: Any) -> Any: try:

Re: [PR] [SPARK-43242] Fix throw 'Unexpected type of BlockId' in shuffle corruption diagnose [spark]

2023-11-06 Thread via GitHub
github-actions[bot] closed pull request #40921: [SPARK-43242] Fix throw 'Unexpected type of BlockId' in shuffle corruption diagnose URL: https://github.com/apache/spark/pull/40921

Re: [PR] [SPARK-45222][PYTHON][DOCS] Refine docstring of `DataFrameReader.json` [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43687: URL: https://github.com/apache/spark/pull/43687#discussion_r1384164698 ## python/pyspark/sql/readwriter.py: ## @@ -380,22 +380,72 @@ def json( Examples -Write a DataFrame into a JSON file and

Re: [PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]

2023-11-06 Thread via GitHub
allisonwang-db commented on code in PR #43682: URL: https://github.com/apache/spark/pull/43682#discussion_r1384161755 ## python/pyspark/worker.py: ## @@ -995,6 +995,8 @@ def verify_result(result): def func(*a: Any) -> Any: try:

Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-06 Thread via GitHub
abellina commented on PR #43627: URL: https://github.com/apache/spark/pull/43627#issuecomment-1796931246 @tgravescs @mridulm @beliefer I made a small tweak where the `executorEnvs` map in the `SparkContext` is populated with the configuration prefix `spark.executorEnv.*` after the driver

[PR] [SPARK-45812][BUILD][PYTHON][PS] Upgrade Pandas to 2.1.2 [spark]

2023-11-06 Thread via GitHub
itholic opened a new pull request, #43689: URL: https://github.com/apache/spark/pull/43689 ### What changes were proposed in this pull request? This PR proposes to upgrade Pandas to 2.1.2. See https://pandas.pydata.org/docs/dev/whatsnew/v2.1.2.html for detail ### Why

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2023-11-06 Thread via GitHub
tgravescs commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1384004018 ## core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala: ## @@ -191,7 +191,10 @@ private[spark] class CoarseGrainedExecutorBackend(

Re: [PR] [SPARK-45223][PYTHON][DOCS] Refine docstring of `Column.when` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon commented on PR #43688: URL: https://github.com/apache/spark/pull/43688#issuecomment-1796478652 This is last from me today :-).

[PR] [SPARK-45223][PYTHON][DOCS] Refine docstring of `Column.when` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon opened a new pull request, #43688: URL: https://github.com/apache/spark/pull/43688 ### What changes were proposed in this pull request? This PR proposes to improve the docstring of `Column.when`. ### Why are the changes needed? For end users, and better

Re: [PR] [SPARK-45222][PYTHON][DOCS] Refine docstring of `DataFrameReader.json` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon commented on PR #43687: URL: https://github.com/apache/spark/pull/43687#issuecomment-1796414232 Thank you @dongjoon-hyun !!! @allisonwang-db BTW do you plan to do this for all other functions, or some frequently used only? With my PRs, (almost) all under SPARK-44728

Re: [PR] [SPARK-45803][CORE] Remove the no longer used `RpcAbortException` [spark]

2023-11-06 Thread via GitHub
dongjoon-hyun closed pull request #43673: [SPARK-45803][CORE] Remove the no longer used `RpcAbortException` URL: https://github.com/apache/spark/pull/43673

Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-06 Thread via GitHub
tgravescs commented on PR #43627: URL: https://github.com/apache/spark/pull/43627#issuecomment-1796408368 I agree that ideally we would finish SPARK-25299, but I don't see that happening anytime soon. I also don't think it covers the case of people replacing the entire ShuffleManager vs just

Re: [PR] [SPARK-45222][PYTHON][DOCS] Refine docstring of `DataFrameReader.json` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon commented on code in PR #43687: URL: https://github.com/apache/spark/pull/43687#discussion_r1383959348 ## python/pyspark/sql/readwriter.py: ## @@ -380,22 +380,72 @@ def json( Examples -Write a DataFrame into a JSON file and read

[PR] [SPARK-45222][PYTHON][DOCS] Refine docstring of `DataFrameReader.json` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon opened a new pull request, #43687: URL: https://github.com/apache/spark/pull/43687 ### What changes were proposed in this pull request? This PR proposes to improve the docstring of `DataFrameReader.json`. ### Why are the changes needed? For end users, and

[PR] [SPARK-45260][PYTHON][DOCS] Refine docstring of `count_distinct` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon opened a new pull request, #43686: URL: https://github.com/apache/spark/pull/43686 ### What changes were proposed in this pull request? This PR proposes to improve the docstring of `count_distinct`. ### Why are the changes needed? For end users, and better

Re: [PR] [SPARK-45805][SQL] Make `withOrigin` more generic [spark]

2023-11-06 Thread via GitHub
peter-toth commented on PR #43671: URL: https://github.com/apache/spark/pull/43671#issuecomment-1796379533 Merged to `master` (4.0), thanks @MaxGekk for the fix and @HyukjinKwon for the review.

Re: [PR] [SPARK-45805][SQL] Make `withOrigin` more generic [spark]

2023-11-06 Thread via GitHub
peter-toth commented on PR #43671: URL: https://github.com/apache/spark/pull/43671#issuecomment-1796375625 @HyukjinKwon, yes, those failures seem unrelated. I'm happy to merge it, but I've tested my permissions already... ;)

[PR] [SPARK-45259][PYTHON][DOCS] Refine docstring of `count` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon opened a new pull request, #43685: URL: https://github.com/apache/spark/pull/43685 ### What changes were proposed in this pull request? This PR proposes to improve the docstring of `count`. ### Why are the changes needed? For end users, and better usability

Re: [PR] [SPARK-45805][SQL] Make `withOrigin` more generic [spark]

2023-11-06 Thread via GitHub
peter-toth closed pull request #43671: [SPARK-45805][SQL] Make `withOrigin` more generic URL: https://github.com/apache/spark/pull/43671

Re: [PR] [SPARK-45709][BUILD] Deploy packages when all packages are built [spark]

2023-11-06 Thread via GitHub
EnricoMi commented on PR #43561: URL: https://github.com/apache/spark/pull/43561#issuecomment-1796362720 @LuciferYang @HyukjinKwon the publish snapshot workflow keeps failing due to HTTP errors, which still causes inconsistent snapshot packages:

Re: [PR] [SPARK-45708][BUILD] Retry mvn deploy [spark]

2023-11-06 Thread via GitHub
EnricoMi commented on PR #43559: URL: https://github.com/apache/spark/pull/43559#issuecomment-1796354605 @LuciferYang @HyukjinKwon the publish snapshot workflow keeps failing due to HTTP errors: https://github.com/apache/spark/actions/workflows/publish_snapshot.yml Please consider

Re: [PR] [WIP][SPARK-45770][SQL][PYTHON][CONNECT] Introduce logical plan `UnresolvedDropColumns` for `Dataframe.drop` [spark]

2023-11-06 Thread via GitHub
zhengruifeng commented on code in PR #43683: URL: https://github.com/apache/spark/pull/43683#discussion_r1383923949 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -235,6 +235,23 @@ object Project { } } +case

[PR] [SPARK-45258][PYTHON][DOCS] Refine docstring of `sum` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon opened a new pull request, #43684: URL: https://github.com/apache/spark/pull/43684 ### What changes were proposed in this pull request? This PR proposes to improve the docstring of `sum`. ### Why are the changes needed? For end users, and better usability of

[PR] [WIP][SPARK-45770][SQL][PYTHON][CONNECT] Introduce logical plan `UnresolvedDropColumns` for `Dataframe.drop` [spark]

2023-11-06 Thread via GitHub
zhengruifeng opened a new pull request, #43683: URL: https://github.com/apache/spark/pull/43683 ### What changes were proposed in this pull request? Fix column resolution in DataFrame.drop ### Why are the changes needed? ``` from pyspark.sql.functions import

Re: [PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]

2023-11-06 Thread via GitHub
dtenedor commented on PR #43682: URL: https://github.com/apache/spark/pull/43682#issuecomment-1796329902 cc @ueshin @allisonwang-db

[PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]

2023-11-06 Thread via GitHub
dtenedor opened a new pull request, #43682: URL: https://github.com/apache/spark/pull/43682 ### What changes were proposed in this pull request? This PR creates a Python UDTF API to stop consuming rows from the input table. If the UDTF raises a `StopIteration` exception in the

[PR] [SPARK-45186][PYTHON][DOCS] Refine docstring of `schema_of_xml` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon opened a new pull request, #43681: URL: https://github.com/apache/spark/pull/43681 ### What changes were proposed in this pull request? This PR proposes to improve the docstring of `schema_of_xml`. ### Why are the changes needed? For end users, and better

[PR] [SPARK-45809][PYTHON][DOCS] Refine docstring of `from_xml` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon opened a new pull request, #43680: URL: https://github.com/apache/spark/pull/43680 ### What changes were proposed in this pull request? This PR proposes to improve the docstring of `from_xml`. ### Why are the changes needed? For end users, and better

Re: [PR] [SPARK-45808][CONNECT][PYTHON] Better error handling for SQL Exceptions [spark]

2023-11-06 Thread via GitHub
grundprinzip commented on code in PR #43667: URL: https://github.com/apache/spark/pull/43667#discussion_r1383889882 ## python/pyspark/errors/exceptions/connect.py: ## @@ -16,7 +16,7 @@ # import pyspark.sql.connect.proto as pb2 import json -from typing import Dict, List,

Re: [PR] [SPARK-45798][CONNECT] Assert server-side session ID [spark]

2023-11-06 Thread via GitHub
grundprinzip commented on code in PR #43664: URL: https://github.com/apache/spark/pull/43664#discussion_r1383886945 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAnalyzeHandler.scala: ## @@ -201,7 +201,9 @@ private[connect] class

[PR] [SPARK-45809][PYTHON][DOCS] Refine docstring of `lit` [spark]

2023-11-06 Thread via GitHub
HyukjinKwon opened a new pull request, #43679: URL: https://github.com/apache/spark/pull/43679 ### What changes were proposed in this pull request? This PR proposes to improve the docstring of `lit`. ### Why are the changes needed? For end users, and better usability of

Re: [PR] [SPARK-45798][CONNECT] Assert server-side session ID [spark]

2023-11-06 Thread via GitHub
grundprinzip commented on code in PR #43664: URL: https://github.com/apache/spark/pull/43664#discussion_r1383865603 ## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala: ## @@ -179,6 +184,9 @@ class ArtifactManager( val

Re: [PR] [SPARK-45791][CONNECT][TESTS] Rename `SparkConnectSessionHodlerSuite.scala` to `SparkConnectSessionHolderSuite.scala` [spark]

2023-11-06 Thread via GitHub
rangadi commented on PR #43657: URL: https://github.com/apache/spark/pull/43657#issuecomment-1796152154 Thanks for fixing this!

Re: [PR] [SPARK-45808][CONNECT][PYTHON] Better error handling for SQL Exceptions [spark]

2023-11-06 Thread via GitHub
HyukjinKwon commented on code in PR #43667: URL: https://github.com/apache/spark/pull/43667#discussion_r1383863124 ## python/pyspark/errors/exceptions/connect.py: ## @@ -16,7 +16,7 @@ # import pyspark.sql.connect.proto as pb2 import json -from typing import Dict, List,

Re: [PR] [SPARK-45798][CONNECT] Assert server-side session ID [spark]

2023-11-06 Thread via GitHub
grundprinzip commented on code in PR #43664: URL: https://github.com/apache/spark/pull/43664#discussion_r1383861338 ## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CustomSparkConnectBlockingStub.scala: ## @@ -18,13 +18,93 @@ package

Re: [PR] [SPARK-44751][SQL] Move `XSDToSchema` from `catalyst` to `core` package [spark]

2023-11-06 Thread via GitHub
HyukjinKwon commented on PR #43652: URL: https://github.com/apache/spark/pull/43652#issuecomment-1796140022 tests passed at https://github.com/shujingyang-db/spark/actions/runs/6750039898 Merged to master.

Re: [PR] [SPARK-44751][SQL] Move `XSDToSchema` from `catalyst` to `core` package [spark]

2023-11-06 Thread via GitHub
HyukjinKwon closed pull request #43652: [SPARK-44751][SQL] Move `XSDToSchema` from `catalyst` to `core` package URL: https://github.com/apache/spark/pull/43652

Re: [PR] [SPARK-44886][SQL] Introduce CLUSTER BY clause for CREATE/REPLACE TABLE [spark]

2023-11-06 Thread via GitHub
imback82 commented on code in PR #42577: URL: https://github.com/apache/spark/pull/42577#discussion_r1383858771 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -3973,18 +4000,31 @@ class AstBuilder extends DataTypeAstBuilder with

Re: [PR] [SPARK-45808][CONNECT][PYTHON] Better error handling for SQL Exceptions [spark]

2023-11-06 Thread via GitHub
grundprinzip commented on PR #43667: URL: https://github.com/apache/spark/pull/43667#issuecomment-1796129182 Filed SPARK-45808

Re: [PR] [SPARK-XXX][CONNECT][PYTHON] Better error handling for SQL Exceptions [spark]

2023-11-06 Thread via GitHub
HyukjinKwon commented on PR #43667: URL: https://github.com/apache/spark/pull/43667#issuecomment-1796120836 Oh let's also file a JIRA btw

Re: [PR] [MINOR][INFRA] Correct Java version in RM Dockerfile description [spark]

2023-11-06 Thread via GitHub
HyukjinKwon closed pull request #43669: [MINOR][INFRA] Correct Java version in RM Dockerfile description URL: https://github.com/apache/spark/pull/43669

Re: [PR] [SPARK-44886][SQL] Introduce CLUSTER BY clause for CREATE/REPLACE TABLE [spark]

2023-11-06 Thread via GitHub
imback82 commented on code in PR #42577: URL: https://github.com/apache/spark/pull/42577#discussion_r1383855097 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala: ## @@ -170,6 +170,23 @@ case class CatalogTablePartition( } } +/** + * A

Re: [PR] [MINOR][INFRA] Correct Java version in RM Dockerfile description [spark]

2023-11-06 Thread via GitHub
HyukjinKwon commented on PR #43669: URL: https://github.com/apache/spark/pull/43669#issuecomment-1796115085 Merged to master.

Re: [PR] [SPARK-44886][SQL] Introduce CLUSTER BY clause for CREATE/REPLACE TABLE [spark]

2023-11-06 Thread via GitHub
imback82 commented on code in PR #42577: URL: https://github.com/apache/spark/pull/42577#discussion_r1383854471 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala: ## @@ -253,7 +270,8 @@ case class CatalogTable( tracksPartitionsInCatalog:

Re: [PR] [SPARK-45805][SQL] Make `withOrigin` more generic [spark]

2023-11-06 Thread via GitHub
HyukjinKwon commented on PR #43671: URL: https://github.com/apache/spark/pull/43671#issuecomment-1796110374 test failure seems unrelated (https://github.com/MaxGekk/spark/actions/runs/6772663701/job/18414028124). @peter-toth wanna try merging a PR?

Re: [PR] [SPARK-45786][SQL] Fix inaccurate Decimal multiplication and division results [spark]

2023-11-06 Thread via GitHub
HyukjinKwon commented on PR #43678: URL: https://github.com/apache/spark/pull/43678#issuecomment-1796094167 test: https://github.com/kazuyukitanimura/spark/actions/runs/6775292999/job/18414265284
