[GitHub] [spark] LuciferYang opened a new pull request, #40737: [SPARK-43093][SQL][TESTS] Refactor `Add a directory when spark.sql.legacy.addSingleFileInAddFile set to false` to use random directories

2023-04-10 Thread via GitHub
LuciferYang opened a new pull request, #40737: URL: https://github.com/apache/spark/pull/40737 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ###

[GitHub] [spark] amaliujia commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
amaliujia commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162331907 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -820,11 +821,13 @@ case class Divide( } private lazy val

[GitHub] [spark] amaliujia commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
amaliujia commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162330879 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -820,11 +821,13 @@ case class Divide( } private lazy val

[GitHub] [spark] amaliujia commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
amaliujia commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162330656 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/decimalExpressions.scala: ## @@ -273,7 +274,8 @@ case class DecimalDivideWithOverflowCheck(

[GitHub] [spark] amaliujia commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
amaliujia commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162330498 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -820,11 +821,13 @@ case class Divide( } private lazy val

[GitHub] [spark] WweiL commented on pull request #40691: [SPARK-43031] [SS] [Connect] Enable unit test and doctest for streaming

2023-04-10 Thread via GitHub
WweiL commented on PR #40691: URL: https://github.com/apache/spark/pull/40691#issuecomment-1502715640 Hi @HyukjinKwon could you please take another look? Thanks!

[GitHub] [spark] pengzhon-db opened a new pull request, #40736: [SPARK-43084] [SS] Add applyInPandasWithState support for spark connect

2023-04-10 Thread via GitHub
pengzhon-db opened a new pull request, #40736: URL: https://github.com/apache/spark/pull/40736 ### What changes were proposed in this pull request? This change adds applyInPandasWithState support for Spark connect. Example (try with local mode `./bin/pyspark --remote
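For readers unfamiliar with the operator, the sketch below shows the existing PySpark `applyInPandasWithState` API that this PR routes through Spark Connect; the rate source, column names, and per-key count state are illustrative assumptions, not code from the PR.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

spark = SparkSession.builder.getOrCreate()  # or a `--remote` Connect session, as in the PR

# Hypothetical streaming input: key each row into one of two groups.
events = (spark.readStream.format("rate").load()
          .withColumn("key", (F.col("value") % 2).cast("string")))

def count_per_key(key, pdf_iter, state: GroupState):
    # Carry a running count per key across micro-batches in the group state.
    running = state.get[0] if state.exists else 0
    for pdf in pdf_iter:
        running += len(pdf)
    state.update((running,))
    yield pd.DataFrame({"key": [key[0]], "count": [running]})

counts = events.groupBy("key").applyInPandasWithState(
    count_per_key,
    outputStructType="key string, count long",
    stateStructType="count long",
    outputMode="Update",
    timeoutConf=GroupStateTimeout.NoTimeout,
)
query = counts.writeStream.format("console").outputMode("update").start()
```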

[GitHub] [spark] LuciferYang opened a new pull request, #40735: [SPARK-43092][CONNECT] Clean up unimplemented `dropDuplicatesWithinWatermark` series functions from `Dataset`

2023-04-10 Thread via GitHub
LuciferYang opened a new pull request, #40735: URL: https://github.com/apache/spark/pull/40735 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ###

[GitHub] [spark] cloud-fan commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162318834 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/decimalExpressions.scala: ## @@ -273,7 +274,8 @@ case class DecimalDivideWithOverflowCheck(

[GitHub] [spark] cloud-fan commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162318609 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -820,11 +821,13 @@ case class Divide( } private lazy val

[GitHub] [spark] LuciferYang commented on pull request #40721: [SPARK-43080][BUILD] Upgrade `zstd-jni` to 1.5.5-1

2023-04-10 Thread via GitHub
LuciferYang commented on PR #40721: URL: https://github.com/apache/spark/pull/40721#issuecomment-1502708533 > New results look reasonable. I have been in a team meeting this morning. It seems that the results of `ZStandardBenchmark` are somewhat related to the CPU model. --

[GitHub] [spark] cloud-fan commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162310377 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -821,10 +822,11 @@ case class Divide( private lazy val div:

[GitHub] [spark] cloud-fan commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162309743 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -821,10 +822,11 @@ case class Divide( private lazy val div:

[GitHub] [spark] cloud-fan commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162309432 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -821,10 +822,11 @@ case class Divide( private lazy val div:

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

2023-04-10 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40677: URL: https://github.com/apache/spark/pull/40677#discussion_r1162288743 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -176,6 +186,23 @@ trait FileFormat { * By default all

[GitHub] [spark] aokolnychyi commented on pull request #40734: [SPARK-43088][SQL] Respect RequiresDistributionAndOrdering in CTAS/RTAS

2023-04-10 Thread via GitHub
aokolnychyi commented on PR #40734: URL: https://github.com/apache/spark/pull/40734#issuecomment-1502642472 @huaxingao @cloud-fan @dongjoon-hyun @sunchao @viirya @gengliangwang, could you take a look at the approach used in this PR and let me know what you think? If it seems reasonable,

[GitHub] [spark] aokolnychyi commented on a diff in pull request #40734: [SPARK-43088][SQL] Respect RequiresDistributionAndOrdering in CTAS/RTAS

2023-04-10 Thread via GitHub
aokolnychyi commented on code in PR #40734: URL: https://github.com/apache/spark/pull/40734#discussion_r1162267054 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2StageTables.scala: ## @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] cloud-fan commented on a diff in pull request #40707: [SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40707: URL: https://github.com/apache/spark/pull/40707#discussion_r1162266420 ## core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala: ## @@ -929,6 +929,13 @@ private[spark] class TaskSetManager( info.id,

[GitHub] [spark] cloud-fan commented on a diff in pull request #40707: [SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40707: URL: https://github.com/apache/spark/pull/40707#discussion_r1162266204 ## core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala: ## @@ -929,6 +929,13 @@ private[spark] class TaskSetManager( info.id,

[GitHub] [spark] aokolnychyi commented on a diff in pull request #40734: [SPARK-43088][SQL] Respect RequiresDistributionAndOrdering in CTAS/RTAS

2023-04-10 Thread via GitHub
aokolnychyi commented on code in PR #40734: URL: https://github.com/apache/spark/pull/40734#discussion_r1162265053 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala: ## @@ -184,19 +175,23 @@ class DataSourceV2Strategy(session:

[GitHub] [spark] cloud-fan commented on a diff in pull request #40707: [SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40707: URL: https://github.com/apache/spark/pull/40707#discussion_r1162264814 ## core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala: ## @@ -929,6 +929,13 @@ private[spark] class TaskSetManager( info.id,

[GitHub] [spark] aokolnychyi commented on a diff in pull request #40734: [SPARK-43088][SQL] Respect RequiresDistributionAndOrdering in CTAS/RTAS

2023-04-10 Thread via GitHub
aokolnychyi commented on code in PR #40734: URL: https://github.com/apache/spark/pull/40734#discussion_r1162264468 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala: ## @@ -99,16 +100,6 @@ class DataSourceV2Strategy(session:

[GitHub] [spark] aokolnychyi opened a new pull request, #40734: [SPARK-43088][SQL] Respect RequiresDistributionAndOrdering in CTAS/RTAS

2023-04-10 Thread via GitHub
aokolnychyi opened a new pull request, #40734: URL: https://github.com/apache/spark/pull/40734 ### What changes were proposed in this pull request? This PR moves table staging during CTAS/RTAS into the optimizer so that the `V2Writes` rule would distribute and order

[GitHub] [spark] cloud-fan commented on a diff in pull request #40707: [SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40707: URL: https://github.com/apache/spark/pull/40707#discussion_r1162263526 ## core/src/main/scala/org/apache/spark/SparkException.scala: ## @@ -355,3 +355,24 @@ private[spark] class SparkSQLFeatureNotSupportedException( override def

[GitHub] [spark] HyukjinKwon closed pull request #40733: [SPARK-43089][CONNECT] Redact debug string in UI

2023-04-10 Thread via GitHub
HyukjinKwon closed pull request #40733: [SPARK-43089][CONNECT] Redact debug string in UI URL: https://github.com/apache/spark/pull/40733

[GitHub] [spark] HyukjinKwon commented on pull request #40733: [SPARK-43089][CONNECT] Redact debug string in UI

2023-04-10 Thread via GitHub
HyukjinKwon commented on PR #40733: URL: https://github.com/apache/spark/pull/40733#issuecomment-1502632324 Merged to master.

[GitHub] [spark] warrenzhu25 commented on pull request #40730: [SPARK-43086][CORE] Support bin pack task scheduling on executors

2023-04-10 Thread via GitHub
warrenzhu25 commented on PR #40730: URL: https://github.com/apache/spark/pull/40730#issuecomment-1502631790 > I understand the intention but there is a chance of instability due to `OutOfDisk` and sometimes `OutOfMemory`. In addition, bin-packed executors could work slower due to the

[GitHub] [spark] cloud-fan commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40677: URL: https://github.com/apache/spark/pull/40677#discussion_r1162257020 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -176,6 +186,23 @@ trait FileFormat { * By default all field name is

[GitHub] [spark] wangyum commented on pull request #40731: [SPARK-43087][SQL] Support coalesce buckets in join in AQE

2023-04-10 Thread via GitHub
wangyum commented on PR #40731: URL: https://github.com/apache/spark/pull/40731#issuecomment-1502627560 cc @cloud-fan

[GitHub] [spark] cloud-fan commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40677: URL: https://github.com/apache/spark/pull/40677#discussion_r1162254984 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -176,6 +186,23 @@ trait FileFormat { * By default all field name is

[GitHub] [spark] amaliujia commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
amaliujia commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162254891 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -902,152 +903,191 @@ case class Cast( } // LongConverter -

[GitHub] [spark] cloud-fan commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40677: URL: https://github.com/apache/spark/pull/40677#discussion_r1162251053 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala: ## @@ -554,6 +555,31 @@ object FileSourceMetadataAttribute {

[GitHub] [spark] yaooqinn commented on pull request #40718: [SPARK-43077][SQL] Improve the error message of UNRECOGNIZED_SQL_TYPE

2023-04-10 Thread via GitHub
yaooqinn commented on PR #40718: URL: https://github.com/apache/spark/pull/40718#issuecomment-1502613838 thanks, merged to master

[GitHub] [spark] yaooqinn closed pull request #40718: [SPARK-43077][SQL] Improve the error message of UNRECOGNIZED_SQL_TYPE

2023-04-10 Thread via GitHub
yaooqinn closed pull request #40718: [SPARK-43077][SQL] Improve the error message of UNRECOGNIZED_SQL_TYPE URL: https://github.com/apache/spark/pull/40718

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40724: [SPARK-43081] [ML] [CONNECT] Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread via GitHub
WeichenXu123 commented on code in PR #40724: URL: https://github.com/apache/spark/pull/40724#discussion_r1162236929 ## python/pyspark/ml/torch/distributor.py: ## @@ -744,7 +814,99 @@ def run(self, train_object: Union[Callable, str], *args: Any) -> Optional[Any]:

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40724: [SPARK-43081] [ML] [CONNECT] Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread via GitHub
WeichenXu123 commented on code in PR #40724: URL: https://github.com/apache/spark/pull/40724#discussion_r1162236025 ## python/pyspark/ml/torch/distributor.py: ## @@ -744,7 +814,99 @@ def run(self, train_object: Union[Callable, str], *args: Any) -> Optional[Any]:

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40724: [SPARK-43081] [ML] [CONNECT] Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread via GitHub
WeichenXu123 commented on code in PR #40724: URL: https://github.com/apache/spark/pull/40724#discussion_r1162235246 ## python/pyspark/ml/tests/connect/test_parity_torch_data_loader.py: ## @@ -0,0 +1,52 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] [spark] rithwik-db commented on a diff in pull request #40724: [SPARK-43081] [ML] [CONNECT] Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread via GitHub
rithwik-db commented on code in PR #40724: URL: https://github.com/apache/spark/pull/40724#discussion_r1162233821 ## python/pyspark/ml/torch/distributor.py: ## @@ -744,7 +814,99 @@ def run(self, train_object: Union[Callable, str], *args: Any) -> Optional[Any]:

[GitHub] [spark] rithwik-db commented on a diff in pull request #40724: [SPARK-43081] [ML] [CONNECT] Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread via GitHub
rithwik-db commented on code in PR #40724: URL: https://github.com/apache/spark/pull/40724#discussion_r1162233394 ## python/pyspark/ml/torch/distributor.py: ## @@ -744,7 +814,99 @@ def run(self, train_object: Union[Callable, str], *args: Any) -> Optional[Any]:

[GitHub] [spark] rithwik-db commented on a diff in pull request #40724: [SPARK-43081] [ML] [CONNECT] Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread via GitHub
rithwik-db commented on code in PR #40724: URL: https://github.com/apache/spark/pull/40724#discussion_r1162233394 ## python/pyspark/ml/torch/distributor.py: ## @@ -744,7 +814,99 @@ def run(self, train_object: Union[Callable, str], *args: Any) -> Optional[Any]:

[GitHub] [spark] dongjoon-hyun closed pull request #40723: [SPARK-43090][CONNECT][TESTS] Move `withTable` from `RemoteSparkSession` to `SQLHelper`

2023-04-10 Thread via GitHub
dongjoon-hyun closed pull request #40723: [SPARK-43090][CONNECT][TESTS] Move `withTable` from `RemoteSparkSession` to `SQLHelper` URL: https://github.com/apache/spark/pull/40723

[GitHub] [spark] LuciferYang commented on pull request #40726: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.6

2023-04-10 Thread via GitHub
LuciferYang commented on PR #40726: URL: https://github.com/apache/spark/pull/40726#issuecomment-1502587494 late LGTM ~ Thanks @dongjoon-hyun and all ~

[GitHub] [spark] LuciferYang commented on pull request #40723: [SPARK-43090][CONNECT][TESTS] Move `withTable` from `RemoteSparkSession` to `SQLHelper`

2023-04-10 Thread via GitHub
LuciferYang commented on PR #40723: URL: https://github.com/apache/spark/pull/40723#issuecomment-1502586204 > Could you file a JIRA for this, @LuciferYang ? This contribution looks enough to have a JIRA issue. @dongjoon-hyun thanks for your suggestion ~ created SPARK-43090

[GitHub] [spark] rithwik-db commented on a diff in pull request #40724: [SPARK-43081] [ML] [CONNECT] Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread via GitHub
rithwik-db commented on code in PR #40724: URL: https://github.com/apache/spark/pull/40724#discussion_r1162227496 ## python/pyspark/ml/tests/connect/test_parity_torch_data_loader.py: ## @@ -0,0 +1,52 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +#

[GitHub] [spark] cloud-fan commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162225985 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -902,152 +903,191 @@ case class Cast( } // LongConverter -

[GitHub] [spark] cloud-fan commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162225985 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -902,152 +903,191 @@ case class Cast( } // LongConverter -

[GitHub] [spark] yaooqinn commented on a diff in pull request #40718: [SPARK-43077][SQL] Improve the error message of UNRECOGNIZED_SQL_TYPE

2023-04-10 Thread via GitHub
yaooqinn commented on code in PR #40718: URL: https://github.com/apache/spark/pull/40718#discussion_r1162219705 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -177,68 +177,56 @@ object JdbcUtils extends Logging with

[GitHub] [spark] HyukjinKwon opened a new pull request, #40733: [SPARK-43089][CONNECT] Redact debug string in UI

2023-04-10 Thread via GitHub
HyukjinKwon opened a new pull request, #40733: URL: https://github.com/apache/spark/pull/40733 ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/40603 which redacts the debug string shown in UI. ### Why are the

[GitHub] [spark] amaliujia commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
amaliujia commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162212156 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -902,152 +903,191 @@ case class Cast( } // LongConverter -

[GitHub] [spark] amaliujia commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
amaliujia commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162212156 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -902,152 +903,191 @@ case class Cast( } // LongConverter -

[GitHub] [spark] HyukjinKwon commented on pull request #40603: [MINOR][CONNECT] Adding Proto Debug String to Job Description.

2023-04-10 Thread via GitHub
HyukjinKwon commented on PR #40603: URL: https://github.com/apache/spark/pull/40603#issuecomment-1502562037 Let me make a PR to redact it for now at least.

[GitHub] [spark] HyukjinKwon commented on pull request #40603: [MINOR][CONNECT] Adding Proto Debug String to Job Description.

2023-04-10 Thread via GitHub
HyukjinKwon commented on PR #40603: URL: https://github.com/apache/spark/pull/40603#issuecomment-1502561858 Actually it would also have a security concern as it exposes the local data as is.

[GitHub] [spark] dongjoon-hyun commented on pull request #40685: [SPARK-43050][SQL] Fix construct aggregate expressions by replacing grouping functions

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40685: URL: https://github.com/apache/spark/pull/40685#issuecomment-1502557611 Ya, I think so too~

[GitHub] [spark] dongjoon-hyun commented on pull request #40685: [SPARK-43050][SQL] Fix construct aggregate expressions by replacing grouping functions

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40685: URL: https://github.com/apache/spark/pull/40685#issuecomment-1502558052 This patch can wait for Apache Spark 3.4.1 and 3.3.3.

[GitHub] [spark] amaliujia commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
amaliujia commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162212156 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -902,152 +903,191 @@ case class Cast( } // LongConverter -

[GitHub] [spark] cloud-fan commented on pull request #40685: [SPARK-43050][SQL] Fix construct aggregate expressions by replacing grouping functions

2023-04-10 Thread via GitHub
cloud-fan commented on PR #40685: URL: https://github.com/apache/spark/pull/40685#issuecomment-1502553970 Since it's not a regression, we don't need to block 3.4 either.

[GitHub] [spark] cloud-fan commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162208044 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -902,152 +903,191 @@ case class Cast( } // LongConverter -

[GitHub] [spark] cloud-fan commented on a diff in pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-10 Thread via GitHub
cloud-fan commented on code in PR #40693: URL: https://github.com/apache/spark/pull/40693#discussion_r1162208044 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -902,152 +903,191 @@ case class Cast( } // LongConverter -

[GitHub] [spark] xinrong-meng commented on pull request #40725: [SPARK-43082][Connect][PYTHON] Arrow-optimized Python UDFs in Spark Connect

2023-04-10 Thread via GitHub
xinrong-meng commented on PR #40725: URL: https://github.com/apache/spark/pull/40725#issuecomment-1502530555 CI failed because of ``` Run echo "APACHE_SPARK_REF=$(git rev-parse HEAD)" >> $GITHUB_ENV fatal: detected dubious ownership in repository at '/__w/spark/spark' To add an

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40695: [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU

2023-04-10 Thread via GitHub
zhengruifeng commented on code in PR #40695: URL: https://github.com/apache/spark/pull/40695#discussion_r1162185540 ## python/pyspark/ml/torch/distributor.py: ## @@ -548,12 +560,23 @@ def set_torch_config(context: "BarrierTaskContext") -> None:

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40695: [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU

2023-04-10 Thread via GitHub
WeichenXu123 commented on code in PR #40695: URL: https://github.com/apache/spark/pull/40695#discussion_r1162186452 ## python/pyspark/ml/torch/tests/test_distributor.py: ## @@ -328,11 +328,11 @@ def test_get_num_tasks_locally(self) -> None: def

[GitHub] [spark] dtenedor opened a new pull request, #40732: [WIP][SPARK-43085][SQL] Support column DEFAULT assignment for multi-part table names

2023-04-10 Thread via GitHub
dtenedor opened a new pull request, #40732: URL: https://github.com/apache/spark/pull/40732 ### What changes were proposed in this pull request? This PR adds support for column DEFAULT assignment for multi-part table names. ### Why are the changes needed? Spark SQL
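As background for the preview above, here is a hedged SQL-level sketch (wrapped in PySpark) of the feature being extended: DEFAULT column values assigned through a multi-part table name. The `testcat.ns` catalog and schema are assumptions for illustration, not names from the PR.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes a v2 catalog named `testcat` is configured; the point is only that the
# table is referenced by a multi-part name (catalog.schema.table).
spark.sql("CREATE TABLE testcat.ns.events (id INT, status STRING DEFAULT 'new') USING parquet")
spark.sql("INSERT INTO testcat.ns.events (id) VALUES (1)")      # omitted column picks up 'new'
spark.sql("INSERT INTO testcat.ns.events VALUES (2, DEFAULT)")  # explicit DEFAULT keyword
spark.sql("SELECT * FROM testcat.ns.events").show()
```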

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40695: [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU

2023-04-10 Thread via GitHub
WeichenXu123 commented on code in PR #40695: URL: https://github.com/apache/spark/pull/40695#discussion_r1162185759 ## python/pyspark/ml/torch/distributor.py: ## @@ -535,12 +555,23 @@ def set_torch_config(context: "BarrierTaskContext") -> None:

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40695: [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU

2023-04-10 Thread via GitHub
zhengruifeng commented on code in PR #40695: URL: https://github.com/apache/spark/pull/40695#discussion_r1162185540 ## python/pyspark/ml/torch/distributor.py: ## @@ -548,12 +560,23 @@ def set_torch_config(context: "BarrierTaskContext") -> None:

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40695: [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU

2023-04-10 Thread via GitHub
WeichenXu123 commented on code in PR #40695: URL: https://github.com/apache/spark/pull/40695#discussion_r1162182017 ## python/pyspark/ml/torch/distributor.py: ## @@ -548,12 +560,23 @@ def set_torch_config(context: "BarrierTaskContext") -> None:

[GitHub] [spark] github-actions[bot] closed pull request #37348: [SPARK-39854][SQL] replaceWithAliases should keep the original children for Generate

2023-04-10 Thread via GitHub
github-actions[bot] closed pull request #37348: [SPARK-39854][SQL] replaceWithAliases should keep the original children for Generate URL: https://github.com/apache/spark/pull/37348

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40695: [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU

2023-04-10 Thread via GitHub
WeichenXu123 commented on code in PR #40695: URL: https://github.com/apache/spark/pull/40695#discussion_r1162184392 ## python/pyspark/ml/torch/distributor.py: ## @@ -150,8 +158,18 @@ def __init__( local_mode: bool = True, use_gpu: bool = True, ): -

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40695: [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU

2023-04-10 Thread via GitHub
WeichenXu123 commented on code in PR #40695: URL: https://github.com/apache/spark/pull/40695#discussion_r1162183831 ## python/pyspark/ml/torch/distributor.py: ## @@ -501,6 +517,10 @@ def _get_spark_task_function( input_params = self.input_params driver_address

[GitHub] [spark] zhengruifeng commented on pull request #40695: [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU

2023-04-10 Thread via GitHub
zhengruifeng commented on PR #40695: URL: https://github.com/apache/spark/pull/40695#issuecomment-1502498964 @grundprinzip would you mind taking another look at the changes in protos?

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40695: [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU

2023-04-10 Thread via GitHub
WeichenXu123 commented on code in PR #40695: URL: https://github.com/apache/spark/pull/40695#discussion_r1162182017 ## python/pyspark/ml/torch/distributor.py: ## @@ -548,12 +560,23 @@ def set_torch_config(context: "BarrierTaskContext") -> None:

[GitHub] [spark] dongjoon-hyun commented on pull request #40726: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.6

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40726: URL: https://github.com/apache/spark/pull/40726#issuecomment-1502484690 Oh, it was intentional https://github.com/apache/spark/pull/40726#pullrequestreview-1378012264, but thank you! Thank you, @HyukjinKwon and @viirya !

[GitHub] [spark] HyukjinKwon commented on pull request #40689: [SPARK-42951][SS][Connect] DataStreamReader APIs

2023-04-10 Thread via GitHub
HyukjinKwon commented on PR #40689: URL: https://github.com/apache/spark/pull/40689#issuecomment-1502484037 Merged to master.

[GitHub] [spark] HyukjinKwon closed pull request #40689: [SPARK-42951][SS][Connect] DataStreamReader APIs

2023-04-10 Thread via GitHub
HyukjinKwon closed pull request #40689: [SPARK-42951][SS][Connect] DataStreamReader APIs URL: https://github.com/apache/spark/pull/40689

[GitHub] [spark] HyukjinKwon commented on pull request #40726: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.6

2023-04-10 Thread via GitHub
HyukjinKwon commented on PR #40726: URL: https://github.com/apache/spark/pull/40726#issuecomment-1502483678 Hm, for some reason, it shows @LuciferYang as a primary author. I manually changed it to @dongjoon-hyun.

[GitHub] [spark] HyukjinKwon closed pull request #40726: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.6

2023-04-10 Thread via GitHub
HyukjinKwon closed pull request #40726: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.6 URL: https://github.com/apache/spark/pull/40726

[GitHub] [spark] HyukjinKwon commented on pull request #40726: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.6

2023-04-10 Thread via GitHub
HyukjinKwon commented on PR #40726: URL: https://github.com/apache/spark/pull/40726#issuecomment-1502482929 Merged to master.

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40691: [SPARK-43031] [SS] [Connect] Enable unit test and doctest for streaming

2023-04-10 Thread via GitHub
HyukjinKwon commented on code in PR #40691: URL: https://github.com/apache/spark/pull/40691#discussion_r1162175072 ## python/pyspark/sql/streaming/query.py: ## @@ -188,7 +192,7 @@ def awaitTermination(self, timeout: Optional[int] = None) -> Optional[bool]: Return

[GitHub] [spark] wangyum opened a new pull request, #40731: [SPARK-43087][SQL] Support coalesce buckets in join in AQE

2023-04-10 Thread via GitHub
wangyum opened a new pull request, #40731: URL: https://github.com/apache/spark/pull/40731 ### What changes were proposed in this pull request? This PR adds `CoalesceBucketsInJoin` to `AdaptiveSparkPlanExec.queryStagePreparationRules`. ### Why are the changes needed?
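For context, `CoalesceBucketsInJoin` itself already exists and is controlled by the configs below; the PR's change is to also run it when adaptive query execution plans the query. The table names and bucket counts are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")  # AQE, where the PR adds the rule
spark.conf.set("spark.sql.bucketing.coalesceBucketsInJoin.enabled", "true")
spark.conf.set("spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio", "4")

# Assuming t8 is bucketed into 8 buckets and t4 into 4 on `id`, the rule can coalesce
# the 8-bucket side down to 4 so the sort-merge join avoids an extra shuffle.
spark.table("t8").join(spark.table("t4"), "id").explain()
```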

[GitHub] [spark] warrenzhu25 commented on pull request #40730: [SPARK-43086][CORE] Support bin pack task scheduling on executors

2023-04-10 Thread via GitHub
warrenzhu25 commented on PR #40730: URL: https://github.com/apache/spark/pull/40730#issuecomment-1502463722 @dongjoon-hyun @mridulm @Ngone51 Help take a look?

[GitHub] [spark] dongjoon-hyun closed pull request #40727: [SPARK-43083][SQL][TESTS] Mark `*StateStoreSuite` as `ExtendedSQLTest`

2023-04-10 Thread via GitHub
dongjoon-hyun closed pull request #40727: [SPARK-43083][SQL][TESTS] Mark `*StateStoreSuite` as `ExtendedSQLTest` URL: https://github.com/apache/spark/pull/40727

[GitHub] [spark] dongjoon-hyun commented on pull request #40727: [SPARK-43083][SQL][TESTS] Mark `*StateStoreSuite` as `ExtendedSQLTest`

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40727: URL: https://github.com/apache/spark/pull/40727#issuecomment-1502463385 I also confirmed the moved `*StateStoreSuite` output in the GitHub Action log on this PR. - https://github.com/dongjoon-hyun/spark/actions/runs/4661381120/jobs/8250624115

[GitHub] [spark] wangyum commented on pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.13.0

2023-04-10 Thread via GitHub
wangyum commented on PR #40555: URL: https://github.com/apache/spark/pull/40555#issuecomment-1502458598 > BTW, if you mind, please revise the PR description. > > 1. Removing `Maybe it can improve read performance.` from the PR description. > 2. Coping [[SPARK-42926][BUILD][SQL]

[GitHub] [spark] warrenzhu25 opened a new pull request, #40730: [SPARK-43086][CORE] Support bin pack task scheduling on executors

2023-04-10 Thread via GitHub
warrenzhu25 opened a new pull request, #40730: URL: https://github.com/apache/spark/pull/40730 ### What changes were proposed in this pull request? Support bin pack task scheduling on executors. This is controlled by `spark.scheduler.binPack.enabled` ### Why are the changes
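The flag named in the preview is introduced by this PR and is not in any released Spark; below is a minimal sketch of where it would be set, purely for illustration.

```python
from pyspark.sql import SparkSession

# `spark.scheduler.binPack.enabled` is the config proposed in PR #40730, not an
# existing Spark setting; shown here only to illustrate how it would be enabled.
spark = (SparkSession.builder
         .appName("bin-pack-scheduling-demo")
         .config("spark.scheduler.binPack.enabled", "true")
         .getOrCreate())
```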

[GitHub] [spark] dongjoon-hyun commented on pull request #40727: [SPARK-43083][SQL][TESTS] Mark `*StateStoreSuite` as `ExtendedSQLTest`

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40727: URL: https://github.com/apache/spark/pull/40727#issuecomment-1502456159 Thank you, @huaxingao !

[GitHub] [spark] gengliangwang closed pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

2023-04-10 Thread via GitHub
gengliangwang closed pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation URL: https://github.com/apache/spark/pull/40710

[GitHub] [spark] gengliangwang commented on pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

2023-04-10 Thread via GitHub
gengliangwang commented on PR #40710: URL: https://github.com/apache/spark/pull/40710#issuecomment-1502451369 Thanks, merging to master/3.4

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

2023-04-10 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40677: URL: https://github.com/apache/spark/pull/40677#discussion_r1162154097 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileIndex.scala: ## @@ -23,11 +23,30 @@ import

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

2023-04-10 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40677: URL: https://github.com/apache/spark/pull/40677#discussion_r1162154097 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileIndex.scala: ## @@ -23,11 +23,30 @@ import

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

2023-04-10 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40677: URL: https://github.com/apache/spark/pull/40677#discussion_r1162151522 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala: ## @@ -554,6 +554,28 @@ object

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

2023-04-10 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40677: URL: https://github.com/apache/spark/pull/40677#discussion_r1162151522 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala: ## @@ -554,6 +554,28 @@ object

[GitHub] [spark] dongjoon-hyun commented on pull request #40727: [SPARK-43083][SQL][TESTS] Mark `*StateStoreSuite` as `ExtendedSQLTest`

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40727: URL: https://github.com/apache/spark/pull/40727#issuecomment-1502430516 Could you review this PR when you have some time, @huaxingao ?

[GitHub] [spark] dongjoon-hyun commented on pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.13.0

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40555: URL: https://github.com/apache/spark/pull/40555#issuecomment-1502429680 BTW, if you mind, please revise the PR description. 1. Removing `Maybe it can improve read performance.` from the PR description. 2. Coping

[GitHub] [spark] dongjoon-hyun commented on pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.13.0

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40555: URL: https://github.com/apache/spark/pull/40555#issuecomment-1502428223 Thank you for the confirmation.

[GitHub] [spark] wangyum commented on pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.13.0

2023-04-10 Thread via GitHub
wangyum commented on PR #40555: URL: https://github.com/apache/spark/pull/40555#issuecomment-1502427643 @dongjoon-hyun Yes. There's no noticeable perf difference.

[GitHub] [spark] dongjoon-hyun commented on pull request #40726: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.6

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40726: URL: https://github.com/apache/spark/pull/40726#issuecomment-1502427149 Could you review this PR, @viirya ? I verified manually. ``` $ ls -alt total 67688 -rw-r--r--@ 1 dongjoon staff 1955 Apr 10 15:27 maven-metadata-local.xml

[GitHub] [spark] zhenlineo opened a new pull request, #40729: [WIP][CONNECT] Adding groupByKey + mapGroup functions

2023-04-10 Thread via GitHub
zhenlineo opened a new pull request, #40729: URL: https://github.com/apache/spark/pull/40729 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

[GitHub] [spark] dongjoon-hyun commented on pull request #40687: [SPARK-43052][CORE] Handle stacktrace with null file name in event log

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40687: URL: https://github.com/apache/spark/pull/40687#issuecomment-1502424431 Thank you for your answers, @warrenzhu25 .

[GitHub] [spark] warrenzhu25 commented on pull request #40687: [SPARK-43052][CORE] Handle stacktrace with null file name in event log

2023-04-10 Thread via GitHub
warrenzhu25 commented on PR #40687: URL: https://github.com/apache/spark/pull/40687#issuecomment-1502423743 > Do you happen to know when this bug starts, @warrenzhu25 ? Sorry, I have no idea. It's the first time I have seen this.

[GitHub] [spark] dongjoon-hyun commented on pull request #40687: [SPARK-43052][CORE] Handle stacktrace with null file name in event log

2023-04-10 Thread via GitHub
dongjoon-hyun commented on PR #40687: URL: https://github.com/apache/spark/pull/40687#issuecomment-1502418739 Do you happen to know when this bug starts, @warrenzhu25 ?

[GitHub] [spark] warrenzhu25 commented on pull request #40687: [SPARK-43052][CORE] Handle stacktrace with null file name in event log

2023-04-10 Thread via GitHub
warrenzhu25 commented on PR #40687: URL: https://github.com/apache/spark/pull/40687#issuecomment-1502409840 > BTW, according to JIRA, is this a regression at Apache Spark 3.3.2, @warrenzhu25 ? I don't think so.
