[GitHub] [spark] cloud-fan commented on pull request #38404: [SPARK-40956] SQL Equivalent for Dataframe overwrite command

2022-11-01 Thread GitBox
cloud-fan commented on PR #38404: URL: https://github.com/apache/spark/pull/38404#issuecomment-1299658043 seems there is a test failure ``` SQLQueryTestSuite.interval.sql org.scalatest.exceptions.TestFailedException: interval.sql Expected "org.apache.spark.[SparkArithmeticExceptio

[GitHub] [spark] lyy-pineapple commented on pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2022-11-01 Thread GitBox
lyy-pineapple commented on PR #38171: URL: https://github.com/apache/spark/pull/38171#issuecomment-1299657792 > How much confidence do we have in joni? Is it widely adopted by other open-source projects? I'm a bit concerned about moving away from JDK regex and picking a project that I just

[GitHub] [spark] cloud-fan commented on a diff in pull request #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL

2022-11-01 Thread GitBox
cloud-fan commented on code in PR #38475: URL: https://github.com/apache/spark/pull/38475#discussion_r1011251487 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -250,3 +251,15 @@ message SubqueryAlias { // Optional. Qualifier of the alias. repea

[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-11-01 Thread GitBox
EnricoMi commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1011250306 ## python/pyspark/worker.py: ## @@ -146,7 +146,74 @@ def verify_result_type(result): ) -def wrap_cogrouped_map_pandas_udf(f, return_type, argspec): +def verif
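The diff above is about verifying the result a user function returns against the declared schema and producing a clearer error on mismatch. A minimal, self-contained sketch of that kind of check (illustrative names, not Spark's actual worker code):

```python
def verify_result_columns(result_cols, expected_cols):
    """Raise a descriptive error when the returned columns differ from the
    declared schema, naming exactly what is missing or unexpected."""
    missing = [c for c in expected_cols if c not in result_cols]
    extra = [c for c in result_cols if c not in expected_cols]
    if missing or extra:
        raise TypeError(
            "Column names of the returned DataFrame do not match the "
            f"specified schema. Missing: {missing}. Unexpected: {extra}."
        )
    return result_cols

verify_result_columns(["id", "value"], ["id", "value"])  # matches, no error
```

The point of the PR is the quality of the message: instead of a generic failure deep inside serialization, the user is told which column names disagree.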

[GitHub] [spark] jerrypeng commented on pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-11-01 Thread GitBox
jerrypeng commented on PR #38430: URL: https://github.com/apache/spark/pull/38430#issuecomment-1299643015 @HeartSaVioR @LuciferYang thank you for the review. I have addressed your comments. PTAL. Thanks in advance! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] jerrypeng commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-11-01 Thread GitBox
jerrypeng commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1011241156 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -19,7 +19,9 @@ package org.apache.spark.sql.execution.streaming im

[GitHub] [spark] mridulm commented on pull request #36165: [SPARK-36620][SHUFFLE] Add Push Based Shuffle client side metrics

2022-11-01 Thread GitBox
mridulm commented on PR #36165: URL: https://github.com/apache/spark/pull/36165#issuecomment-1299640509 +CC @zhouyejoe

[GitHub] [spark] jerrypeng commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-11-01 Thread GitBox
jerrypeng commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1011239497 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -64,6 +67,17 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSe

[GitHub] [spark] mridulm commented on pull request #38064: [SPARK-40622][SQL][CORE]Result of a single task in collect() must fit in 2GB

2022-11-01 Thread GitBox
mridulm commented on PR #38064: URL: https://github.com/apache/spark/pull/38064#issuecomment-1299638357 Can you pls take a look at the build failure @liuzqt ?

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL

2022-11-01 Thread GitBox
zhengruifeng commented on code in PR #38475: URL: https://github.com/apache/spark/pull/38475#discussion_r1011238884 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -123,6 +125,24 @@ class SparkConnectPlanner(plan: proto.R

[GitHub] [spark] jerrypeng commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-11-01 Thread GitBox
jerrypeng commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1011238096 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](spa
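The idea under review in this PR — fronting a file-backed metadata log with an in-memory cache so repeated batch lookups avoid filesystem reads — can be sketched like this (a hypothetical Python stand-in, not the PR's Scala implementation; `CachedMetadataLog` and its bound are illustrative):

```python
from collections import OrderedDict

class CachedMetadataLog:
    """Bounded LRU cache in front of a durable batch-metadata store."""

    def __init__(self, store, capacity=100):
        self.store = store            # batch_id -> metadata, the durable store
        self.capacity = capacity
        self._cache = OrderedDict()   # batch_id -> metadata, in LRU order

    def add(self, batch_id, metadata):
        self.store[batch_id] = metadata   # write-through: persist first
        self._put(batch_id, metadata)

    def get(self, batch_id):
        if batch_id in self._cache:
            self._cache.move_to_end(batch_id)   # refresh LRU position
            return self._cache[batch_id]
        metadata = self.store.get(batch_id)     # miss: fall back to the store
        if metadata is not None:
            self._put(batch_id, metadata)
        return metadata

    def _put(self, batch_id, metadata):
        self._cache[batch_id] = metadata
        self._cache.move_to_end(batch_id)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)     # evict the oldest entry
```

Bounding the cache matters for long-running streaming queries: the store grows with every batch, but only recent batches are read hot.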

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-11-01 Thread GitBox
HeartSaVioR commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1011237000 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -64,6 +67,17 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](spark

[GitHub] [spark] zhengruifeng commented on pull request #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL

2022-11-01 Thread GitBox
zhengruifeng commented on PR #38475: URL: https://github.com/apache/spark/pull/38475#issuecomment-1299635038 if we try to implement it in the client side, another problem is that it's likely to reuse and depend on some functionality in `pyspark/sql`

[GitHub] [spark] LuciferYang commented on pull request #38476: Revert "[SPARK-40976][BUILD] Upgrade sbt to 1.7.3"

2022-11-01 Thread GitBox
LuciferYang commented on PR #38476: URL: https://github.com/apache/spark/pull/38476#issuecomment-1299634288 > Sorry for the late reply. I want to know why GA doesn't have this issue? master CI always seems healthy, how can we reproduce this? Let me investigate this. Run `dev/sbt-chec

[GitHub] [spark] mridulm commented on pull request #38467: [SPARK-40987][CORE] Avoid creating a directory when deleting a block, causing DAGScheduler to not work

2022-11-01 Thread GitBox
mridulm commented on PR #38467: URL: https://github.com/apache/spark/pull/38467#issuecomment-1299634164 If we are making this change, there are a bunch of other places which are candidates for `needCreate = false` - can we include those as well ?
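The `needCreate = false` suggestion refers to a general pattern: resolving where a block file lives should not, as a side effect, create its parent directory when the caller only intends to delete or probe the file. A hypothetical sketch of the pattern (names are illustrative, not Spark's `DiskBlockManager` API):

```python
import os
import tempfile

def block_path(root, block_id, need_create=False):
    # Resolve the block file's location; only materialize the parent
    # directory when the caller will actually write (need_create=True).
    # Delete/probe paths should not leave empty directories behind.
    subdir = os.path.join(root, block_id[:2])
    if need_create:
        os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, block_id)

root = tempfile.mkdtemp()
p = block_path(root, "shuffle_0_0_0", need_create=False)
assert not os.path.exists(os.path.dirname(p))   # no directory was created
```

This is why the bug in the PR title matters: a delete that recreates directories can make cleanup appear to fail and confuse downstream components.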

[GitHub] [spark] HyukjinKwon commented on pull request #38470: [CONNECT] [DOC] Defining Spark Connect Client Connection String

2022-11-01 Thread GitBox
HyukjinKwon commented on PR #38470: URL: https://github.com/apache/spark/pull/38470#issuecomment-1299633920 Maybe putting it at the top module level (`connector/connect/README.md`) for now could be a good idea (?). Just wanted to avoid a different structure compared to other components (`co

[GitHub] [spark] mridulm commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-01 Thread GitBox
mridulm commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1299631344 For cases like this, it might actually be better to add the node to the deny list and fail the task to recompute the parent stage?

[GitHub] [spark] mridulm commented on pull request #38428: [SPARK-40912][CORE][WIP] Overhead of Exceptions in KryoDeserializationStream

2022-11-01 Thread GitBox
mridulm commented on PR #38428: URL: https://github.com/apache/spark/pull/38428#issuecomment-1299630091 The PR as such looks reasonable to me - can we add a test to explicitly test for EOF behavior ? +CC @JoshRosen who had worked on this in the distant past :-) +CC @Ngone51

[GitHub] [spark] grundprinzip commented on pull request #38470: [CONNECT] [DOC] Defining Spark Connect Client Connection String

2022-11-01 Thread GitBox
grundprinzip commented on PR #38470: URL: https://github.com/apache/spark/pull/38470#issuecomment-1299629632 What if we link to it from the top-level README in the component? The reason why it's not in the code is that it's client-language agnostic.

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-11-01 Thread GitBox
HeartSaVioR commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1011232238 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -19,7 +19,9 @@ package org.apache.spark.sql.execution.streaming

[GitHub] [spark] zhengruifeng commented on pull request #38471: [SPARK-40883][CONNECT][FOLLOW-UP] Range.step is required and Python client should have a default value=1

2022-11-01 Thread GitBox
zhengruifeng commented on PR #38471: URL: https://github.com/apache/spark/pull/38471#issuecomment-1299626924 merged to master

[GitHub] [spark] zhengruifeng closed pull request #38471: [SPARK-40883][CONNECT][FOLLOW-UP] Range.step is required and Python client should have a default value=1

2022-11-01 Thread GitBox
zhengruifeng closed pull request #38471: [SPARK-40883][CONNECT][FOLLOW-UP] Range.step is required and Python client should have a default value=1 URL: https://github.com/apache/spark/pull/38471

[GitHub] [spark] mridulm commented on a diff in pull request #38428: [SPARK-40912][CORE][WIP] Overhead of Exceptions in KryoDeserializationStream

2022-11-01 Thread GitBox
mridulm commented on code in PR #38428: URL: https://github.com/apache/spark/pull/38428#discussion_r1011230912 ## core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala: ## @@ -504,44 +505,31 @@ class ExternalAppendOnlyMap[K, V, C]( * If no more p

[GitHub] [spark] HyukjinKwon commented on pull request #38470: [CONNECT] [DOC] Defining Spark Connect Client Connection String

2022-11-01 Thread GitBox
HyukjinKwon commented on PR #38470: URL: https://github.com/apache/spark/pull/38470#issuecomment-1299624291 For developer documentation, it might be better placed under the sources as a comment, e.g., in `packages.scala`. e.g.) https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org

[GitHub] [spark] mridulm commented on a diff in pull request #38428: [SPARK-40912][CORE][WIP] Overhead of Exceptions in KryoDeserializationStream

2022-11-01 Thread GitBox
mridulm commented on code in PR #38428: URL: https://github.com/apache/spark/pull/38428#discussion_r1011229842 ## core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala: ## @@ -301,15 +300,18 @@ class KryoDeserializationStream( private[this] var kryo: Kryo = s

[GitHub] [spark] dongjoon-hyun commented on pull request #38474: [SPARK-40991][PYTHON] Update `cloudpickle` to v2.2.0

2022-11-01 Thread GitBox
dongjoon-hyun commented on PR #38474: URL: https://github.com/apache/spark/pull/38474#issuecomment-1299623034 Thank you so much, @HyukjinKwon !

[GitHub] [spark] mridulm commented on a diff in pull request #38428: [SPARK-40912][CORE][WIP] Overhead of Exceptions in KryoDeserializationStream

2022-11-01 Thread GitBox
mridulm commented on code in PR #38428: URL: https://github.com/apache/spark/pull/38428#discussion_r1011229019 ## core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala: ## @@ -324,6 +326,36 @@ class KryoDeserializationStream( } } } + + final overri
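The overhead this PR targets is the cost of exception-driven control flow: deserialization streams whose `readObject()` signals end-of-input by throwing make every drain pay for an exception throw/catch, which adds up when done per stream at scale. A hypothetical Python sketch contrasting the two styles (not Spark's Kryo code; `FakeStream` and the function names are illustrative):

```python
class FakeStream:
    """Minimal stand-in for a deserialization stream."""
    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def read_object(self):
        # Exception-driven EOF: raise at end of input, like streams whose
        # readObject() throws when the underlying input is exhausted.
        if self._pos >= len(self._items):
            raise EOFError
        obj = self._items[self._pos]
        self._pos += 1
        return obj

    def read_object_or_none(self):
        # Sentinel-driven EOF: report end of input without raising.
        if self._pos >= len(self._items):
            return None
        return self.read_object()

def drain_with_exception(stream):
    items = []
    while True:
        try:
            items.append(stream.read_object())
        except EOFError:          # EOF is the normal exit, not an error
            return items

def drain_with_sentinel(stream):
    items = []
    while (obj := stream.read_object_or_none()) is not None:
        items.append(obj)
    return items
```

One caveat the sentinel style carries: a `None`-like sentinel cannot itself be a legal stream value, so real implementations typically expose a separate end-of-stream flag or a `hasNext`-style probe instead.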

[GitHub] [spark] HyukjinKwon closed pull request #38474: [SPARK-40991][PYTHON] Update `cloudpickle` to v2.2.0

2022-11-01 Thread GitBox
HyukjinKwon closed pull request #38474: [SPARK-40991][PYTHON] Update `cloudpickle` to v2.2.0 URL: https://github.com/apache/spark/pull/38474

[GitHub] [spark] HyukjinKwon commented on pull request #38474: [SPARK-40991][PYTHON] Update `cloudpickle` to v2.2.0

2022-11-01 Thread GitBox
HyukjinKwon commented on PR #38474: URL: https://github.com/apache/spark/pull/38474#issuecomment-1299619944 Merged to master.

[GitHub] [spark] mridulm commented on pull request #38371: [SPARK-40968] Fix a few wrong/misleading comments in DAGSchedulerSuite

2022-11-01 Thread GitBox
mridulm commented on PR #38371: URL: https://github.com/apache/spark/pull/38371#issuecomment-1299617022 Merged to master, thanks for fixing this @JiexingLi ! Thanks for looking into this @HyukjinKwon :-)

[GitHub] [spark] asfgit closed pull request #38371: [SPARK-40968] Fix a few wrong/misleading comments in DAGSchedulerSuite

2022-11-01 Thread GitBox
asfgit closed pull request #38371: [SPARK-40968] Fix a few wrong/misleading comments in DAGSchedulerSuite URL: https://github.com/apache/spark/pull/38371

[GitHub] [spark] panbingkun commented on pull request #38463: [SPARK-40374][SQL] Migrate type check failures of type creators onto error classes

2022-11-01 Thread GitBox
panbingkun commented on PR #38463: URL: https://github.com/apache/spark/pull/38463#issuecomment-1299615584 cc @MaxGekk

[GitHub] [spark] mridulm commented on pull request #38377: [SPARK-40901][CORE] Unable to store Spark Driver logs with Absolute Hadoop based URI FS Path

2022-11-01 Thread GitBox
mridulm commented on PR #38377: URL: https://github.com/apache/spark/pull/38377#issuecomment-1299613189 Makes sense ... why not simply `val dfsLogFile = new Path(rootDir, appId + DRIVER_LOG_FILE_SUFFIX)` instead, btw? I am trying to see if I am missing anything here ...
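The simplification being suggested relies on Hadoop's `Path(parent, child)` constructor resolving the child name against the parent directory in one step. A sketch of the same resolution using `posixpath` as a stand-in (the suffix and paths below are assumed values for illustration, not Spark's actual constants):

```python
import posixpath

DRIVER_LOG_FILE_SUFFIX = "_driver.log"          # assumed, for illustration
root_dir = "/user/spark/driverLogs"             # hypothetical log root
app_id = "app-20221101120000-0001"              # hypothetical application id

# Equivalent in spirit to: new Path(rootDir, appId + DRIVER_LOG_FILE_SUFFIX)
dfs_log_file = posixpath.join(root_dir, app_id + DRIVER_LOG_FILE_SUFFIX)
# -> "/user/spark/driverLogs/app-20221101120000-0001_driver.log"
```

The point of the question on the PR is that single-step resolution against the root also behaves correctly when the root is an absolute Hadoop URI, which is the bug being fixed.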

[GitHub] [spark] grundprinzip commented on pull request #38470: [CONNECT] [DOC] Defining Spark Connect Client Connection String

2022-11-01 Thread GitBox
grundprinzip commented on PR #38470: URL: https://github.com/apache/spark/pull/38470#issuecomment-1299609859 @HyukjinKwon I will add a Jira; this is just the starting point to align on where we want to go. My idea would be that once this is merged I will create a PR for the python clien

[GitHub] [spark] cloud-fan commented on pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2022-11-01 Thread GitBox
cloud-fan commented on PR #38171: URL: https://github.com/apache/spark/pull/38171#issuecomment-1299607243 How much confidence do we have in joni? Is it widely adopted by other open-source projects? I'm a bit concerned about moving away from JDK regex and picking a project that I just heard

[GitHub] [spark] LuciferYang commented on pull request #38476: Revert "[SPARK-40976][BUILD] Upgrade sbt to 1.7.3"

2022-11-01 Thread GitBox
LuciferYang commented on PR #38476: URL: https://github.com/apache/spark/pull/38476#issuecomment-1299589185 Sorry for the late reply. I want to know why GA doesn't have this issue? master CI always seems healthy, how can we reproduce this? Let me investigate this.

[GitHub] [spark] MaxGekk closed pull request #38478: [MINOR][SQL] Wrap `given` in backticks to fix compilation warning

2022-11-01 Thread GitBox
MaxGekk closed pull request #38478: [MINOR][SQL] Wrap `given` in backticks to fix compilation warning URL: https://github.com/apache/spark/pull/38478

[GitHub] [spark] MaxGekk commented on pull request #38478: [MINOR][SQL] Wrap `given` in backticks to fix compilation warning

2022-11-01 Thread GitBox
MaxGekk commented on PR #38478: URL: https://github.com/apache/spark/pull/38478#issuecomment-1299585051 +1, LGTM. Merging to master. Thank you, @LuciferYang.

[GitHub] [spark] MaxGekk closed pull request #38438: [SPARK-40748][SQL] Migrate type check failures of conditions onto error classes

2022-11-01 Thread GitBox
MaxGekk closed pull request #38438: [SPARK-40748][SQL] Migrate type check failures of conditions onto error classes URL: https://github.com/apache/spark/pull/38438

[GitHub] [spark] MaxGekk commented on pull request #38438: [SPARK-40748][SQL] Migrate type check failures of conditions onto error classes

2022-11-01 Thread GitBox
MaxGekk commented on PR #38438: URL: https://github.com/apache/spark/pull/38438#issuecomment-1299581375 +1, LGTM. Merging to master. Thank you, @panbingkun.

[GitHub] [spark] HeartSaVioR commented on pull request #38404: [SPARK-40956] SQL Equivalent for Dataframe overwrite command

2022-11-01 Thread GitBox
HeartSaVioR commented on PR #38404: URL: https://github.com/apache/spark/pull/38404#issuecomment-1299562145 (Just to remind, please update PR title and description as this PR is no longer a draft.)

[GitHub] [spark] amaliujia commented on pull request #38477: [SPARK-40993][CONNECT]PYTHON[DOCS] Migrate markdown style README to PySpark Development Documentation

2022-11-01 Thread GitBox
amaliujia commented on PR #38477: URL: https://github.com/apache/spark/pull/38477#issuecomment-1299548329 cc @HyukjinKwon @grundprinzip

[GitHub] [spark] dongjoon-hyun commented on pull request #38476: Revert "[SPARK-40976][BUILD] Upgrade sbt to 1.7.3"

2022-11-01 Thread GitBox
dongjoon-hyun commented on PR #38476: URL: https://github.com/apache/spark/pull/38476#issuecomment-1299540936 Oh, thank you for reverting, @linhongliu-db and @HyukjinKwon .

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-01 Thread GitBox
WeichenXu123 commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r109299 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,474 @@ def array_to_vector(col: Column) -> Column: return Column(sc._jvm.org.apache.spark.ml.functions.a

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-01 Thread GitBox
WeichenXu123 commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r108516 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,474 @@ def array_to_vector(col: Column) -> Column: return Column(sc._jvm.org.apache.spark.ml.functions.a

[GitHub] [spark] dongjoon-hyun commented on pull request #38474: [SPARK-40991][PYTHON] Update `cloudpickle` to v2.2.0

2022-11-01 Thread GitBox
dongjoon-hyun commented on PR #38474: URL: https://github.com/apache/spark/pull/38474#issuecomment-1299514591 Thank you for review, @HyukjinKwon and @itholic .

[GitHub] [spark] lyy-pineapple commented on pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2022-11-01 Thread GitBox
lyy-pineapple commented on PR #38171: URL: https://github.com/apache/spark/pull/38171#issuecomment-1299505071 Added a new benchmark comparing Java 11 and Java 17. cc @cloud-fan @LuciferYang

[GitHub] [spark] LuciferYang opened a new pull request, #38478: [MINOR][SQL] Wrap `given` in backticks to fix compilation warning

2022-11-01 Thread GitBox
LuciferYang opened a new pull request, #38478: URL: https://github.com/apache/spark/pull/38478 ### What changes were proposed in this pull request? A minor change to fix a Scala-related compilation warning ``` [WARNING] /spark-source/sql/catalyst/src/main/scala/org/apache/sp

[GitHub] [spark] amaliujia opened a new pull request, #38477: [SPARK-40993][CONNECT]PYTHON[DOCS] Migrate markdown style README to PySpark Development Documentation

2022-11-01 Thread GitBox
amaliujia opened a new pull request, #38477: URL: https://github.com/apache/spark/pull/38477 ### What changes were proposed in this pull request? This PR consolidates the development facing documentation of Spark Connect Python client into existing PySpark development doc (mor

[GitHub] [spark] LuciferYang commented on a diff in pull request #38465: [SPARK-40985][BUILD] Upgrade RoaringBitmap to 0.9.35

2022-11-01 Thread GitBox
LuciferYang commented on code in PR #38465: URL: https://github.com/apache/spark/pull/38465#discussion_r1011088808 ## core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt: ## @@ -2,12 +2,12 @@ MapStatuses Convert Benchmark

[GitHub] [spark] beliefer commented on a diff in pull request #38461: [SPARK-34079][SQL][FOLLOWUP] Improve the readability and simplify the code for MergeScalarSubqueries

2022-11-01 Thread GitBox
beliefer commented on code in PR #38461: URL: https://github.com/apache/spark/pull/38461#discussion_r1011086001 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueries.scala: ## @@ -346,25 +346,19 @@ object MergeScalarSubqueries extends Rule[

[GitHub] [spark] ulysses-you commented on pull request #36698: [SPARK-39316][SQL] Merge PromotePrecision and CheckOverflow into decimal binary arithmetic

2022-11-01 Thread GitBox
ulysses-you commented on PR #36698: URL: https://github.com/apache/spark/pull/36698#issuecomment-1299431744 @gengliangwang it is a bug fix and also an improvement that saves an unnecessary cast. The query would otherwise produce an unexpected precision and scale. before: `decimal(28,2)`, after: `decim
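The precision/scale types quoted in the comment above come from the rules Spark (following Hive) uses to assign a result type to decimal binary arithmetic. As context only — not a claim about this PR's exact change — the commonly documented rule for addition can be sketched as:

```python
MAX_PRECISION = 38   # Spark's maximum decimal precision

def decimal_add_result_type(p1, s1, p2, s2):
    # Result type for decimal(p1,s1) + decimal(p2,s2) per the commonly
    # documented rule:
    #   scale     = max(s1, s2)
    #   precision = max(p1 - s1, p2 - s2) + scale + 1   (capped at 38)
    # The cap is where surprising precision/scale results tend to surface.
    scale = max(s1, s2)
    precision = max(p1 - s1, p2 - s2) + scale + 1
    return min(precision, MAX_PRECISION), scale
```

For example, two `decimal(38,2)` operands yield `decimal(38,2)` only because of the cap; the uncapped precision would be 39.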

[GitHub] [spark] itholic commented on pull request #38474: [SPARK-40991][PYTHON] Update `cloudpickle` to v2.2.0

2022-11-01 Thread GitBox
itholic commented on PR #38474: URL: https://github.com/apache/spark/pull/38474#issuecomment-1299422328 +1 for upgrading the `cloudpickle` version

[GitHub] [spark] itholic commented on a diff in pull request #38465: [SPARK-40985][BUILD] Upgrade RoaringBitmap to 0.9.35

2022-11-01 Thread GitBox
itholic commented on code in PR #38465: URL: https://github.com/apache/spark/pull/38465#discussion_r1011048696 ## core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt: ## @@ -2,12 +2,12 @@ MapStatuses Convert Benchmark

[GitHub] [spark] HyukjinKwon commented on pull request #38470: [CONNECT] [DOC] Defining Spark Connect Client Connection String

2022-11-01 Thread GitBox
HyukjinKwon commented on PR #38470: URL: https://github.com/apache/spark/pull/38470#issuecomment-1299418917 Maybe it's better to have a JIRA. BTW, wonder if we have an e2e example that users can copy and paste to try. (e.g., like most of docs in https://spark.apache.org/docs/latest/index.htm

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38470: [CONNECT] [DOC] Defining Spark Connect Client Connection String

2022-11-01 Thread GitBox
HyukjinKwon commented on code in PR #38470: URL: https://github.com/apache/spark/pull/38470#discussion_r1011045449 ## connector/connect/doc/client_connection_string.md: ## @@ -0,0 +1,110 @@ +# Connecting to Spark Connect using Clients Review Comment: The usage documentation

[GitHub] [spark] HyukjinKwon closed pull request #38473: [SPARK-40990][PYTHON] DataFrame creation from 2d NumPy array with arbitrary columns

2022-11-01 Thread GitBox
HyukjinKwon closed pull request #38473: [SPARK-40990][PYTHON] DataFrame creation from 2d NumPy array with arbitrary columns URL: https://github.com/apache/spark/pull/38473

[GitHub] [spark] HyukjinKwon commented on pull request #38473: [SPARK-40990][PYTHON] DataFrame creation from 2d NumPy array with arbitrary columns

2022-11-01 Thread GitBox
HyukjinKwon commented on PR #38473: URL: https://github.com/apache/spark/pull/38473#issuecomment-1299413739 Merged to master.

[GitHub] [spark] HyukjinKwon closed pull request #38476: Revert "[SPARK-40976][BUILD] Upgrade sbt to 1.7.3"

2022-11-01 Thread GitBox
HyukjinKwon closed pull request #38476: Revert "[SPARK-40976][BUILD] Upgrade sbt to 1.7.3" URL: https://github.com/apache/spark/pull/38476

[GitHub] [spark] HyukjinKwon commented on pull request #38476: Revert "[SPARK-40976][BUILD] Upgrade sbt to 1.7.3"

2022-11-01 Thread GitBox
HyukjinKwon commented on PR #38476: URL: https://github.com/apache/spark/pull/38476#issuecomment-1299411174 Merged to master since this is a clean revert.

[GitHub] [spark] HyukjinKwon closed pull request #38409: [SPARK-40930][CONNECT] Support Collect() in Python client

2022-11-01 Thread GitBox
HyukjinKwon closed pull request #38409: [SPARK-40930][CONNECT] Support Collect() in Python client URL: https://github.com/apache/spark/pull/38409

[GitHub] [spark] HyukjinKwon commented on pull request #38409: [SPARK-40930][CONNECT] Support Collect() in Python client

2022-11-01 Thread GitBox
HyukjinKwon commented on PR #38409: URL: https://github.com/apache/spark/pull/38409#issuecomment-1299410618 Merged to master.

[GitHub] [spark] linhongliu-db commented on pull request #38476: Revert "[SPARK-40976][BUILD] Upgrade sbt to 1.7.3"

2022-11-01 Thread GitBox
linhongliu-db commented on PR #38476: URL: https://github.com/apache/spark/pull/38476#issuecomment-1299401798 BTW, I really couldn't understand how this is problematic: https://github.com/sbt/sbt/compare/v1.7.2...v1.7.3

[GitHub] [spark] linhongliu-db commented on pull request #38476: Revert "[SPARK-40976][BUILD] Upgrade sbt to 1.7.3"

2022-11-01 Thread GitBox
linhongliu-db commented on PR #38476: URL: https://github.com/apache/spark/pull/38476#issuecomment-1299401226 cc @LuciferYang, maybe you'll have a fix so we won't need to revert it.

[GitHub] [spark] linhongliu-db opened a new pull request, #38476: Revert "[SPARK-40976][BUILD] Upgrade sbt to 1.7.3"

2022-11-01 Thread GitBox
linhongliu-db opened a new pull request, #38476: URL: https://github.com/apache/spark/pull/38476 ### What changes were proposed in this pull request? This reverts commit 9fc3aa0b1c092ab1f13b26582e3ece7440fbfc3b. ### Why are the changes needed? The upgrade breaks `

[GitHub] [spark] github-actions[bot] commented on pull request #37259: spark-submit: throw an error when duplicate argument is provided

2022-11-01 Thread GitBox
github-actions[bot] commented on PR #37259: URL: https://github.com/apache/spark/pull/37259#issuecomment-1299388827 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] AmplabJenkins commented on pull request #38452: [SPARK-40802][SQL] Resolve JDBCRelation's schema with preparing the statement

2022-11-01 Thread GitBox
AmplabJenkins commented on PR #38452: URL: https://github.com/apache/spark/pull/38452#issuecomment-1299377919 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #38453: [SPARK-40977][CONNECT][PYTHON] Complete Support for Union in Python client

2022-11-01 Thread GitBox
AmplabJenkins commented on PR #38453: URL: https://github.com/apache/spark/pull/38453#issuecomment-1299377890 Can one of the admins verify this patch?

[GitHub] [spark] amaliujia commented on pull request #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL

2022-11-01 Thread GitBox
amaliujia commented on PR #38475: URL: https://github.com/apache/spark/pull/38475#issuecomment-1299371686 @cloud-fan This is a good example that one API can be implemented with or without a plan. Basically if we don't add a new plan to the proto, clients can still implement `

[GitHub] [spark] amaliujia opened a new pull request, #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL

2022-11-01 Thread GitBox
amaliujia opened a new pull request, #38475: URL: https://github.com/apache/spark/pull/38475 ### What changes were proposed in this pull request? Add `RenameColumns` to proto to support the implementation for `toDF(columnNames: String*)` which renames the input relation to a d

[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-01 Thread GitBox
leewyang commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1010991074 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,474 @@ def array_to_vector(col: Column) -> Column: return Column(sc._jvm.org.apache.spark.ml.functions.array
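The `batch_infer_udf` under review batches rows before handing them to a model. A framework-agnostic sketch of that batching step (names are illustrative assumptions, not the PR's actual API):

```python
from typing import Callable, Iterator, List

def batched(rows: Iterator, batch_size: int) -> Iterator[List]:
    """Group an iterator of rows into fixed-size batches, as a batch
    inference UDF would before calling the model's predict function."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# A toy "model" that doubles each input, applied batch-by-batch.
predict = lambda xs: [x * 2 for x in xs]
out = [y for b in batched(iter(range(5)), 2) for y in predict(b)]
print(out)  # [0, 2, 4, 6, 8]
```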

[GitHub] [spark] srowen commented on pull request #38469: [MINOR][BUILD] Correct the `files` contend in `checkstyle-suppressions.xml`

2022-11-01 Thread GitBox
srowen commented on PR #38469: URL: https://github.com/apache/spark/pull/38469#issuecomment-1299341398 Merged to master/3.3/3.2

[GitHub] [spark] srowen closed pull request #38469: [MINOR][BUILD] Correct the `files` contend in `checkstyle-suppressions.xml`

2022-11-01 Thread GitBox
srowen closed pull request #38469: [MINOR][BUILD] Correct the `files` contend in `checkstyle-suppressions.xml` URL: https://github.com/apache/spark/pull/38469

[GitHub] [spark] dongjoon-hyun opened a new pull request, #38474: [SPARK-XXX][PYTHON] Update cloudpickle to v2.2.0

2022-11-01 Thread GitBox
dongjoon-hyun opened a new pull request, #38474: URL: https://github.com/apache/spark/pull/38474 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### H

[GitHub] [spark] xinrong-meng opened a new pull request, #38473: [SPARK-40990][PYTHON] DataFrame creation from 2d NumPy array with arbitrary columns

2022-11-01 Thread GitBox
xinrong-meng opened a new pull request, #38473: URL: https://github.com/apache/spark/pull/38473 ### What changes were proposed in this pull request? Support DataFrame creation from 2d NumPy array with arbitrary columns. ### Why are the changes needed? Currently, DataFrame creatio
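Before this change, a 2d NumPy array has to be converted into a row-oriented structure by hand before being passed to `createDataFrame`. A short sketch of that pre-existing workaround (the default `_1`, `_2` column naming here is an assumption for illustration):

```python
import numpy as np

# Convert a 2d NumPy array into a list of rows plus explicit column names,
# which SparkSession.createDataFrame already accepts.
arr = np.array([[1, 2], [3, 4]])
rows = arr.tolist()                                   # [[1, 2], [3, 4]]
columns = [f"_{i + 1}" for i in range(arr.shape[1])]  # ["_1", "_2"]

# With a live session this would be:
#   spark.createDataFrame(rows, columns)
print(rows, columns)
```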

[GitHub] [spark] AmplabJenkins commented on pull request #38462: [SPARK-40533] [CONNECT] [PYTHON] Support most built-in literal types for Python in Spark Connect

2022-11-01 Thread GitBox
AmplabJenkins commented on PR #38462: URL: https://github.com/apache/spark/pull/38462#issuecomment-1299294257 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #38463: [SPARK-40374][SQL] Migrate type check failures of type creators onto error classes

2022-11-01 Thread GitBox
AmplabJenkins commented on PR #38463: URL: https://github.com/apache/spark/pull/38463#issuecomment-1299294202 Can one of the admins verify this patch?

[GitHub] [spark] amaliujia commented on a diff in pull request #38409: [SPARK-40930][CONNECT] Support Collect() in Python client

2022-11-01 Thread GitBox
amaliujia commented on code in PR #38409: URL: https://github.com/apache/spark/pull/38409#discussion_r1010931805 ## python/pyspark/sql/connect/dataframe.py: ## @@ -305,8 +308,12 @@ def _print_plan(self) -> str: return self._plan.print() return "" -def

[GitHub] [spark] dtenedor commented on a diff in pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options

2022-11-01 Thread GitBox
dtenedor commented on code in PR #38418: URL: https://github.com/apache/spark/pull/38418#discussion_r1010897010 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -1001,7 +1001,13 @@ createOrReplaceTableColTypeList ; createOrRepl

[GitHub] [spark] amaliujia commented on a diff in pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options

2022-11-01 Thread GitBox
amaliujia commented on code in PR #38418: URL: https://github.com/apache/spark/pull/38418#discussion_r1010890792 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -1001,7 +1001,13 @@ createOrReplaceTableColTypeList ; createOrRep

[GitHub] [spark] kristopherkane commented on pull request #38358: [SPARK-40588] FileFormatWriter materializes AQE plan before accessing outputOrdering

2022-11-01 Thread GitBox
kristopherkane commented on PR #38358: URL: https://github.com/apache/spark/pull/38358#issuecomment-1299119179 Thanks for the fix! Is it possible this could land in 3.1 as well?

[GitHub] [spark] grundprinzip commented on pull request #38470: [CONNECT] [DOC] Defining Spark Connect Client Connection String

2022-11-01 Thread GitBox
grundprinzip commented on PR #38470: URL: https://github.com/apache/spark/pull/38470#issuecomment-1299084859 Good point, I will incorporate that into the doc.

[GitHub] [spark] anchovYu commented on pull request #38169: [SPARK-40663][SQL] Migrate execution errors onto error classes: _LEGACY_ERROR_TEMP_2176-2220

2022-11-01 Thread GitBox
anchovYu commented on PR #38169: URL: https://github.com/apache/spark/pull/38169#issuecomment-1299073908 the title needs to be updated from 2220 to 2200 :)

[GitHub] [spark] amaliujia commented on pull request #38470: [CONNECT] [DOC] Defining Spark Connect Client Connection String

2022-11-01 Thread GitBox
amaliujia commented on PR #38470: URL: https://github.com/apache/spark/pull/38470#issuecomment-1299054036 Overall LGTM. Is the `user_id` (or the user session token) relevant to this doc? https://github.com/apache/spark/blob/8f6b18536e44ffd36656ceb56a434e399ad6d1b8/python/pyspark/sql/

[GitHub] [spark] amaliujia commented on pull request #38472: [SPARK-40989][CONNECT][PYTHON][TESTS] Improve `session.sql` testing coverage in Python client

2022-11-01 Thread GitBox
amaliujia commented on PR #38472: URL: https://github.com/apache/spark/pull/38472#issuecomment-1299035641 R: @zhengruifeng

[GitHub] [spark] amaliujia opened a new pull request, #38472: [SPARK-40989][CONNECT][PYTHON][TESTS] Improve `session.sql` testing coverage in Python client

2022-11-01 Thread GitBox
amaliujia opened a new pull request, #38472: URL: https://github.com/apache/spark/pull/38472 ### What changes were proposed in this pull request? This PR tests `session.sql` in Python client both in `toProto` path and the data collection path. ### Why are the change

[GitHub] [spark] gengliangwang commented on pull request #36698: [SPARK-39316][SQL] Merge PromotePrecision and CheckOverflow into decimal binary arithmetic

2022-11-01 Thread GitBox
gengliangwang commented on PR #36698: URL: https://github.com/apache/spark/pull/36698#issuecomment-1299022752 @ulysses-you Is the following query an actual bug before the refactor? Or did the refactor just remove the redundant cast? ``` SELECT CAST(1 AS DECIMAL(28, 2)) UNION ALL

[GitHub] [spark] amaliujia opened a new pull request, #38471: [SC-114545][SPARK-40883][CONNECT] Range.step is required and Python client should have a default value=1

2022-11-01 Thread GitBox
amaliujia opened a new pull request, #38471: URL: https://github.com/apache/spark/pull/38471 ### What changes were proposed in this pull request? To match the existing Python DataFrame API, this PR makes `Range.step` required, and the Python client keeps `1` as a default value
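The PR's point is that when a caller omits `step`, the Python client should fill in `1` before building the `Range` plan, matching `spark.range`'s documented default. An illustrative sketch (the helper below is hypothetical, not the PR's code):

```python
# Hypothetical helper showing the client-side default discussed in the PR:
# an omitted step becomes 1 before the Range proto is populated, so
# spark.range(0, 5) behaves like spark.range(0, 5, step=1).

def range_plan_args(start: int, end: int, step: int = 1) -> dict:
    """Arguments a client would place in a Range plan message."""
    return {"start": start, "end": end, "step": step}

assert range_plan_args(0, 5) == range_plan_args(0, 5, step=1)
print(list(range(0, 5, range_plan_args(0, 5)["step"])))  # [0, 1, 2, 3, 4]
```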

[GitHub] [spark] amaliujia commented on pull request #38471: [SC-114545][SPARK-40883][CONNECT] Range.step is required and Python client should have a default value=1

2022-11-01 Thread GitBox
amaliujia commented on PR #38471: URL: https://github.com/apache/spark/pull/38471#issuecomment-1299015533 R: @zhengruifeng I sent out this PR based on your suggestion.

[GitHub] [spark] carlfu-db commented on pull request #38404: [SPARK-40956] SQL Equivalent for Dataframe overwrite command

2022-11-01 Thread GitBox
carlfu-db commented on PR #38404: URL: https://github.com/apache/spark/pull/38404#issuecomment-1298960790 (screenshot: https://user-images.githubusercontent.com/114777395/199313517-3122d622-ba62-4ac5-8fbf-d01b4e59c394.png) I have rebased the PR onto the latest apache/master, not sure how to trigg

[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-11-01 Thread GitBox
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1010721467 ## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala: ## @@ -178,46 +176,73 @@ private[sql] object ProtobufUtils extends

[GitHub] [spark] jerrypeng commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-11-01 Thread GitBox
jerrypeng commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1010692304 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](spa

[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-01 Thread GitBox
leewyang commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1010663824 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,474 @@ def array_to_vector(col: Column) -> Column: return Column(sc._jvm.org.apache.spark.ml.functions.array

[GitHub] [spark] MaxGekk commented on a diff in pull request #38438: [SPARK-40748][SQL] Migrate type check failures of conditions onto error classes

2022-11-01 Thread GitBox
MaxGekk commented on code in PR #38438: URL: https://github.com/apache/spark/pull/38438#discussion_r1010683264 ## sql/core/src/test/java/test/org/apache/spark/sql/JavaColumnExpressionSuite.java: ## @@ -79,12 +83,16 @@ public void isInCollectionCheckExceptionMessage() { cr

[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-01 Thread GitBox
leewyang commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1010681773 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,474 @@ def array_to_vector(col: Column) -> Column: return Column(sc._jvm.org.apache.spark.ml.functions.array
