[GitHub] [spark] amaliujia commented on a diff in pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40164: URL: https://github.com/apache/spark/pull/40164#discussion_r1117874054 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2461,6 +2461,60 @@ class Dataset[T] private[sql] (val session: SparkSession

[GitHub] [spark] shrprasa commented on pull request #40128: [SPARK-42466][Core][K8S]: Cleanup k8s upload directory when job terminates

2023-02-24 Thread via GitHub
shrprasa commented on PR #40128: URL: https://github.com/apache/spark/pull/40128#issuecomment-1444968969 @holdenk @dongjoon-hyun Can you please review this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL ab

[GitHub] [spark] hvanhovell commented on a diff in pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40164: URL: https://github.com/apache/spark/pull/40164#discussion_r1117852573 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2461,6 +2461,60 @@ class Dataset[T] private[sql] (val session: SparkSessio

[GitHub] [spark] hvanhovell commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117852174 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] hvanhovell commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117852174 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] hvanhovell closed pull request #40145: [SPARK-42541][CONNECT] Support Pivot with provided pivot column values

2023-02-24 Thread via GitHub
hvanhovell closed pull request #40145: [SPARK-42541][CONNECT] Support Pivot with provided pivot column values URL: https://github.com/apache/spark/pull/40145 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above t

[GitHub] [spark] hvanhovell commented on pull request #40145: [SPARK-42541][CONNECT] Support Pivot with provided pivot column values

2023-02-24 Thread via GitHub
hvanhovell commented on PR #40145: URL: https://github.com/apache/spark/pull/40145#issuecomment-1444946241 Merging -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubs

[GitHub] [spark] hvanhovell commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117851049 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] zhenlineo commented on pull request #40169: [SPARK-42575][Connect][Scala] Make all client tests to extend from ConnectFunSuite

2023-02-24 Thread via GitHub
zhenlineo commented on PR #40169: URL: https://github.com/apache/spark/pull/40169#issuecomment-1444937486 cc @vicennial too! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] amaliujia commented on a diff in pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40167: URL: https://github.com/apache/spark/pull/40167#discussion_r1117848868 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -1906,6 +1906,100 @@ class Dataset[T] private[sql] (val session: SparkSessio

[GitHub] [spark] amaliujia commented on a diff in pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40166: URL: https://github.com/apache/spark/pull/40166#discussion_r1117829037 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -122,8 +122,8 @@ message Read { } message DataSource { -// (Required) S

[GitHub] [spark] amaliujia commented on a diff in pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40164: URL: https://github.com/apache/spark/pull/40164#discussion_r1117849043 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2461,6 +2461,60 @@ class Dataset[T] private[sql] (val session: SparkSession

[GitHub] [spark] amaliujia commented on a diff in pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40167: URL: https://github.com/apache/spark/pull/40167#discussion_r1117848868 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -1906,6 +1906,100 @@ class Dataset[T] private[sql] (val session: SparkSessio

[GitHub] [spark] zhenlineo commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
zhenlineo commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117848633 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] zhenlineo commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
zhenlineo commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117846685 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] zhenlineo commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
zhenlineo commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117846502 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] ueshin opened a new pull request, #40170: [SPARK-42574][CONNECT][PYTHON] Fix toPandas to handle duplicated column names

2023-02-24 Thread via GitHub
ueshin opened a new pull request, #40170: URL: https://github.com/apache/spark/pull/40170 ### What changes were proposed in this pull request? Fixes `DataFrame.toPandas` to handle duplicated column names. ### Why are the changes needed? Currently ```py spark.sql

[GitHub] [spark] hvanhovell commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117845951 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] zhenlineo commented on pull request #40169: [SPARK-42575][Connect][Scala] Make all client tests to extend from ConnectFunSuite

2023-02-24 Thread via GitHub
zhenlineo commented on PR #40169: URL: https://github.com/apache/spark/pull/40169#issuecomment-1444905157 cc @hvanhovell @amaliujia @LuciferYang I am a bit tired of losing `// scalastyle:ignore funsuite` when my imports get auto formatted. Let's move to this `ConnectFunSuite` -- This is

[GitHub] [spark] anishshri-db commented on pull request #40163: [SPARK-42567][SS][SQL] Track load time for state store provider and log warning if it exceeds threshold

2023-02-24 Thread via GitHub
anishshri-db commented on PR #40163: URL: https://github.com/apache/spark/pull/40163#issuecomment-1444903424 @HeartSaVioR - looks like the tests finished fine. Not sure why the Actions result is not updated here -- This is an automated message from the Apache Git Service. To respond to th

[GitHub] [spark] zhenlineo opened a new pull request, #40169: [SPARK-42575][Connect][Scala] Make all client tests to extend from ConnectFunSuite

2023-02-24 Thread via GitHub
zhenlineo opened a new pull request, #40169: URL: https://github.com/apache/spark/pull/40169 ### What changes were proposed in this pull request? Make all client tests to extend from ConnectFunSuite to avoid `// scalastyle:ignore funsuite` when extending directly from `AnyFunSuite`
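
For context, a minimal sketch of what such a shared base suite could look like (illustrative only; the actual ConnectFunSuite added in the PR may differ in detail, and a ScalaTest dependency is assumed):

```scala
import org.scalatest.funsuite.AnyFunSuite // scalastyle:ignore funsuite

// Centralizing the AnyFunSuite dependency in one base trait means individual client
// test suites extend ConnectFunSuite instead, and no longer need the
// `// scalastyle:ignore funsuite` comment that auto-formatting tends to drop.
trait ConnectFunSuite extends AnyFunSuite
```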

[GitHub] [spark] hvanhovell commented on a diff in pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40164: URL: https://github.com/apache/spark/pull/40164#discussion_r1117844243 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2461,6 +2461,60 @@ class Dataset[T] private[sql] (val session: SparkSessio

[GitHub] [spark] zhenlineo commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
zhenlineo commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117844235 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] hvanhovell commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117843680 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] amaliujia commented on a diff in pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40167: URL: https://github.com/apache/spark/pull/40167#discussion_r1117843266 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala: ## @@ -192,6 +192,12 @@ class SparkSession( new Dataset[T](this, plan)

[GitHub] [spark] amaliujia commented on a diff in pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40164: URL: https://github.com/apache/spark/pull/40164#discussion_r1117843149 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2461,6 +2461,60 @@ class Dataset[T] private[sql] (val session: SparkSession

[GitHub] [spark] hvanhovell commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117843055 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] amaliujia commented on a diff in pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40164: URL: https://github.com/apache/spark/pull/40164#discussion_r1117843058 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2461,6 +2461,60 @@ class Dataset[T] private[sql] (val session: SparkSession

[GitHub] [spark] HyukjinKwon closed pull request #40165: [SPARK-42568][CONNECT] Fix SparkConnectStreamHandler to handle configs properly while planning

2023-02-24 Thread via GitHub
HyukjinKwon closed pull request #40165: [SPARK-42568][CONNECT] Fix SparkConnectStreamHandler to handle configs properly while planning URL: https://github.com/apache/spark/pull/40165 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHu

[GitHub] [spark] HyukjinKwon commented on pull request #40165: [SPARK-42568][CONNECT] Fix SparkConnectStreamHandler to handle configs properly while planning

2023-02-24 Thread via GitHub
HyukjinKwon commented on PR #40165: URL: https://github.com/apache/spark/pull/40165#issuecomment-1444884892 Merged to master and branch-3.4. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset

2023-02-24 Thread via GitHub
HyukjinKwon commented on code in PR #40167: URL: https://github.com/apache/spark/pull/40167#discussion_r1117842681 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -1906,6 +1906,100 @@ class Dataset[T] private[sql] (val session: SparkSess

[GitHub] [spark] hvanhovell commented on a diff in pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40168: URL: https://github.com/apache/spark/pull/40168#discussion_r1117842608 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CompatibilitySuite.scala: ## @@ -69,30 +69,131 @@ class CompatibilitySuite extends

[GitHub] [spark] HyukjinKwon closed pull request #40150: [SPARK-41834][CONNECT] Implement SparkSession.conf

2023-02-24 Thread via GitHub
HyukjinKwon closed pull request #40150: [SPARK-41834][CONNECT] Implement SparkSession.conf URL: https://github.com/apache/spark/pull/40150 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specifi

[GitHub] [spark] HyukjinKwon commented on pull request #40150: [SPARK-41834][CONNECT] Implement SparkSession.conf

2023-02-24 Thread via GitHub
HyukjinKwon commented on PR #40150: URL: https://github.com/apache/spark/pull/40150#issuecomment-1444877430 Merged to master and branch-3.4. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

[GitHub] [spark] HyukjinKwon commented on pull request #40153: [SPARK-42547][PYTHON] Make PySpark working with Python 3.7

2023-02-24 Thread via GitHub
HyukjinKwon commented on PR #40153: URL: https://github.com/apache/spark/pull/40153#issuecomment-1444876363 I used `pip install -r dev/requirements.txt`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [spark] dongjoon-hyun commented on pull request #40153: [SPARK-42547][PYTHON] Make PySpark working with Python 3.7

2023-02-24 Thread via GitHub
dongjoon-hyun commented on PR #40153: URL: https://github.com/apache/spark/pull/40153#issuecomment-1444872170 Thank you so much, @HyukjinKwon ! May I ask which `requirements.txt` file is used there? -- This is an automated message from the Apache Git Service. To respond to the messa

[GitHub] [spark] zhenlineo opened a new pull request, #40168: [SPARK-42573] Enable binary compatibility tests on all major client APIs

2023-02-24 Thread via GitHub
zhenlineo opened a new pull request, #40168: URL: https://github.com/apache/spark/pull/40168 ### What changes were proposed in this pull request? Make binary compatibility check for SparkSession/Dataset/Column/functions etc. ### Why are the changes needed? Help us to have a good

[GitHub] [spark] hvanhovell commented on a diff in pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40167: URL: https://github.com/apache/spark/pull/40167#discussion_r1117835151 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala: ## @@ -192,6 +192,12 @@ class SparkSession( new Dataset[T](this, plan)

[GitHub] [spark] hvanhovell commented on a diff in pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40164: URL: https://github.com/apache/spark/pull/40164#discussion_r1117834737 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2461,6 +2461,60 @@ class Dataset[T] private[sql] (val session: SparkSessio

[GitHub] [spark] hvanhovell commented on a diff in pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-24 Thread via GitHub
hvanhovell commented on code in PR #40164: URL: https://github.com/apache/spark/pull/40164#discussion_r1117834635 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2461,6 +2461,60 @@ class Dataset[T] private[sql] (val session: SparkSessio

[GitHub] [spark] hvanhovell closed pull request #40158: [MINOR][CONNECT] Typo fixes & update comment

2023-02-24 Thread via GitHub
hvanhovell closed pull request #40158: [MINOR][CONNECT] Typo fixes & update comment URL: https://github.com/apache/spark/pull/40158 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comme

[GitHub] [spark] hvanhovell commented on pull request #40158: [MINOR][CONNECT] Typo fixes & update comment

2023-02-24 Thread via GitHub
hvanhovell commented on PR #40158: URL: https://github.com/apache/spark/pull/40158#issuecomment-1444830537 Merging. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] ueshin commented on pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source

2023-02-24 Thread via GitHub
ueshin commented on PR #40166: URL: https://github.com/apache/spark/pull/40166#issuecomment-1444823976 > What is the default source BTW? If format is not set, the value from SQL conf 'spark.sql.sources.default' will be used. -- This is an automated message from the Apache Git Servi
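
As a rough illustration of that behavior (a self-contained sketch with a local SparkSession; the path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object DefaultSourceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("default-source").getOrCreate()

    // "parquet" unless spark.sql.sources.default has been overridden.
    println(spark.conf.get("spark.sql.sources.default"))

    // With no explicit .format(...), load() resolves the path using that default source.
    spark.read.load("/tmp/example-data").show() // hypothetical path

    spark.stop()
  }
}
```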

[GitHub] [spark] amaliujia commented on a diff in pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40166: URL: https://github.com/apache/spark/pull/40166#discussion_r1117829037 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -122,8 +122,8 @@ message Read { } message DataSource { -// (Required) S

[GitHub] [spark] amaliujia commented on pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source

2023-02-24 Thread via GitHub
amaliujia commented on PR #40166: URL: https://github.com/apache/spark/pull/40166#issuecomment-1444807986 LGTM What is the default source BTW? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[GitHub] [spark] amaliujia commented on pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset

2023-02-24 Thread via GitHub
amaliujia commented on PR #40167: URL: https://github.com/apache/spark/pull/40167#issuecomment-1444806239 @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To un

[GitHub] [spark] amaliujia opened a new pull request, #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset

2023-02-24 Thread via GitHub
amaliujia opened a new pull request, #40167: URL: https://github.com/apache/spark/pull/40167 ### What changes were proposed in this pull request? Add temp view API to Dataset ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user
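
For readers unfamiliar with the API being added, these methods mirror the existing sql.Dataset temp view methods; a quick usage sketch, runnable in spark-shell (the view names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// `spark` stands for an already-created session; shown explicitly here for completeness.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(3)

// Session-scoped temporary view.
df.createOrReplaceTempView("numbers")
spark.sql("SELECT id * 2 AS doubled FROM numbers").show()

// Application-wide global view, queried through the global_temp database.
df.createOrReplaceGlobalTempView("numbers_global")
spark.sql("SELECT count(*) FROM global_temp.numbers_global").show()
```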

[GitHub] [spark] HyukjinKwon commented on pull request #40153: [SPARK-42547][PYTHON] Make PySpark working with Python 3.7

2023-02-24 Thread via GitHub
HyukjinKwon commented on PR #40153: URL: https://github.com/apache/spark/pull/40153#issuecomment-1444803655 Although I had to manually skip a couple of tests (due to my env issue), I tested Python 3.7, 3.8, 3.9 (by CI), 3.10 and the tests pass. -- This is an automated message from the Apa

[GitHub] [spark] github-actions[bot] commented on pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning

2023-02-24 Thread via GitHub
github-actions[bot] commented on PR #38464: URL: https://github.com/apache/spark/pull/38464#issuecomment-1444788358 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] ueshin commented on a diff in pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source

2023-02-24 Thread via GitHub
ueshin commented on code in PR #40166: URL: https://github.com/apache/spark/pull/40166#discussion_r1117822759 ## python/pyspark/sql/tests/test_readwriter.py: ## @@ -31,75 +31,77 @@ def test_save_and_load(self): df = self.df tmpPath = tempfile.mkdtemp()

[GitHub] [spark] ueshin commented on a diff in pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source

2023-02-24 Thread via GitHub
ueshin commented on code in PR #40166: URL: https://github.com/apache/spark/pull/40166#discussion_r1117822759 ## python/pyspark/sql/tests/test_readwriter.py: ## @@ -31,75 +31,77 @@ def test_save_and_load(self): df = self.df tmpPath = tempfile.mkdtemp()

[GitHub] [spark] ueshin opened a new pull request, #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source

2023-02-24 Thread via GitHub
ueshin opened a new pull request, #40166: URL: https://github.com/apache/spark/pull/40166 ### What changes were proposed in this pull request? Fixes `DataFrameReader` to use the default source. ### Why are the changes needed? ```py spark.read.load(path) ``` s

[GitHub] [spark] xinrong-meng commented on pull request #40104: [SPARK-42510][CONNECT][PYTHON] Implement `DataFrame.mapInPandas`

2023-02-24 Thread via GitHub
xinrong-meng commented on PR #40104: URL: https://github.com/apache/spark/pull/40104#issuecomment-1444765441 Merged to master and branch-3.4, thanks all! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [spark] xinrong-meng closed pull request #40104: [SPARK-42510][CONNECT][PYTHON] Implement `DataFrame.mapInPandas`

2023-02-24 Thread via GitHub
xinrong-meng closed pull request #40104: [SPARK-42510][CONNECT][PYTHON] Implement `DataFrame.mapInPandas` URL: https://github.com/apache/spark/pull/40104 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[GitHub] [spark] anishshri-db commented on a diff in pull request #40163: [SPARK-42567][SS][SQL] Track load time for state store provider and log warning if it exceeds threshold

2023-02-24 Thread via GitHub
anishshri-db commented on code in PR #40163: URL: https://github.com/apache/spark/pull/40163#discussion_r1117787543 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala: ## @@ -533,11 +539,22 @@ object StateStore extends Logging { }

[GitHub] [spark] HeartSaVioR commented on pull request #40161: [SPARK-42565][SS] Error log improvement for the lock acquisition of RocksDB state store instance

2023-02-24 Thread via GitHub
HeartSaVioR commented on PR #40161: URL: https://github.com/apache/spark/pull/40161#issuecomment-1444617521 Thanks! Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] HeartSaVioR closed pull request #40161: [SPARK-42565][SS] Error log improvement for the lock acquisition of RocksDB state store instance

2023-02-24 Thread via GitHub
HeartSaVioR closed pull request #40161: [SPARK-42565][SS] Error log improvement for the lock acquisition of RocksDB state store instance URL: https://github.com/apache/spark/pull/40161 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to Git

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40163: [SPARK-42567][SS][SQL] Track load time for state store provider and log warning if it exceeds threshold

2023-02-24 Thread via GitHub
HeartSaVioR commented on code in PR #40163: URL: https://github.com/apache/spark/pull/40163#discussion_r1117774538 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala: ## @@ -533,11 +539,22 @@ object StateStore extends Logging { }

[GitHub] [spark] srowen commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning

2023-02-24 Thread via GitHub
srowen commented on PR #37817: URL: https://github.com/apache/spark/pull/37817#issuecomment-1444598032 Looks like 1.22 removed it actually. That's still not recent. Yeah I think this is worth back porting -- This is an automated message from the Apache Git Service. To respond to the messa

[GitHub] [spark] amaliujia commented on a diff in pull request #40013: [SPARK-42367][CONNECT][PYTHON] `DataFrame.drop` should handle duplicated columns properly

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40013: URL: https://github.com/apache/spark/pull/40013#discussion_r1117739382 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -1346,16 +1346,16 @@ class SparkConnectPlanner(val se

[GitHub] [spark] amaliujia commented on pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-24 Thread via GitHub
amaliujia commented on PR #40164: URL: https://github.com/apache/spark/pull/40164#issuecomment-1444587648 @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To un

[GitHub] [spark] ueshin opened a new pull request, #40165: [SPARK-42568][CONNECT] Fix SparkConnectStreamHandler to handle configs properly while planning

2023-02-24 Thread via GitHub
ueshin opened a new pull request, #40165: URL: https://github.com/apache/spark/pull/40165 ### What changes were proposed in this pull request? Fixes `SparkConnectStreamHandler` to handle configs properly while planning. The whole process should be done in `session.withActive` to

[GitHub] [spark] amaliujia opened a new pull request, #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-24 Thread via GitHub
amaliujia opened a new pull request, #40164: URL: https://github.com/apache/spark/pull/40164 ### What changes were proposed in this pull request? Match https://github.com/apache/spark/blob/6a2433070e60ad02c69ae45706a49cdd0b88a082/python/pyspark/sql/connect/dataframe.py#L1500
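
A minimal sketch of the pattern under discussion (the helper and the specific method names below are illustrative, not the PR's exact code):

```scala
// Illustrative only: client methods that are not yet supported fail fast with a clear
// error instead of silently misbehaving. The exact set of methods is defined in the PR.
class UnsupportedApiExample {
  private def unsupported(methodName: String): Nothing =
    throw new UnsupportedOperationException(s"$methodName is not supported by this client yet.")

  def queryExecution: Nothing = unsupported("queryExecution") // hypothetical example member
  def rdd: Nothing = unsupported("rdd")                       // hypothetical example member
}
```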

[GitHub] [spark] aimtsou commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning

2023-02-24 Thread via GitHub
aimtsou commented on PR #37817: URL: https://github.com/apache/spark/pull/37817#issuecomment-1444586606 Yes we agree that users can limit their numpy system installation to < 1.20.0, if they use Spark 3.3 I will have to check and test the different versions but I believe according to

[GitHub] [spark] srowen commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning

2023-02-24 Thread via GitHub
srowen commented on PR #37817: URL: https://github.com/apache/spark/pull/37817#issuecomment-1444561437 Well, I think we're talking about numpy 1.20 here, not >1.20. You're correct that you therefore would not use the latest versions of numpy with Spark 3.3, but would work with 3.4. If that

[GitHub] [spark] aimtsou commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning

2023-02-24 Thread via GitHub
aimtsou commented on PR #37817: URL: https://github.com/apache/spark/pull/37817#issuecomment-1444549904 Hi @srowen, Thank you for your very prompt reply. You are not correct about the error, after 1.20.0 it creates an attribute error ``` if attr in __form

[GitHub] [spark] dtenedor commented on pull request #39449: [SPARK-40688][SQL] Support data masking built-in function 'mask_first_n'

2023-02-24 Thread via GitHub
dtenedor commented on PR #39449: URL: https://github.com/apache/spark/pull/39449#issuecomment-1444538939 @cloud-fan would you mind helping review this as well? It LGTM as of now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub

[GitHub] [spark] srowen commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning

2023-02-24 Thread via GitHub
srowen commented on PR #37817: URL: https://github.com/apache/spark/pull/37817#issuecomment-1444533942 This is just a deprecation warning, not an error, right? I don't see a particular urgency here. I don't think this is related to Databricks, particularly, either - Databricks can do wha

[GitHub] [spark] anishshri-db commented on pull request #40163: [SPARK-42567][SS][SQL] Track load time for state store provider and log warning if it exceeds threshold

2023-02-24 Thread via GitHub
anishshri-db commented on PR #40163: URL: https://github.com/apache/spark/pull/40163#issuecomment-1444530972 @HeartSaVioR - please take a look. Thx -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [spark] anishshri-db opened a new pull request, #40163: [SPARK-42567] Track load time for state store provider and log warning if it exceeds threshold

2023-02-24 Thread via GitHub
anishshri-db opened a new pull request, #40163: URL: https://github.com/apache/spark/pull/40163 ### What changes were proposed in this pull request? Track load time for state store provider and log warning if it exceeds threshold ### Why are the changes needed? We ha
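
A hedged sketch of the general pattern (the threshold, names, and log wording are illustrative; the PR defines the actual config and message):

```scala
import org.slf4j.LoggerFactory

// Illustrative pattern only: time the provider load and warn when it is unexpectedly slow.
object LoadTimeTrackingExample {
  private val log = LoggerFactory.getLogger(getClass)
  private val warnThresholdMs = 2000L // hypothetical threshold

  def timedLoad[T](providerId: String)(load: => T): T = {
    val startMs = System.currentTimeMillis()
    val result = load
    val elapsedMs = System.currentTimeMillis() - startMs
    if (elapsedMs > warnThresholdMs) {
      log.warn(s"Loading state store provider $providerId took $elapsedMs ms, " +
        s"exceeding the $warnThresholdMs ms threshold.")
    }
    result
  }
}
```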

[GitHub] [spark] aimtsou commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning

2023-02-24 Thread via GitHub
aimtsou commented on PR #37817: URL: https://github.com/apache/spark/pull/37817#issuecomment-1444515612 @srowen: Although this is causing an issue: If you try to build your own docker image of Spark including pyspark while trying to be compliant with Databricks you will observe that D

[GitHub] [spark] allisonport-db commented on a diff in pull request #38823: [SPARK-41290][SQL] Support GENERATED ALWAYS AS expressions for columns in create/replace table statements

2023-02-24 Thread via GitHub
allisonport-db commented on code in PR #38823: URL: https://github.com/apache/spark/pull/38823#discussion_r1117631114 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GeneratedColumn.scala: ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] allisonport-db commented on a diff in pull request #38823: [SPARK-41290][SQL] Support GENERATED ALWAYS AS expressions for columns in create/replace table statements

2023-02-24 Thread via GitHub
allisonport-db commented on code in PR #38823: URL: https://github.com/apache/spark/pull/38823#discussion_r1117619914 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GeneratedColumn.scala: ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] allisonport-db commented on a diff in pull request #38823: [SPARK-41290][SQL] Support GENERATED ALWAYS AS expressions for columns in create/replace table statements

2023-02-24 Thread via GitHub
allisonport-db commented on code in PR #38823: URL: https://github.com/apache/spark/pull/38823#discussion_r1117614206 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala: ## @@ -136,6 +136,13 @@ case class DataSourceAnalysis(analyzer: A

[GitHub] [spark] dongjoon-hyun closed pull request #40140: [3.3][SPARK-42286][SPARK-41991][SPARK-42473][SQL] Fallback to previous codegen code path for complex expr with CAST

2023-02-24 Thread via GitHub
dongjoon-hyun closed pull request #40140: [3.3][SPARK-42286][SPARK-41991][SPARK-42473][SQL] Fallback to previous codegen code path for complex expr with CAST URL: https://github.com/apache/spark/pull/40140 -- This is an automated message from the Apache Git Service. To respond to the message

[GitHub] [spark] dongjoon-hyun commented on pull request #40140: [3.3][SPARK-42286][SPARK-41991][SPARK-42473][SQL] Fallback to previous codegen code path for complex expr with CAST

2023-02-24 Thread via GitHub
dongjoon-hyun commented on PR #40140: URL: https://github.com/apache/spark/pull/40140#issuecomment-1444383771 cc @huaxingao too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comme

[GitHub] [spark] huanliwang-db opened a new pull request, #40162: [SPARK-42566][SS] RocksDB StateStore lock acquisition should happen after getting input iterator from inputRDD

2023-02-24 Thread via GitHub
huanliwang-db opened a new pull request, #40162: URL: https://github.com/apache/spark/pull/40162 The current behavior of the `compute` method in both `StateStoreRDD` and `ReadStateStoreRDD` is: we first get the state store instance and then get the input iterator for the inputRDD. Fo
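
A simplified, self-contained sketch of the ordering change being described (stand-in function types, not the actual StateStoreRDD code):

```scala
object StoreOrderingSketch {
  def computeWithStore[T, U](
      getInputIterator: () => Iterator[T],        // stands in for dataRDD.iterator(partition, ctxt)
      acquireStateStore: () => AutoCloseable,     // stands in for the StateStore lookup, which takes the instance lock
      process: (AutoCloseable, Iterator[T]) => Iterator[U]): Iterator[U] = {
    // Get the input iterator first: if the upstream compute is slow or fails,
    // the state store instance lock has not been taken yet.
    val inputIter = getInputIterator()
    val store = acquireStateStore()
    process(store, inputIter)
  }
}
```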

[GitHub] [spark] amaliujia commented on a diff in pull request #40150: [SPARK-41834][CONNECT] Implement SparkSession.conf

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40150: URL: https://github.com/apache/spark/pull/40150#discussion_r1117500909 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -183,6 +183,141 @@ message ExecutePlanResponse { } } +// The placeholder for the c

[GitHub] [spark] amaliujia commented on pull request #40143: [SPARK-42538][CONNECT] Make `sql.functions#lit` function support more types

2023-02-24 Thread via GitHub
amaliujia commented on PR #40143: URL: https://github.com/apache/spark/pull/40143#issuecomment-1444271928 LGTM but please rebase this PR to solve conflict. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

[GitHub] [spark] huanliwang-db opened a new pull request, #40161: [SPARK-42565][SS] Error log improvement for the lock acquisition of RocksDB state store instance

2023-02-24 Thread via GitHub
huanliwang-db opened a new pull request, #40161: URL: https://github.com/apache/spark/pull/40161 ``` "23/02/23 23:57:44 INFO Executor: Running task 2.0 in stage 57.1 (TID 363) "23/02/23 23:58:44 ERROR RocksDB StateStoreId(opId=0,partId=3,name=default): RocksDB instance could n

[GitHub] [spark] gengliangwang commented on pull request #40140: [3.3][SPARK-42286][SPARK-41991][SPARK-42473][SQL] Fallback to previous codegen code path for complex expr with CAST

2023-02-24 Thread via GitHub
gengliangwang commented on PR #40140: URL: https://github.com/apache/spark/pull/40140#issuecomment-1444261708 Thanks, merging to 3.3 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific c

[GitHub] [spark] hvanhovell commented on pull request #40143: [SPARK-42538][CONNECT] Make `sql.functions#lit` function support more types

2023-02-24 Thread via GitHub
hvanhovell commented on PR #40143: URL: https://github.com/apache/spark/pull/40143#issuecomment-1444249755 @LuciferYang @panbingkun I created an epic with a bunch of things you can pick up: https://issues.apache.org/jira/browse/SPARK-42554 -- This is an automated message from the Apache G

[GitHub] [spark] hvanhovell commented on pull request #40143: [SPARK-42538][CONNECT] Make `sql.functions#lit` function support more types

2023-02-24 Thread via GitHub
hvanhovell commented on PR #40143: URL: https://github.com/apache/spark/pull/40143#issuecomment-1444170840 @LuciferYang can you update your PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] khalidmammadov commented on pull request #40015: [SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

2023-02-24 Thread via GitHub
khalidmammadov commented on PR #40015: URL: https://github.com/apache/spark/pull/40015#issuecomment-1444167507 Thanks @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific co

[GitHub] [spark] ueshin commented on a diff in pull request #40150: [SPARK-41834][CONNECT] Implement SparkSession.conf

2023-02-24 Thread via GitHub
ueshin commented on code in PR #40150: URL: https://github.com/apache/spark/pull/40150#discussion_r1117409532 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -183,6 +183,141 @@ message ExecutePlanResponse { } } +// The placeholder for the conf

[GitHub] [spark] dtenedor commented on a diff in pull request #38823: [SPARK-41290][SQL] Support GENERATED ALWAYS AS expressions for columns in create/replace table statements

2023-02-24 Thread via GitHub
dtenedor commented on code in PR #38823: URL: https://github.com/apache/spark/pull/38823#discussion_r1117384372 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GeneratedColumn.scala: ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [spark] xkrogen commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client

2023-02-24 Thread via GitHub
xkrogen commented on code in PR #40147: URL: https://github.com/apache/spark/pull/40147#discussion_r1117343251 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -193,5 +247,9 @@ service SparkConnectService { // Analyzes a query and returns a [[An

[GitHub] [spark] xkrogen commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client

2023-02-24 Thread via GitHub
xkrogen commented on code in PR #40147: URL: https://github.com/apache/spark/pull/40147#discussion_r1117339081 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -183,6 +183,87 @@ message ExecutePlanResponse { } } +// Request to transfer client-l

[GitHub] [spark] xkrogen commented on pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client

2023-02-24 Thread via GitHub
xkrogen commented on PR #40147: URL: https://github.com/apache/spark/pull/40147#issuecomment-1444053186 > This code could be executed within the same engine, however we can also use separate processes or VMs to execute this code. Good point; nothing about the protocol itself precludes

[GitHub] [spark] amaliujia commented on a diff in pull request #40145: [SPARK-42541][CONNECT] Support Pivot with provided pivot column values

2023-02-24 Thread via GitHub
amaliujia commented on code in PR #40145: URL: https://github.com/apache/spark/pull/40145#discussion_r1117335432 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala: ## @@ -47,14 +48,18 @@ class RelationalGroupedDataset protected[sq

[GitHub] [spark] xkrogen commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client

2023-02-24 Thread via GitHub
xkrogen commented on code in PR #40147: URL: https://github.com/apache/spark/pull/40147#discussion_r1117330163 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -183,6 +183,60 @@ message ExecutePlanResponse { } } +// Request to transfer client-l

[GitHub] [spark] hvanhovell closed pull request #40156: [SPARK-41823][CONNECT] Scala Client resolve ambiguous columns in Join

2023-02-24 Thread via GitHub
hvanhovell closed pull request #40156: [SPARK-41823][CONNECT] Scala Client resolve ambiguous columns in Join URL: https://github.com/apache/spark/pull/40156 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [spark] xkrogen commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client

2023-02-24 Thread via GitHub
xkrogen commented on code in PR #40147: URL: https://github.com/apache/spark/pull/40147#discussion_r1117328132 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -183,6 +183,60 @@ message ExecutePlanResponse { } } +// Request to transfer client-l

[GitHub] [spark] hvanhovell commented on pull request #40156: [SPARK-41823][CONNECT] Scala Client resolve ambiguous columns in Join

2023-02-24 Thread via GitHub
hvanhovell commented on PR #40156: URL: https://github.com/apache/spark/pull/40156#issuecomment-1444039024 Merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsub

[GitHub] [spark] alkis commented on a diff in pull request #40121: [SPARK-42528][CORE] Optimize PercentileHeap

2023-02-24 Thread via GitHub
alkis commented on code in PR #40121: URL: https://github.com/apache/spark/pull/40121#discussion_r1117321434 ## core/src/test/scala/org/apache/spark/util/collection/PercentileHeapSuite.scala: ## @@ -17,71 +17,73 @@ package org.apache.spark.util.collection -import java.util.

[GitHub] [spark] shrprasa commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2023-02-24 Thread via GitHub
shrprasa commented on PR #37880: URL: https://github.com/apache/spark/pull/37880#issuecomment-1444006843 @holdenk @gaborgsomogyi @HyukjinKwon @Ngone51 will really appreciate if any one of you can review this PR. I am not sure why no one is responding even after several pings over the last 6

[GitHub] [spark] xkrogen commented on pull request #40144: [SPARK-42539][SQL][HIVE] Elminiate separate classloader when using 'builtin' Hive version for metadata client

2023-02-24 Thread via GitHub
xkrogen commented on PR #40144: URL: https://github.com/apache/spark/pull/40144#issuecomment-1444004552 Great question @cloud-fan , and actually no, we don't. For all of the other values of `spark.sql.hive.metastore.jars` besides 'builtin', the user JARs are not included at all ([refer to t

[GitHub] [spark] ueshin commented on a diff in pull request #40150: [SPARK-41834][CONNECT] Implement SparkSession.conf

2023-02-24 Thread via GitHub
ueshin commented on code in PR #40150: URL: https://github.com/apache/spark/pull/40150#discussion_r1117302257 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -183,6 +183,141 @@ message ExecutePlanResponse { } } +// The placeholder for the conf

[GitHub] [spark] vicennial commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client

2023-02-24 Thread via GitHub
vicennial commented on code in PR #40147: URL: https://github.com/apache/spark/pull/40147#discussion_r1117229154 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -183,6 +183,60 @@ message ExecutePlanResponse { } } +// Request to transfer client

[GitHub] [spark] vicennial commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client

2023-02-24 Thread via GitHub
vicennial commented on code in PR #40147: URL: https://github.com/apache/spark/pull/40147#discussion_r1117224826 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -183,6 +183,60 @@ message ExecutePlanResponse { } } +// Request to transfer client
