Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-09-01 Thread via GitHub
SemyonSinchenko commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2323232116 @HyukjinKwon @zhengruifeng @cloud-fan Sorry for tagging, but could you take another look? I fixed everything from the last review round. Thanks in advance!

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-08-21 Thread via GitHub
SemyonSinchenko commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2301302305 @HyukjinKwon @zhengruifeng @cloud-fan Sorry for tagging, but could you take another look? I fixed everything from the last review round. Thanks in advance!

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-07-30 Thread via GitHub
SemyonSinchenko commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2257672072
> @SemyonSinchenko thanks for the contribution! I have two high-level questions:
>
> * How do you use this sizeInBytes? Is it accurate in your workload?
> * Shall we expo

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-07-29 Thread via GitHub
cloud-fan commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2257228155 @SemyonSinchenko thanks for the contribution! I have two high-level questions:
- How do you use this sizeInBytes? Is it accurate in your workload?
- Shall we expose more stats

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-07-27 Thread via GitHub
SemyonSinchenko commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2254250916 @HyukjinKwon @zhengruifeng Sorry for tagging, but could you take another look? I fixed everything from the last review round. Thanks in advance!

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-06-05 Thread via GitHub
SemyonSinchenko commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2150609665 @HyukjinKwon I'm sorry for tagging you again, but could you take a look? Thanks in advance!

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-28 Thread via GitHub
SemyonSinchenko commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2135617854 Changes from the last two commits (actual changes marked in bold):
- resolve merge conflicts
- re-generate proto files for PySpark
- **update docstring in** `dataframe.py`*

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-20 Thread via GitHub
SemyonSinchenko commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2120307028 @HyukjinKwon Sorry for tagging, but could you please take another look? Thanks in advance!

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-07 Thread via GitHub
SemyonSinchenko commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1592933898 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -283,6 +283,16 @@ class Dataset[T] private[sql] ( def printSchema(le

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-07 Thread via GitHub
SemyonSinchenko commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2099107642 New changes:
- fixes from comments
- **changing the type from Long to BigInteger** (`bytes` in proto)
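
The move from Long to BigInteger matters because Spark's plan statistics keep `sizeInBytes` as an arbitrary-precision integer, so an estimate can exceed the signed 64-bit range. Below is a minimal sketch of how a Python client might round-trip such a value through a proto `bytes` field. It assumes, without confirmation from this thread, that the payload uses the big-endian two's-complement layout produced by Java's `BigInteger.toByteArray()`; the helper names `decode_big_integer` and `encode_big_integer` are hypothetical and are not taken from the PR.

```python
def decode_big_integer(payload: bytes) -> int:
    """Decode big-endian two's-complement bytes into a Python int.

    Assumption: the payload layout matches Java's BigInteger.toByteArray().
    """
    return int.from_bytes(payload, byteorder="big", signed=True)


def encode_big_integer(value: int) -> bytes:
    """Encode an int as minimal big-endian two's-complement bytes."""
    # Count the magnitude bits, then add one sign bit and round up to whole bytes.
    magnitude_bits = (value if value >= 0 else ~value).bit_length()
    return value.to_bytes(magnitude_bits // 8 + 1, byteorder="big", signed=True)


# 2**63 is one past the largest signed 64-bit value, so it cannot be a Long.
too_big_for_long = 2**63
payload = encode_big_integer(too_big_for_long)
assert decode_big_integer(payload) == too_big_for_long
```

Python integers are unbounded, so once the bytes are decoded no further widening is needed on the client side.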

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-07 Thread via GitHub
SemyonSinchenko commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1592933193
## .gitignore: ##
@@ -26,6 +26,7 @@
 .scala_dependencies
 .settings
 .vscode
+.dir-locals.el
Review Comment: Done

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-07 Thread via GitHub
SemyonSinchenko commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1592932992 ## python/pyspark/sql/dataframe.py: ## @@ -657,6 +657,19 @@ def printSchema(self, level: Optional[int] = None) -> None: """ ... +@dispat

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-06 Thread via GitHub
zhengruifeng commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1591773238 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -283,6 +283,16 @@ class Dataset[T] private[sql] ( def printSchema(level

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-06 Thread via GitHub
HyukjinKwon commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1591746209 ## python/pyspark/sql/connect/client/core.py: ## @@ -1157,6 +1163,20 @@ def _analyze_plan_request_with_metadata(self) -> pb2.AnalyzePlanRequest: req.u

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-06 Thread via GitHub
HyukjinKwon commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1591746323 ## python/pyspark/sql/dataframe.py: ## @@ -657,6 +657,19 @@ def printSchema(self, level: Optional[int] = None) -> None: """ ... +@dispatch_d

Re: [PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-06 Thread via GitHub
HyukjinKwon commented on code in PR #46368: URL: https://github.com/apache/spark/pull/46368#discussion_r1591745159
## .gitignore: ##
@@ -26,6 +26,7 @@
 .scala_dependencies
 .settings
 .vscode
+.dir-locals.el
Review Comment: Let's remove this.

[PR] [SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes [spark]

2024-05-03 Thread via GitHub
SemyonSinchenko opened a new pull request, #46368: URL: https://github.com/apache/spark/pull/46368
### What changes were proposed in this pull request?
In PySpark Connect there is no access to the JVM to call `queryExecution().optimizedPlan.stats`. So, there is no way to get information
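
For context, the JVM-backed path that the description refers to can be sketched as follows for a classic (non-Connect) PySpark session. This illustrates the workaround the PR aims to make unnecessary, not the API the PR adds. The function name `estimated_size_in_bytes` is hypothetical, `_jdf` is a private attribute that may change between releases, and the sketch assumes the py4j handle renders the JVM value as a decimal string.

```python
from typing import Any


def estimated_size_in_bytes(df: Any) -> int:
    """Return the optimizer's size estimate for a classic PySpark DataFrame.

    Hypothetical helper: relies on the private `_jdf` py4j handle, which is
    assumed to be unavailable on Spark Connect DataFrames.
    """
    # Walk the JVM objects: Dataset -> QueryExecution -> optimized plan -> stats.
    stats = df._jdf.queryExecution().optimizedPlan().stats()
    # sizeInBytes is arbitrary-precision on the JVM side; converting through
    # its string form keeps values that do not fit in 64 bits intact.
    return int(str(stats.sizeInBytes()))
```

With a classic session this would be called as `estimated_size_in_bytes(spark.range(1000))`. The result is an optimizer estimate, so it may differ substantially from the real data size.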