[GitHub] [spark] liuzqt commented on pull request #38064: [SPARK-40622][SQL][CORE] Result of a single task in collect() must fit in 2GB

2022-11-10 Thread GitBox
liuzqt commented on PR #38064: URL: https://github.com/apache/spark/pull/38064#issuecomment-1311348015 @mridulm I've tried `local-cluster[1,1,3072]`, but it doesn't seem to help. Is there any way to increase the JVM memory in the GitHub Actions job?
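
For readers skimming the thread: the three numbers in the `local-cluster[1,1,3072]` master string are the number of workers, cores per worker, and memory per worker in MB, so the setting above asks for one 3 GiB worker. It raises worker/executor memory only, not the heap of the JVM driving the test, which may be why it did not help here. A minimal sketch of how the master string is used (illustrative only, not taken from the PR):

```scala
import org.apache.spark.sql.SparkSession

object LocalClusterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      // 1 worker, 1 core per worker, 3072 MB of memory per worker
      .master("local-cluster[1,1,3072]")
      .appName("local-cluster-sketch")
      .getOrCreate()
    try {
      // A trivial job just to exercise the pseudo-cluster.
      println(spark.range(10).count())
    } finally {
      spark.stop()
    }
  }
}
```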

[GitHub] [spark] cloud-fan closed pull request #38604: [SPARK-41102][CONNECT] Merge SparkConnectPlanner and SparkConnectCommandPlanner

2022-11-10 Thread GitBox
cloud-fan closed pull request #38604: [SPARK-41102][CONNECT] Merge SparkConnectPlanner and SparkConnectCommandPlanner URL: https://github.com/apache/spark/pull/38604

[GitHub] [spark] cloud-fan commented on pull request #38604: [SPARK-41102][CONNECT] Merge SparkConnectPlanner and SparkConnectCommandPlanner

2022-11-10 Thread GitBox
cloud-fan commented on PR #38604: URL: https://github.com/apache/spark/pull/38604#issuecomment-1311347785 thanks, merging to master!

[GitHub] [spark] cloud-fan commented on a diff in pull request #38604: [SPARK-41102][CONNECT] Merge SparkConnectPlanner and SparkConnectCommandPlanner

2022-11-10 Thread GitBox
cloud-fan commented on code in PR #38604: URL: https://github.com/apache/spark/pull/38604#discussion_r1019959789 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -50,9 +49,9 @@ class SparkConnectStreamHandler(respons

[GitHub] [spark] LuciferYang commented on pull request #38091: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
LuciferYang commented on PR #38091: URL: https://github.com/apache/spark/pull/38091#issuecomment-1311346776 @mridulm or call `TestUtils.configTestLog4j2("DEBUG")` before this test
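
A minimal sketch of the suggestion quoted above, calling the helper at the top of the affected test so that DEBUG output ends up in the captured unit-test log; the suite name and test body are placeholders rather than code from the PR:

```scala
import org.apache.spark.{SparkFunSuite, TestUtils}

class FlakyTestDebugSuite extends SparkFunSuite {
  test("SPARK-40096: reproduce the flaky case with DEBUG logging") {
    // Raise the log level for this JVM before the assertions run,
    // as suggested in the comment above.
    TestUtils.configTestLog4j2("DEBUG")
    // ... the original test body would follow here ...
  }
}
```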

[GitHub] [spark] LuciferYang commented on pull request #38091: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
LuciferYang commented on PR #38091: URL: https://github.com/apache/spark/pull/38091#issuecomment-1311343738 Maybe we can modify `src/test/resources/log4j2.properties` to print all logs to stdout?

[GitHub] [spark] LuciferYang commented on pull request #38620: [SPARK-41113][BUILD] Upgrade sbt to 1.8.0

2022-11-10 Thread GitBox
LuciferYang commented on PR #38620: URL: https://github.com/apache/spark/pull/38620#issuecomment-1311333706 test first

[GitHub] [spark] LuciferYang opened a new pull request, #38620: [SPARK-41113][BUILD] Upgrade sbt to 1.8.0

2022-11-10 Thread GitBox
LuciferYang opened a new pull request, #38620: URL: https://github.com/apache/spark/pull/38620 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was thi

[GitHub] [spark] ulysses-you commented on pull request #38619: [SPARK-41112][SQL] RuntimeFilter should apply ColumnPruning eagerly with in-subquery filter

2022-11-10 Thread GitBox
ulysses-you commented on PR #38619: URL: https://github.com/apache/spark/pull/38619#issuecomment-1311331866 cc @wangyum @cloud-fan @sigmod thank you

[GitHub] [spark] ulysses-you commented on a diff in pull request #38619: [SPARK-41112][SQL] RuntimeFilter should apply ColumnPruning eagerly with in-subquery filter

2022-11-10 Thread GitBox
ulysses-you commented on code in PR #38619: URL: https://github.com/apache/spark/pull/38619#discussion_r1019946997 ## sql/core/src/test/scala/org/apache/spark/sql/InjectRuntimeFilterSuite.scala: ## @@ -257,6 +257,11 @@ class InjectRuntimeFilterSuite extends QueryTest with SQLTe

[GitHub] [spark] ulysses-you opened a new pull request, #38619: [SPARK-41112][SQL] RuntimeFilter should apply ColumnPruning eagerly with in-subquery filter

2022-11-10 Thread GitBox
ulysses-you opened a new pull request, #38619: URL: https://github.com/apache/spark/pull/38619 ### What changes were proposed in this pull request? Apply ColumnPruning for the in-subquery filter. Note that the bloom filter side has already been fixed by https://github.com/apach

[GitHub] [spark] Ngone51 commented on pull request #38064: [SPARK-40622][SQL][CORE] Result of a single task in collect() must fit in 2GB

2022-11-10 Thread GitBox
Ngone51 commented on PR #38064: URL: https://github.com/apache/spark/pull/38064#issuecomment-1311315439 Should the PR title be changed to something like "Remove the limitation of a single task result must fit in 2GB"?

[GitHub] [spark] mridulm commented on pull request #38617: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
mridulm commented on PR #38617: URL: https://github.com/apache/spark/pull/38617#issuecomment-1311311848 Can you merge this if the tests pass @HyukjinKwon ? I might not be online tomorrow and it is getting late tonight for me :-)

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38616: [SPARK-41110][CONNECT][PYTHON] Implement `DataFrame.sparkSession` in Python client

2022-11-10 Thread GitBox
HyukjinKwon commented on code in PR #38616: URL: https://github.com/apache/spark/pull/38616#discussion_r1019909673 ## python/pyspark/sql/connect/dataframe.py: ## @@ -143,6 +143,17 @@ def columns(self) -> List[str]: return self.schema().names +def sparkSession(se

[GitHub] [spark] HyukjinKwon commented on pull request #38468: [SPARK-41005][CONNECT][PYTHON] Arrow-based collect

2022-11-10 Thread GitBox
HyukjinKwon commented on PR #38468: URL: https://github.com/apache/spark/pull/38468#issuecomment-1311310134 Made another PR to refactor and deduplicate the Arrow code. PTAL: https://github.com/apache/spark/pull/38618

[GitHub] [spark] HyukjinKwon opened a new pull request, #38618: [SPARK-41108][SPARK-41005][CONNECT][FOLLOW-UP] Deduplicate ArrowConverters codes

2022-11-10 Thread GitBox
HyukjinKwon opened a new pull request, #38618: URL: https://github.com/apache/spark/pull/38618 ### What changes were proposed in this pull request? This PR is a followup of both https://github.com/apache/spark/pull/38468 and https://github.com/apache/spark/pull/38612 that proposes to

[GitHub] [spark] beatbull commented on pull request #33828: [SPARK-36579][CORE][SQL] Make spark source stagingDir can be customized

2022-11-10 Thread GitBox
beatbull commented on PR #33828: URL: https://github.com/apache/spark/pull/33828#issuecomment-1311299286 Hi, sadly this PR got closed (automatically due to inactivity). We'd be interested in this feature & config option since the ".spark-staging-*" folders are causing trouble e.g. when usin

[GitHub] [spark] panbingkun commented on a diff in pull request #38555: [SPARK-41044][SQL] Convert DATATYPE_MISMATCH.UNSPECIFIED_FRAME to INTERNAL_ERROR

2022-11-10 Thread GitBox
panbingkun commented on code in PR #38555: URL: https://github.com/apache/spark/pull/38555#discussion_r1019880123 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala: ## @@ -66,7 +66,13 @@ case class WindowSpecDefinition( override

[GitHub] [spark] MaxGekk closed pull request #38582: [SPARK-41095][SQL] Convert unresolved operators to internal errors

2022-11-10 Thread GitBox
MaxGekk closed pull request #38582: [SPARK-41095][SQL] Convert unresolved operators to internal errors URL: https://github.com/apache/spark/pull/38582

[GitHub] [spark] MaxGekk commented on pull request #38582: [SPARK-41095][SQL] Convert unresolved operators to internal errors

2022-11-10 Thread GitBox
MaxGekk commented on PR #38582: URL: https://github.com/apache/spark/pull/38582#issuecomment-1311288176 Merging to master. Thank you, @cloud-fan and @LuciferYang for review.

[GitHub] [spark] amaliujia commented on a diff in pull request #38604: [SPARK-41102][CONNECT] Merge SparkConnectPlanner and SparkConnectCommandPlanner

2022-11-10 Thread GitBox
amaliujia commented on code in PR #38604: URL: https://github.com/apache/spark/pull/38604#discussion_r1019878719 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -50,9 +49,9 @@ class SparkConnectStreamHandler(respons

[GitHub] [spark] MaxGekk closed pull request #38572: [SPARK-41059][SQL] Rename `_LEGACY_ERROR_TEMP_2420` to `NESTED_AGGREGATE_FUNCTION`

2022-11-10 Thread GitBox
MaxGekk closed pull request #38572: [SPARK-41059][SQL] Rename `_LEGACY_ERROR_TEMP_2420` to `NESTED_AGGREGATE_FUNCTION` URL: https://github.com/apache/spark/pull/38572

[GitHub] [spark] panbingkun commented on a diff in pull request #38555: [SPARK-41044][SQL] Convert DATATYPE_MISMATCH.UNSPECIFIED_FRAME to INTERNAL_ERROR

2022-11-10 Thread GitBox
panbingkun commented on code in PR #38555: URL: https://github.com/apache/spark/pull/38555#discussion_r1019878522 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala: ## @@ -57,16 +58,17 @@ case class WindowSpecDefinition( fram

[GitHub] [spark] panbingkun commented on a diff in pull request #38555: [SPARK-41044][SQL] Convert DATATYPE_MISMATCH.UNSPECIFIED_FRAME to INTERNAL_ERROR

2022-11-10 Thread GitBox
panbingkun commented on code in PR #38555: URL: https://github.com/apache/spark/pull/38555#discussion_r1019878119 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala: ## @@ -66,7 +66,13 @@ case class WindowSpecDefinition( override

[GitHub] [spark] MaxGekk commented on pull request #38572: [SPARK-41059][SQL] Rename `_LEGACY_ERROR_TEMP_2420` to `NESTED_AGGREGATE_FUNCTION`

2022-11-10 Thread GitBox
MaxGekk commented on PR #38572: URL: https://github.com/apache/spark/pull/38572#issuecomment-1311286385 +1, LGTM. Merging to master. Thank you, @itholic.

[GitHub] [spark] panbingkun commented on a diff in pull request #38555: [SPARK-41044][SQL] Convert DATATYPE_MISMATCH.UNSPECIFIED_FRAME to INTERNAL_ERROR

2022-11-10 Thread GitBox
panbingkun commented on code in PR #38555: URL: https://github.com/apache/spark/pull/38555#discussion_r1019877660 ## core/src/main/resources/error/error-classes.json: ## @@ -219,6 +219,11 @@ "Input to the function cannot contain elements of the \"MAP\" type. In Spar

[GitHub] [spark] LuciferYang commented on a diff in pull request #38609: [WIP][SPARK-40593][CONNECT] Add profile to make user can specify custom `protocExecutable` and `pluginExecutable` when building

2022-11-10 Thread GitBox
LuciferYang commented on code in PR #38609: URL: https://github.com/apache/spark/pull/38609#discussion_r1019875427 ## project/SparkBuild.scala: ## @@ -109,6 +109,16 @@ object SparkBuild extends PomBuild { if (profiles.contains("jdwp-test-debug")) { sys.props.put("tes

[GitHub] [spark] mridulm commented on a diff in pull request #38617: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
mridulm commented on code in PR #38617: URL: https://github.com/apache/spark/pull/38617#discussion_r1019866602 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4559,8 +4564,8 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkCo

[GitHub] [spark] mridulm commented on pull request #38617: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
mridulm commented on PR #38617: URL: https://github.com/apache/spark/pull/38617#issuecomment-1311274093 I am still not able to reproduce this locally - but logically, this looks like the right fix.

[GitHub] [spark] mridulm commented on pull request #38617: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
mridulm commented on PR #38617: URL: https://github.com/apache/spark/pull/38617#issuecomment-1311267937 +CC @HyukjinKwon, @LuciferYang, @wankunde

[GitHub] [spark] mridulm commented on a diff in pull request #38617: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
mridulm commented on code in PR #38617: URL: https://github.com/apache/spark/pull/38617#discussion_r1019860399 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4559,8 +4563,8 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkCo

[GitHub] [spark] mridulm commented on a diff in pull request #38617: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
mridulm commented on code in PR #38617: URL: https://github.com/apache/spark/pull/38617#discussion_r1019860144 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4533,16 +4533,20 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSpark

[GitHub] [spark] mridulm commented on a diff in pull request #38617: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
mridulm commented on code in PR #38617: URL: https://github.com/apache/spark/pull/38617#discussion_r1019859895 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4533,16 +4533,20 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSpark

[GitHub] [spark] mridulm opened a new pull request, #38617: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
mridulm opened a new pull request, #38617: URL: https://github.com/apache/spark/pull/38617 ### What changes were proposed in this pull request? Fix flaky test failure ### Why are the changes needed? MT-safety issue in test ### Does this PR introduce _any_ user-facing chan

[GitHub] [spark] rangadi commented on a diff in pull request #38603: [SPARK-41101][PYTHON][PROTOBUF] Message classname support for PYSPARK-PROTOBUF

2022-11-10 Thread GitBox
rangadi commented on code in PR #38603: URL: https://github.com/apache/spark/pull/38603#discussion_r1019851951 ## python/pyspark/sql/protobuf/functions.py: ## @@ -32,7 +32,7 @@ def from_protobuf( data: "ColumnOrName", messageName: str, -descFilePath: str, +des

[GitHub] [spark] rangadi commented on a diff in pull request #38603: [SPARK-41101][PYTHON][PROTOBUF] Message classname support for PYSPARK-PROTOBUF

2022-11-10 Thread GitBox
rangadi commented on code in PR #38603: URL: https://github.com/apache/spark/pull/38603#discussion_r1019850935 ## python/pyspark/sql/protobuf/functions.py: ## @@ -48,8 +48,11 @@ def from_protobuf( -- data : :class:`~pyspark.sql.Column` or str the binar

[GitHub] [spark] amaliujia opened a new pull request, #38616: [SPARK-41110][CONNECT][PYTHON] Implement `DataFrame.sparkSession` in Python client

2022-11-10 Thread GitBox
amaliujia opened a new pull request, #38616: URL: https://github.com/apache/spark/pull/38616 ### What changes were proposed in this pull request? This PR implements `DataFrame.sparkSession` in the Python client. The only difference between this API and the one in PySpark is that t
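
For context, this mirrors the existing Scala Dataset API, where every Dataset keeps a reference to the session that created it; a small sketch against regular (non-Connect) Spark, not code from this PR:

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionPropertySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sparkSession-property-sketch")
      .getOrCreate()

    val df = spark.range(3).toDF("id")
    // Dataset.sparkSession returns the session that created the DataFrame.
    assert(df.sparkSession eq spark)
    spark.stop()
  }
}
```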

[GitHub] [spark] zhengchenyu commented on pull request #37949: [SPARK-40504][YARN] Make yarn appmaster load config from client

2022-11-10 Thread GitBox
zhengchenyu commented on PR #37949: URL: https://github.com/apache/spark/pull/37949#issuecomment-1311254076 @xkrogen Thanks for your review. In our cluster, YARN_CONF_DIR is the same as HADOOP_CONF_DIR. SparkHadoopUtil.newConfiguration is different from SparkHadoopUtil.get.newConfig

[GitHub] [spark] mridulm commented on pull request #38091: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
mridulm commented on PR #38091: URL: https://github.com/apache/spark/pull/38091#issuecomment-1311254131 Unfortunately, I did not find the unit test log files in this - based on a local build, it is at `core/target/unit-tests.log`. Is there a way to get to this @HyukjinKwon? Thanks!

[GitHub] [spark] HyukjinKwon commented on pull request #38612: [SPARK-41108][CONNECT] Control the max size of arrow batch

2022-11-10 Thread GitBox
HyukjinKwon commented on PR #38612: URL: https://github.com/apache/spark/pull/38612#issuecomment-1311251418 Merged to master.

[GitHub] [spark] HyukjinKwon closed pull request #38612: [SPARK-41108][CONNECT] Control the max size of arrow batch

2022-11-10 Thread GitBox
HyukjinKwon closed pull request #38612: [SPARK-41108][CONNECT] Control the max size of arrow batch URL: https://github.com/apache/spark/pull/38612

[GitHub] [spark] HyukjinKwon commented on pull request #38612: [SPARK-41108][CONNECT] Control the max size of arrow batch

2022-11-10 Thread GitBox
HyukjinKwon commented on PR #38612: URL: https://github.com/apache/spark/pull/38612#issuecomment-1311251340 Let me actually merge and refactor this out. I am working on it actually.

[GitHub] [spark] panbingkun opened a new pull request, #38615: [SPARK-41109][SQL] Rename the error class _LEGACY_ERROR_TEMP_1216 to INVALID_LIKE_PATTERN

2022-11-10 Thread GitBox
panbingkun opened a new pull request, #38615: URL: https://github.com/apache/spark/pull/38615 ### What changes were proposed in this pull request? In the PR, I propose to rename the legacy error class _LEGACY_ERROR_TEMP_1216 to INVALID_LIKE_PATTERN. ### Why are the changes needed?

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38612: [SPARK-41108][CONNECT] Control the max size of arrow batch

2022-11-10 Thread GitBox
zhengruifeng commented on code in PR #38612: URL: https://github.com/apache/spark/pull/38612#discussion_r1019843494 ## sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala: ## @@ -161,17 +166,23 @@ private[sql] object ArrowConverters extends Logging

[GitHub] [spark] amaliujia commented on a diff in pull request #38604: [SPARK-41102][CONNECT] Merge SparkConnectPlanner and SparkConnectCommandPlanner

2022-11-10 Thread GitBox
amaliujia commented on code in PR #38604: URL: https://github.com/apache/spark/pull/38604#discussion_r1019842183 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -50,9 +49,9 @@ class SparkConnectStreamHandler(respons

[GitHub] [spark] HyukjinKwon commented on pull request #38091: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
HyukjinKwon commented on PR #38091: URL: https://github.com/apache/spark/pull/38091#issuecomment-1311241568 https://pipelines.actions.githubusercontent.com/serviceHosts/03398d36-4378-4d47-a936-fba0a5e8ccb9/_apis/pipelines/1/runs/194716/signedlogcontent/21?urlExpires=2022-11-11T05%3A16%3A59.8

[GitHub] [spark] cloud-fan commented on a diff in pull request #38595: [SPARK-41090][SQL] Fix view not found issue for `db_name.view_name`

2022-11-10 Thread GitBox
cloud-fan commented on code in PR #38595: URL: https://github.com/apache/spark/pull/38595#discussion_r1019839623 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -3804,6 +3804,13 @@ class Dataset[T] private[sql]( } catch { case _: ParseException =

[GitHub] [spark] HyukjinKwon commented on pull request #38609: [WIP][SPARK-40593][CONNECT] Add profile to make user can specify custom `protocExecutable` and `pluginExecutable` when building connect m

2022-11-10 Thread GitBox
HyukjinKwon commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1311239503 cc @grundprinzip @amaliujia FYI

[GitHub] [spark] zhengruifeng commented on pull request #38614: [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob`

2022-11-10 Thread GitBox
zhengruifeng commented on PR #38614: URL: https://github.com/apache/spark/pull/38614#issuecomment-1311239224 close this PR in favor of https://github.com/apache/spark/pull/38613

[GitHub] [spark] zhengruifeng closed pull request #38614: [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob`

2022-11-10 Thread GitBox
zhengruifeng closed pull request #38614: [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob` URL: https://github.com/apache/spark/pull/38614

[GitHub] [spark] mridulm commented on pull request #38091: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-10 Thread GitBox
mridulm commented on PR #38091: URL: https://github.com/apache/spark/pull/38091#issuecomment-1311239134 Same here @LuciferYang, I am not able to reproduce it locally. @HyukjinKwon, is there a way to get to the surefire-reports log files from CI?

[GitHub] [spark] HyukjinKwon commented on pull request #38614: [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob`

2022-11-10 Thread GitBox
HyukjinKwon commented on PR #38614: URL: https://github.com/apache/spark/pull/38614#issuecomment-1311238885 https://github.com/apache/spark/pull/38613 will handle this actually. Let's leave this closed.

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38612: [SPARK-41108][CONNECT] Control the max size of arrow batch

2022-11-10 Thread GitBox
HyukjinKwon commented on code in PR #38612: URL: https://github.com/apache/spark/pull/38612#discussion_r1019838326 ## sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala: ## @@ -161,17 +166,23 @@ private[sql] object ArrowConverters extends Logging

[GitHub] [spark] cloud-fan commented on a diff in pull request #38604: [SPARK-41102][CONNECT] Merge SparkConnectPlanner and SparkConnectCommandPlanner

2022-11-10 Thread GitBox
cloud-fan commented on code in PR #38604: URL: https://github.com/apache/spark/pull/38604#discussion_r101983 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -50,9 +49,9 @@ class SparkConnectStreamHandler(respons

[GitHub] [spark] amaliujia commented on a diff in pull request #38595: [SPARK-41090][SQL] Fix view not found issue for `db_name.view_name`

2022-11-10 Thread GitBox
amaliujia commented on code in PR #38595: URL: https://github.com/apache/spark/pull/38595#discussion_r1019836218 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -3804,6 +3804,13 @@ class Dataset[T] private[sql]( } catch { case _: ParseException =

[GitHub] [spark] cloud-fan commented on a diff in pull request #38595: [SPARK-41090][SQL] Fix view not found issue for `db_name.view_name`

2022-11-10 Thread GitBox
cloud-fan commented on code in PR #38595: URL: https://github.com/apache/spark/pull/38595#discussion_r1019835252 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -3804,6 +3804,13 @@ class Dataset[T] private[sql]( } catch { case _: ParseException =

[GitHub] [spark] cloud-fan commented on a diff in pull request #38595: [SPARK-41090][SQL] Fix view not found issue for `db_name.view_name`

2022-11-10 Thread GitBox
cloud-fan commented on code in PR #38595: URL: https://github.com/apache/spark/pull/38595#discussion_r1019835019 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -3804,6 +3804,13 @@ class Dataset[T] private[sql]( } catch { case _: ParseException =

[GitHub] [spark] HyukjinKwon commented on pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-10 Thread GitBox
HyukjinKwon commented on PR #38613: URL: https://github.com/apache/spark/pull/38613#issuecomment-1311232973 It collects all results first because of the synced `runJob`, which waits for all results to arrive.

[GitHub] [spark] amaliujia commented on a diff in pull request #38595: [SPARK-41090][SQL] Fix view not found issue for `db_name.view_name`

2022-11-10 Thread GitBox
amaliujia commented on code in PR #38595: URL: https://github.com/apache/spark/pull/38595#discussion_r1019834165 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -1135,21 +1135,27 @@ class DatasetSuite extends QueryTest } test("createTempView") {

[GitHub] [spark] cloud-fan commented on pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-10 Thread GitBox
cloud-fan commented on PR #38613: URL: https://github.com/apache/spark/pull/38613#issuecomment-1311232475 > Previously, it actually waits until all results are stored all first Really? I think the best case is also sending partitions one by one. Anyway, this PR looks good as it

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-10 Thread GitBox
zhengruifeng commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1019833916 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -184,9 +158,30 @@ class SparkConnectStreamHandler(r

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-10 Thread GitBox
HyukjinKwon commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1019833127 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -184,9 +158,30 @@ class SparkConnectStreamHandler(re

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-10 Thread GitBox
HyukjinKwon commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1019831985 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -184,9 +158,30 @@ class SparkConnectStreamHandler(re

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-10 Thread GitBox
zhengruifeng commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1019830407 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -184,9 +158,30 @@ class SparkConnectStreamHandler(r

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-10 Thread GitBox
zhengruifeng commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1019829685 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -184,9 +158,30 @@ class SparkConnectStreamHandler(r

[GitHub] [spark] amaliujia commented on a diff in pull request #38607: [SPARK-40938][CONNECT][PYTHON][FOLLOW-UP] Fix SubqueryAlias without the child plan when constructing Connect proto in the Python

2022-11-10 Thread GitBox
amaliujia commented on code in PR #38607: URL: https://github.com/apache/spark/pull/38607#discussion_r1019829371 ## python/pyspark/sql/connect/plan.py: ## @@ -712,6 +712,8 @@ def __init__(self, child: Optional["LogicalPlan"], alias: str) -> None: def plan(self, session:

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-10 Thread GitBox
HyukjinKwon commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1019826331 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -56,7 +56,7 @@ class SparkConnectStreamHandler(respo

[GitHub] [spark] zhengruifeng commented on pull request #38614: [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob`

2022-11-10 Thread GitBox
zhengruifeng commented on PR #38614: URL: https://github.com/apache/spark/pull/38614#issuecomment-1311222793 thanks @HyukjinKwon for pointing it out. also cc @hvanhovell

[GitHub] [spark] zhengruifeng opened a new pull request, #38614: [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob`

2022-11-10 Thread GitBox
zhengruifeng opened a new pull request, #38614: URL: https://github.com/apache/spark/pull/38614 ### What changes were proposed in this pull request? use `submitJob` instead of `runJob` ### Why are the changes needed? `spark.sparkContext.runJob` is blocked until finishes all p
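
The behavioural difference this follow-up relies on can be sketched against the public SparkContext API (illustrative only, not the Connect server code from the PR): `runJob` blocks until every partition has finished, while `submitJob` returns a `FutureAction` and fires the result handler as each partition completes, so results can be consumed incrementally.

```scala
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import org.apache.spark.sql.SparkSession

object SubmitJobSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("submitJob-sketch")
      .getOrCreate()
    val sc = spark.sparkContext
    val rdd = sc.parallelize(0 until 100, numSlices = 4)

    // Blocking: returns only once all four partitions have been handled.
    sc.runJob(
      rdd,
      (iter: Iterator[Int]) => iter.sum,
      rdd.partitions.indices,
      (partitionId: Int, sum: Int) => println(s"runJob: partition $partitionId -> $sum"))

    // Non-blocking: the handler is invoked as each partition completes,
    // and the returned FutureAction can be awaited separately.
    val future = sc.submitJob(
      rdd,
      (iter: Iterator[Int]) => iter.sum,
      rdd.partitions.indices,
      (partitionId: Int, sum: Int) => println(s"submitJob: partition $partitionId -> $sum"),
      ())
    Await.result(future, Duration.Inf)

    spark.stop()
  }
}
```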

[GitHub] [spark] cloud-fan commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-10 Thread GitBox
cloud-fan commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1019825048 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -56,7 +56,7 @@ class SparkConnectStreamHandler(respons

[GitHub] [spark] amaliujia commented on a diff in pull request #38604: [SPARK-41102][CONNECT] Merge SparkConnectPlanner and SparkConnectCommandPlanner

2022-11-10 Thread GitBox
amaliujia commented on code in PR #38604: URL: https://github.com/apache/spark/pull/38604#discussion_r1019820897 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -39,14 +46,17 @@ final case class InvalidPlanInput( pri

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38468: [SPARK-41005][CONNECT][PYTHON] Arrow-based collect

2022-11-10 Thread GitBox
HyukjinKwon commented on code in PR #38468: URL: https://github.com/apache/spark/pull/38468#discussion_r1019820867 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -114,10 +120,93 @@ class SparkConnectStreamHandler(r

[GitHub] [spark] amaliujia commented on a diff in pull request #38604: [SPARK-41102][CONNECT] Merge SparkConnectPlanner and SparkConnectCommandPlanner

2022-11-10 Thread GitBox
amaliujia commented on code in PR #38604: URL: https://github.com/apache/spark/pull/38604#discussion_r1019820702 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -50,9 +49,9 @@ class SparkConnectStreamHandler(respons

[GitHub] [spark] HyukjinKwon opened a new pull request, #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-10 Thread GitBox
HyukjinKwon opened a new pull request, #38613: URL: https://github.com/apache/spark/pull/38613 ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/38468 that proposes to remove notify-wait approach, and introduce a new way

[GitHub] [spark] amaliujia commented on a diff in pull request #38595: [SPARK-41090][SQL] Fix view not found issue for `db_name.view_name`

2022-11-10 Thread GitBox
amaliujia commented on code in PR #38595: URL: https://github.com/apache/spark/pull/38595#discussion_r1019816022 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -1135,21 +1135,27 @@ class DatasetSuite extends QueryTest } test("createTempView") {

[GitHub] [spark] amaliujia commented on pull request #38606: [SPARK-41105][CONNECT] Adopt `optional` keyword from proto3 which offers `hasXXX` to differentiate if a field is set or unset

2022-11-10 Thread GitBox
amaliujia commented on PR #38606: URL: https://github.com/apache/spark/pull/38606#issuecomment-1311202518 @cloud-fan We need a bit more discussion on when to use `optional`. Right now the most obvious usage is to replace those `message` wrappers. One example is, if a field is r

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38612: [SPARK-41108][CONNECT] Control the max size of arrow batch

2022-11-10 Thread GitBox
zhengruifeng commented on code in PR #38612: URL: https://github.com/apache/spark/pull/38612#discussion_r1019806473 ## sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala: ## @@ -161,17 +166,23 @@ private[sql] object ArrowConverters extends Logging

[GitHub] [spark] LuciferYang commented on pull request #38609: [WIP][SPARK-40593][CONNECT] Add profile to make user can specify custom `protocExecutable` and `pluginExecutable` when building connect m

2022-11-10 Thread GitBox
LuciferYang commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1311196281 Let me finish the sbt part first

[GitHub] [spark] LuciferYang commented on pull request #38609: [WIP][SPARK-40593][CONNECT] Add profile to make user can specify custom `protocExecutable` and `pluginExecutable` when building connect m

2022-11-10 Thread GitBox
LuciferYang commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1311194812 Users need to manually compile `protoc-xxx-linux-x86_64.exe` and `protoc-gen-grpc-java-1.47.0-linux-x86_64.exe` so that they are executable on `CentOS 6 & CentOS 7`, or pre-install the library t

[GitHub] [spark] zhengruifeng opened a new pull request, #38612: [SPARK-41108][CONNECT] Control the max size of arrow batch

2022-11-10 Thread GitBox
zhengruifeng opened a new pull request, #38612: URL: https://github.com/apache/spark/pull/38612 ### What changes were proposed in this pull request? Control the max size of arrow batch ### Why are the changes needed? as per the suggestion https://github.com/apache/sp

[GitHub] [spark] zhengchenyu commented on pull request #37949: [SPARK-40504][YARN] Make yarn appmaster load config from client

2022-11-10 Thread GitBox
zhengchenyu commented on PR #37949: URL: https://github.com/apache/spark/pull/37949#issuecomment-1311193343 @xkrogen Thanks for your review. In our cluster, YARN_CONF_DIR is the same as HADOOP_CONF_DIR. I added some key information about the failed application. ``` # some key i

[GitHub] [spark] yabola commented on pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service

2022-11-10 Thread GitBox
yabola commented on PR #38560: URL: https://github.com/apache/spark/pull/38560#issuecomment-1311193090 My latest implementation no longer passes reduceIds from the driver. There are still some code style improvements to make; this is just a rough implementation for now.

[GitHub] [spark] pan3793 commented on pull request #38596: [SPARK-41093][BUILD] Remove netty-tcnative-classes from Spark dependencyList

2022-11-10 Thread GitBox
pan3793 commented on PR #38596: URL: https://github.com/apache/spark/pull/38596#issuecomment-1311188998 This patch is only suitable for master. - branch-3.2 and earlier use the fat netty-all, no issue; - branch-3.3 depends on netty 4.1.74, which claims `netty-tcnative-classes` as compi

[GitHub] [spark] xinrong-meng opened a new pull request, #38611: [SPARK-41107] Install memory-profiler in the CI

2022-11-10 Thread GitBox
xinrong-meng opened a new pull request, #38611: URL: https://github.com/apache/spark/pull/38611 ### What changes were proposed in this pull request? Install [memory-profiler](https://pypi.org/project/memory-profiler/) in CI in order to enable memory profiling tests. ### Why are the

[GitHub] [spark] HyukjinKwon commented on pull request #38599: [SPARK-41063][BUILD] Clean all except files in Git repository before running Mima

2022-11-10 Thread GitBox
HyukjinKwon commented on PR #38599: URL: https://github.com/apache/spark/pull/38599#issuecomment-1311184802 Sorry actually I am reverting this. Seems like it's related .. surprisingly ..

[GitHub] [spark] vinodkc commented on pull request #38608: [SPARK-41080][SQL] Support Bit manipulation function SETBIT

2022-11-10 Thread GitBox
vinodkc commented on PR #38608: URL: https://github.com/apache/spark/pull/38608#issuecomment-1311184404 CC @cloud-fan, @HyukjinKwon

[GitHub] [spark] zhengruifeng commented on pull request #38546: [SPARK-41036][CONNECT][PYTHON] `columns` API should use `schema` API to avoid data fetching

2022-11-10 Thread GitBox
zhengruifeng commented on PR #38546: URL: https://github.com/apache/spark/pull/38546#issuecomment-1311182989 merged into master

[GitHub] [spark] zhengruifeng closed pull request #38546: [SPARK-41036][CONNECT][PYTHON] `columns` API should use `schema` API to avoid data fetching

2022-11-10 Thread GitBox
zhengruifeng closed pull request #38546: [SPARK-41036][CONNECT][PYTHON] `columns` API should use `schema` API to avoid data fetching URL: https://github.com/apache/spark/pull/38546

[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38603: [SPARK-41101][PYTHON][PROTOBUF] Message classname support for PYSPARK-PROTOBUF

2022-11-10 Thread GitBox
SandishKumarHN commented on code in PR #38603: URL: https://github.com/apache/spark/pull/38603#discussion_r1019796098 ## python/pyspark/sql/protobuf/functions.py: ## @@ -49,7 +49,10 @@ def from_protobuf( data : :class:`~pyspark.sql.Column` or str the binary column.

[GitHub] [spark] HyukjinKwon commented on pull request #38609: [WIP][SPARK-40593][CONNECT] Add profile to make user can specify custom `protocExecutable` and `pluginExecutable` when building connect m

2022-11-10 Thread GitBox
HyukjinKwon commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1311181590 How do we get the user-defined protobuf executables for `CONNECT_PROTOC_EXEC_PATH` and `CONNECT_PLUGIN_EXEC_PATH` in CentOS 6 and 7? If this is the only way, I am fine but we should p

[GitHub] [spark] cloud-fan commented on a diff in pull request #38604: [SPARK-41102][CONNECT][REFACTORING] Merge SparkConnectPlanner and SparkConnectCommandPlanner

2022-11-10 Thread GitBox
cloud-fan commented on code in PR #38604: URL: https://github.com/apache/spark/pull/38604#discussion_r1019794236 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -50,9 +49,9 @@ class SparkConnectStreamHandler(respons

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38607: [SPARK-40938][CONNECT][PYTHON][FOLLOW-UP] Fix SubqueryAlias without the child plan when constructing Connect proto in the Pytho

2022-11-10 Thread GitBox
HyukjinKwon commented on code in PR #38607: URL: https://github.com/apache/spark/pull/38607#discussion_r1019793828 ## python/pyspark/sql/connect/plan.py: ## @@ -712,6 +712,8 @@ def __init__(self, child: Optional["LogicalPlan"], alias: str) -> None: def plan(self, session

[GitHub] [spark] cloud-fan commented on a diff in pull request #38604: [SPARK-41102][CONNECT][REFACTORING] Merge SparkConnectPlanner and SparkConnectCommandPlanner

2022-11-10 Thread GitBox
cloud-fan commented on code in PR #38604: URL: https://github.com/apache/spark/pull/38604#discussion_r1019793647 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -39,14 +46,17 @@ final case class InvalidPlanInput( pri

[GitHub] [spark] yabola commented on pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service

2022-11-10 Thread GitBox
yabola commented on PR #38560: URL: https://github.com/apache/spark/pull/38560#issuecomment-1311176907 @mridulm Yes, these two issues are similar. @wankunde Can I continue editing my PR in this issue?

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38603: [SPARK-41101][PYTHON][PROTOBUF] Message classname support for PYSPARK-PROTOBUF

2022-11-10 Thread GitBox
HyukjinKwon commented on code in PR #38603: URL: https://github.com/apache/spark/pull/38603#discussion_r1019792541 ## python/pyspark/sql/protobuf/functions.py: ## @@ -49,7 +49,10 @@ def from_protobuf( data : :class:`~pyspark.sql.Column` or str the binary column.

[GitHub] [spark] cloud-fan commented on pull request #38606: [SPARK-41105][CONNECT] Adopt `optional` keyword from proto3 which offers `hasXXX` to differentiate if a field is set or unset

2022-11-10 Thread GitBox
cloud-fan commented on PR #38606: URL: https://github.com/apache/spark/pull/38606#issuecomment-1311175860 There are still some fields that are documented as optional but don't use the `optional` keyword. Do we need to change them?
