[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM

2023-06-23 Thread via GitHub


HyukjinKwon commented on code in PR #41711:
URL: https://github.com/apache/spark/pull/41711#discussion_r1240603584


##
dev/error_message_refiner.py:
##
@@ -0,0 +1,235 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Utility for refining error messages based on LLM.
+
+Usage:
+python error_message_refiner.py <error_class_name> [--gpt_version=<version>]
+
+Arguments:
+<error_class_name>  Required.
+                    The name of the error class to refine the messages for.
+                    The list of error classes is located in
+                    `core/src/main/resources/error/error-classes.json`.
+
+Options:
+--gpt_version=<version>  Optional.
+                         The version of ChatGPT to use for refining the error messages.
+                         If not provided, the default version ("gpt-3.5-turbo") will be used.
+
+Example usage:
+python error_message_refiner.py CANNOT_DECODE_URL --gpt_version=gpt-4
+
+Description:
+This script refines error messages using the LLM-based approach.
+It takes the name of the error class as a required argument and, optionally,
+allows specifying the version of ChatGPT to use for refining the messages.
+
+Options:
+--gpt_version: Specifies the version of ChatGPT.
+               If not provided, the default version ("gpt-3.5-turbo") will be used.
+
+Note:
+- Ensure that the necessary dependencies are installed before running the script.
+- Ensure that a valid API key is entered in `api-key.txt`.
+- The refined error messages will be displayed in the console output.
+- To use the gpt-4 model, you need to join the waitlist. Please refer to
+  https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4 for more details.
+"""
+
+import argparse
+import json
+import openai
+import re
+import subprocess
+import random
+from typing import Tuple, Optional

Review Comment:
   add a newline after this.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM

2023-06-23 Thread via GitHub


HyukjinKwon commented on code in PR #41711:
URL: https://github.com/apache/spark/pull/41711#discussion_r1240603499


##
dev/api_key.txt:
##
@@ -0,0 +1 @@
+# Please REMOVE this comment and enter the API key here. You can obtain an API key from `https://platform.openai.com/account/api-keys`.

Review Comment:
   Can we use an env variable instead?






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM

2023-06-23 Thread via GitHub


HyukjinKwon commented on code in PR #41711:
URL: https://github.com/apache/spark/pull/41711#discussion_r1240603443


##
dev/error_message_refiner.py:
##
@@ -0,0 +1,235 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Utility for refining error messages based on LLM.
+
+Usage:
+python error_message_refiner.py <error_class_name> [--gpt_version=<version>]
+
+Arguments:
+<error_class_name>  Required.
+                    The name of the error class to refine the messages for.
+                    The list of error classes is located in
+                    `core/src/main/resources/error/error-classes.json`.
+
+Options:
+--gpt_version=<version>  Optional.
+                         The version of ChatGPT to use for refining the error messages.
+                         If not provided, the default version ("gpt-3.5-turbo") will be used.
+
+Example usage:
+python error_message_refiner.py CANNOT_DECODE_URL --gpt_version=gpt-4
+
+Description:
+This script refines error messages using the LLM-based approach.
+It takes the name of the error class as a required argument and, optionally,
+allows specifying the version of ChatGPT to use for refining the messages.
+
+Options:
+--gpt_version: Specifies the version of ChatGPT.
+               If not provided, the default version ("gpt-3.5-turbo") will be used.
+
+Note:
+- Ensure that the necessary dependencies are installed before running the script.

Review Comment:
   What are the dependencies?






[GitHub] [spark] bersprockets commented on a diff in pull request #41712: [SPARK-44132][SQL] join using Stream of column name fails codegen

2023-06-23 Thread via GitHub


bersprockets commented on code in PR #41712:
URL: https://github.com/apache/spark/pull/41712#discussion_r1240569151


##
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##
@@ -1685,4 +1685,24 @@ class JoinSuite extends QueryTest with 
SharedSparkSession with AdaptiveSparkPlan
   checkAnswer(sql(query), expected)
 }
   }
+
+  test("SPARK-44132: FULL OUTER JOIN by streamed column name fails with NPE") {

Review Comment:
   Did you manually test your original Java reproduction case to ensure that 
your fix also fixes that case?






[GitHub] [spark] itholic commented on a diff in pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM

2023-06-23 Thread via GitHub


itholic commented on code in PR #41711:
URL: https://github.com/apache/spark/pull/41711#discussion_r1240578537


##
dev/error_message_refiner.py:
##
@@ -0,0 +1,219 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Utility for refining error messages based on LLM.
+
+Usage:
+python error_message_refiner.py <error_class_name> [--gpt_version=<version>]
+
+Arguments:
+<error_class_name>  Required.
+                    The name of the error class to refine the messages for.
+                    The list of error classes is located in
+                    `core/src/main/resources/error/error-classes.json`.
+
+Options:
+--gpt_version=<version>  Optional.
+                         The version of ChatGPT to use for refining the error messages.
+                         If not provided, the default version ("gpt-3.5-turbo") will be used.
+
+Example usage:
+python error_message_refiner.py CANNOT_DECODE_URL --gpt_version=gpt-4
+
+Description:
+This script refines error messages using the LLM-based approach.
+It takes the name of the error class as a required argument and, optionally,
+allows specifying the version of ChatGPT to use for refining the messages.
+
+Options:
+--gpt_version: Specifies the version of ChatGPT.
+               If not provided, the default version ("gpt-3.5-turbo") will be used.
+
+Note:
+- Ensure that the necessary dependencies are installed before running the script.
+- Ensure that a valid API key is entered in `api-key.txt`.
+- The refined error messages will be displayed in the console output.
+- To use the gpt-4 model, you need to join the waitlist. Please refer to
+  https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4 for more details.
+"""
+
+import argparse
+import json
+import openai
+import re
+import subprocess
+import random
+from typing import Tuple, Optional
+from sparktestsupport import SPARK_HOME
+
+PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/core/src/main/resources/error/error-classes.json"
+PATH_TO_API_KEY = f"{SPARK_HOME}/dev/api_key.txt"
+
+# You can obtain an API key from https://platform.openai.com/account/api-keys
+openai.api_key = open(PATH_TO_API_KEY).read().rstrip("\n")
+
+
+def _git_grep_files(search_string: str, exclude: str = None) -> str:

Review Comment:
   Awesome. Just added the comments, thanks!






[GitHub] [spark] szehon-ho commented on a diff in pull request #41614: [SPARK-44060][SQL] Code-gen for build side outer shuffled hash join

2023-06-23 Thread via GitHub


szehon-ho commented on code in PR #41614:
URL: https://github.com/apache/spark/pull/41614#discussion_r1240571397


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##
@@ -578,7 +594,9 @@ case class ShuffledHashJoinExec(
  |
  |  if (!$foundMatch) {
  |$buildRow = null;
- |$consumeFullOuterJoinRow();
+ |if ($isFullOuterJoin) {
+ |  $consumeFullOuterJoinRow();
+ |}

Review Comment:
   I ended up pulling this into a separate if block, leaving `buildRow = null` in both cases if `!foundMatch`. I was thinking it's cleaner to clear the state, and more in line with the uniqueKey side. Let me know what you think.
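
   For illustration, a minimal Scala sketch of the restructured fragment under discussion (the codegen term names here are assumptions mirroring the diff, not the actual `ShuffledHashJoinExec` implementation):

   ```scala
   object CodegenFragmentSketch {
     // Hypothetical codegen term names, mirroring the diff above.
     val foundMatch = "foundMatch"
     val buildRow = "buildRow"
     val isFullOuterJoin = "isFullOuterJoin"
     val consumeFullOuterJoinRow = "consumeFullOuterJoinRow"

     // Generated Java fragment: always clear the build-side row when no match
     // was found, but emit the unmatched row only for FULL OUTER joins.
     val fragment: String =
       s"""
          |if (!$foundMatch) {
          |  $buildRow = null;
          |  if ($isFullOuterJoin) {
          |    $consumeFullOuterJoinRow();
          |  }
          |}
        """.stripMargin
   }
   ```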









[GitHub] [spark] dongjoon-hyun commented on pull request #36374: [SPARK-39006][K8S] Show a directional error message for executor PVC dynamic allocation failure

2023-06-23 Thread via GitHub


dongjoon-hyun commented on PR #36374:
URL: https://github.com/apache/spark/pull/36374#issuecomment-1605232031

   Thank you for confirming. Apache Spark 3.4.1 has been officially released.
   - https://lists.apache.org/list.html?d...@spark.apache.org
   - https://spark.apache.org/downloads.html





[GitHub] [spark] szehon-ho commented on a diff in pull request #41614: [SPARK-44060][SQL] Code-gen for build side outer shuffled hash join

2023-06-23 Thread via GitHub


szehon-ho commented on code in PR #41614:
URL: https://github.com/apache/spark/pull/41614#discussion_r1240569908


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##
@@ -578,7 +594,9 @@ case class ShuffledHashJoinExec(
  |
  |  if (!$foundMatch) {
  |$buildRow = null;
- |$consumeFullOuterJoinRow();
+ |if ($isFullOuterJoin) {
+ |  $consumeFullOuterJoinRow();
+ |}

Review Comment:
   I was afraid of side effects if I don't set buildRow to null when it's not a fullOuterJoin. Let me check.









[GitHub] [spark] bersprockets commented on pull request #41712: [SPARK-44132][SQL] join using Stream of column name fails codegen

2023-06-23 Thread via GitHub


bersprockets commented on PR #41712:
URL: https://github.com/apache/spark/pull/41712#issuecomment-1605228576

   For the title, may I suggest:
   ```
   [SPARK-44132][SQL] Materialize `Stream` of join column names to avoid 
codegen failure
   ```
   For the description, may I suggest:
   
   ### What changes were proposed in this pull request?
   
   Materialize passed join columns as an `IndexedSeq` before passing it to the 
lower layers.
   
   ### Why are the changes needed?
   
   When nesting multiple full outer joins using column names which are a `Stream`, the code generator will generate faulty code, resulting in an NPE or a bad `UnsafeRow` access at runtime. See the two added test cases.
   
   etc.. (the rest is good)
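
   For readers following along, a minimal standalone reproduction sketch (hypothetical code, assuming a local `SparkSession`; the fix materializes the lazy `Stream` before it reaches the lower layers):

   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical reproduction of SPARK-44132: joining with a lazy Stream of
   // column names produces faulty generated code at runtime.
   object StreamJoinRepro {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[*]").getOrCreate()
       import spark.implicits._

       val df1 = Seq((1, "a"), (2, "b")).toDF("id", "v1")
       val df2 = Seq((1, "x"), (3, "y")).toDF("id", "v2")
       val df3 = Seq((1, "p"), (4, "q")).toDF("id", "v3")

       // Nested full outer joins keyed by a lazily evaluated Stream.
       val joinCols: Seq[String] = Stream("id")
       df1.join(df2, joinCols, "full_outer")
         .join(df3, joinCols, "full_outer")
         .show()

       // The proposed fix amounts to passing joinCols.toIndexedSeq instead.
       spark.stop()
     }
   }
   ```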
   
   
   
   
   





[GitHub] [spark] szehon-ho commented on a diff in pull request #41614: [SPARK-44060][SQL] Code-gen for build side outer shuffled hash join

2023-06-23 Thread via GitHub


szehon-ho commented on code in PR #41614:
URL: https://github.com/apache/spark/pull/41614#discussion_r1240567843


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##
@@ -364,7 +366,13 @@ case class ShuffledHashJoinExec(
   override def doProduce(ctx: CodegenContext): String = {
 // Specialize `doProduce` code for full outer join, because full outer 
join needs to
 // iterate streamed and build side separately.

Review Comment:
   Good catch






[GitHub] [spark] amaliujia commented on pull request #41716: [SPARK-44164][SQL] Extract toAttribute method from StructField to Util class

2023-06-23 Thread via GitHub


amaliujia commented on PR #41716:
URL: https://github.com/apache/spark/pull/41716#issuecomment-1605197640

   @hvanhovell @cloud-fan 





[GitHub] [spark] amaliujia opened a new pull request, #41716: [SPARK-44164][SQL] Extract toAttribute method from StructField to Util class

2023-06-23 Thread via GitHub


amaliujia opened a new pull request, #41716:
URL: https://github.com/apache/spark/pull/41716

   
   
   ### What changes were proposed in this pull request?
   
   Extract toAttribute method from StructField to Util class. 
   
   ### Why are the changes needed?
   
   StructField should be moved to `sql/api`, so `toAttribute` should be kept in Catalyst.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   existing tests. 
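
   As a rough illustration, the extraction described above might look like the following hypothetical sketch (`StructFieldUtils` is an assumed name, not necessarily the PR's actual code):

   ```scala
   import org.apache.spark.sql.catalyst.expressions.AttributeReference
   import org.apache.spark.sql.types.StructField

   // Hypothetical helper: the Catalyst-specific conversion lives in a util
   // object, so StructField itself no longer depends on Catalyst expressions
   // and can move to the sql/api module.
   object StructFieldUtils {
     def toAttribute(field: StructField): AttributeReference =
       AttributeReference(field.name, field.dataType, field.nullable, field.metadata)()
   }
   ```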





[GitHub] [spark] github-actions[bot] closed pull request #38202: [SPARK-40763][K8S] Should expose driver service name to config for user features

2023-06-23 Thread via GitHub


github-actions[bot] closed pull request #38202: [SPARK-40763][K8S] Should 
expose driver service name to config for user features
URL: https://github.com/apache/spark/pull/38202





[GitHub] [spark] github-actions[bot] commented on pull request #38885: [WIP][SPARK-41367][SQL] Enable V2 file tables in read paths in session catalog

2023-06-23 Thread via GitHub


github-actions[bot] commented on PR #38885:
URL: https://github.com/apache/spark/pull/38885#issuecomment-1605183251

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





[GitHub] [spark] github-actions[bot] closed pull request #39985: [SPARK-42412][WIP] Initial prototype implementation of PySpark ML via Spark connect

2023-06-23 Thread via GitHub


github-actions[bot] closed pull request #39985: [SPARK-42412][WIP] Initial 
prototype implementation of PySpark ML via Spark connect
URL: https://github.com/apache/spark/pull/39985





[GitHub] [spark] github-actions[bot] closed pull request #40391: [SPARK-42766][YARN] YarnAllocator filter excluded nodes when launching containers

2023-06-23 Thread via GitHub


github-actions[bot] closed pull request #40391: [SPARK-42766][YARN] 
YarnAllocator filter excluded nodes when launching containers
URL: https://github.com/apache/spark/pull/40391





[GitHub] [spark] github-actions[bot] closed pull request #40122: [SPARK-42349][PYTHON] Support pandas cogroup with multiple df

2023-06-23 Thread via GitHub


github-actions[bot] closed pull request #40122: [SPARK-42349][PYTHON] Support 
pandas cogroup with multiple df
URL: https://github.com/apache/spark/pull/40122





[GitHub] [spark] github-actions[bot] closed pull request #40411: [SPARK-42781][DOCS][PYTHON] provide one format for writing to kafka

2023-06-23 Thread via GitHub


github-actions[bot] closed pull request #40411: [SPARK-42781][DOCS][PYTHON] 
provide one format for writing  to kafka
URL: https://github.com/apache/spark/pull/40411





[GitHub] [spark] github-actions[bot] closed pull request #40419: [SPARK-42789][SQL] Rewrite multiple GetJsonObjects to a JsonTuple if their json expressions are the same

2023-06-23 Thread via GitHub


github-actions[bot] closed pull request #40419: [SPARK-42789][SQL] Rewrite 
multiple GetJsonObjects to a JsonTuple if their json expressions are the same
URL: https://github.com/apache/spark/pull/40419





[GitHub] [spark] dongjoon-hyun closed pull request #41715: [SPARK-44163][PYTHON] Handle `ModuleNotFoundError` in addition to `ImportError`

2023-06-23 Thread via GitHub


dongjoon-hyun closed pull request #41715: [SPARK-44163][PYTHON] Handle 
`ModuleNotFoundError` in addition to `ImportError`
URL: https://github.com/apache/spark/pull/41715





[GitHub] [spark] dtenedor commented on pull request #41486: [SPARK-43986][SQL] Create error classes for HyperLogLog function call failures

2023-06-23 Thread via GitHub


dtenedor commented on PR #41486:
URL: https://github.com/apache/spark/pull/41486#issuecomment-1605060822

   Hi @MaxGekk can we trouble you to help us merge this please when you have a 
moment :)





[GitHub] [spark] dongjoon-hyun opened a new pull request, #41715: [SPARK-44163][PYTHON] Handle `ModuleNotFoundError` in addition to `ImportError`

2023-06-23 Thread via GitHub


dongjoon-hyun opened a new pull request, #41715:
URL: https://github.com/apache/spark/pull/41715

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] dtenedor commented on a diff in pull request #41486: [SPARK-43986][SQL] Create error classes for HyperLogLog function call failures

2023-06-23 Thread via GitHub


dtenedor commented on code in PR #41486:
URL: https://github.com/apache/spark/pull/41486#discussion_r1240463584


##
core/src/main/resources/error/error-classes.json:
##
@@ -757,6 +757,21 @@
   "The expression  cannot be used as a grouping expression 
because its data type  is not an orderable data type."
 ]
   },
+  "HLL_INVALID_INPUT_SKETCH_BUFFER" : {
+"message" : [
+  "Invalid call to ; only valid HLL sketch buffers are supported 
as inputs (such as those produced by the HLL_SKETCH_AGG function)."

Review Comment:
   Done.






[GitHub] [spark] dtenedor commented on a diff in pull request #41486: [SPARK-43986][SQL] Create error classes for HyperLogLog function call failures

2023-06-23 Thread via GitHub


dtenedor commented on code in PR #41486:
URL: https://github.com/apache/spark/pull/41486#discussion_r1240463321


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datasketchesExpressions.scala:
##
@@ -98,15 +104,22 @@ case class HllUnion(first: Expression, second: Expression, 
third: Expression)
   override def dataType: DataType = BinaryType
 
   override def nullSafeEval(value1: Any, value2: Any, value3: Any): Any = {
-val sketch1 = HllSketch.heapify(Memory.wrap(value1.asInstanceOf[Array[Byte]]))
-val sketch2 = HllSketch.heapify(Memory.wrap(value2.asInstanceOf[Array[Byte]]))
+val sketch1 = try {
+  HllSketch.heapify(Memory.wrap(value1.asInstanceOf[Array[Byte]]))
+} catch {
+  case _: java.lang.Error =>
+throw QueryExecutionErrors.hllInvalidInputSketchBuffer(prettyName)
+}
+val sketch2 = try {
+  HllSketch.heapify(Memory.wrap(value2.asInstanceOf[Array[Byte]]))
+} catch {
+  case _: java.lang.Error =>
+throw QueryExecutionErrors.hllInvalidInputSketchBuffer(prettyName)
+}
 val allowDifferentLgConfigK = value3.asInstanceOf[Boolean]
 if (!allowDifferentLgConfigK && sketch1.getLgConfigK != 
sketch2.getLgConfigK) {
-  throw new UnsupportedOperationException(
-"Sketches have different lgConfigK values: " +
-s"${sketch1.getLgConfigK} and ${sketch2.getLgConfigK}. " +
-"Set allowDifferentLgConfigK to true to enable unions of " +
-"different lgConfigK values.")
+  throw QueryExecutionErrors.hllUnionDifferentLgK(
+sketch1.getLgConfigK, sketch2.getLgConfigK, function = "HLL_UNION")

Review Comment:
   No worries, fixed!






[GitHub] [spark] dtenedor commented on a diff in pull request #41486: [SPARK-43986][SQL] Create error classes for HyperLogLog function call failures

2023-06-23 Thread via GitHub


dtenedor commented on code in PR #41486:
URL: https://github.com/apache/spark/pull/41486#discussion_r1240463194


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala:
##
@@ -189,21 +189,20 @@ object HllSketchAgg {
   private val minLgConfigK = 4
   private val maxLgConfigK = 21
 
-  // Replicate Datasketche's HllUtil's checkLgK implementation, as we can't 
reference it directly
+  // Replicate Datasketches' HllUtil's checkLgK implementation, as we can't 
reference it directly.
   def checkLgK(lgConfigK: Int): Unit = {
 if (lgConfigK < minLgConfigK || lgConfigK > maxLgConfigK) {
-  throw new SketchesArgumentException(
-s"Log K must be between $minLgConfigK and $maxLgConfigK, inclusive: " 
+ lgConfigK)
+  throw QueryExecutionErrors.hllInvalidLgK(function = "HLL_SKETCH_AGG",

Review Comment:
   Done.






[GitHub] [spark] mkaravel commented on a diff in pull request #41486: [SPARK-43986][SQL] Create error classes for HyperLogLog function call failures

2023-06-23 Thread via GitHub


mkaravel commented on code in PR #41486:
URL: https://github.com/apache/spark/pull/41486#discussion_r1240413693


##
core/src/main/resources/error/error-classes.json:
##
@@ -757,6 +757,21 @@
   "The expression  cannot be used as a grouping expression 
because its data type  is not an orderable data type."
 ]
   },
+  "HLL_INVALID_INPUT_SKETCH_BUFFER" : {
+"message" : [
+  "Invalid call to ; only valid HLL sketch buffers are supported 
as inputs (such as those produced by the HLL_SKETCH_AGG function)."

Review Comment:
   ```suggestion
 "Invalid call to <function>; only valid HLL sketch buffers are supported as inputs (such as those produced by the `hll_sketch_agg` expression)."
   ```
   The function/expression names are all lowercase (see the `prettyName` string for the expressions).



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala:
##
@@ -189,21 +189,20 @@ object HllSketchAgg {
   private val minLgConfigK = 4
   private val maxLgConfigK = 21
 
-  // Replicate Datasketche's HllUtil's checkLgK implementation, as we can't 
reference it directly
+  // Replicate Datasketches' HllUtil's checkLgK implementation, as we can't 
reference it directly.
   def checkLgK(lgConfigK: Int): Unit = {
 if (lgConfigK < minLgConfigK || lgConfigK > maxLgConfigK) {
-  throw new SketchesArgumentException(
-s"Log K must be between $minLgConfigK and $maxLgConfigK, inclusive: " 
+ lgConfigK)
+  throw QueryExecutionErrors.hllInvalidLgK(function = "HLL_SKETCH_AGG",

Review Comment:
   ```suggestion
 throw QueryExecutionErrors.hllInvalidLgK(function = "hll_sketch_agg",
   ```
   I missed this before (see also my other comment): we want to use the pretty 
name here, which is all lowercase.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datasketchesExpressions.scala:
##
@@ -98,15 +104,22 @@ case class HllUnion(first: Expression, second: Expression, 
third: Expression)
   override def dataType: DataType = BinaryType
 
   override def nullSafeEval(value1: Any, value2: Any, value3: Any): Any = {
-val sketch1 = HllSketch.heapify(Memory.wrap(value1.asInstanceOf[Array[Byte]]))
-val sketch2 = HllSketch.heapify(Memory.wrap(value2.asInstanceOf[Array[Byte]]))
+val sketch1 = try {
+  HllSketch.heapify(Memory.wrap(value1.asInstanceOf[Array[Byte]]))
+} catch {
+  case _: java.lang.Error =>
+throw QueryExecutionErrors.hllInvalidInputSketchBuffer(prettyName)
+}
+val sketch2 = try {
+  HllSketch.heapify(Memory.wrap(value2.asInstanceOf[Array[Byte]]))
+} catch {
+  case _: java.lang.Error =>
+throw QueryExecutionErrors.hllInvalidInputSketchBuffer(prettyName)
+}
 val allowDifferentLgConfigK = value3.asInstanceOf[Boolean]
 if (!allowDifferentLgConfigK && sketch1.getLgConfigK != 
sketch2.getLgConfigK) {
-  throw new UnsupportedOperationException(
-"Sketches have different lgConfigK values: " +
-s"${sketch1.getLgConfigK} and ${sketch2.getLgConfigK}. " +
-"Set allowDifferentLgConfigK to true to enable unions of " +
-"different lgConfigK values.")
+  throw QueryExecutionErrors.hllUnionDifferentLgK(
+sketch1.getLgConfigK, sketch2.getLgConfigK, function = "HLL_UNION")

Review Comment:
   ```suggestion
   sketch1.getLgConfigK, sketch2.getLgConfigK, function = "hll_union")
   ```
   Again, apologies for missing this in the previous rounds. I think we need to 
use the pretty name which is all lowercase.






[GitHub] [spark] szehon-ho commented on pull request #41683: [SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL

2023-06-23 Thread via GitHub


szehon-ho commented on PR #41683:
URL: https://github.com/apache/spark/pull/41683#issuecomment-1604982731

   Hm, the build failure doesn't seem related:
   
   ```
   python/pyspark/mllib/clustering.py:781: error: Decorated property not 
supported  [misc]
   python/pyspark/mllib/clustering.py:949: error: Decorated property not 
supported  [misc]
   Found 199 errors in 24 files (checked 668 source files)
   1
   Error: Process completed with exit code 1.
   ```





[GitHub] [spark] amaliujia commented on pull request #41714: [SPARK-44160][SQL][CONNECT] Extract shared code from StructType

2023-06-23 Thread via GitHub


amaliujia commented on PR #41714:
URL: https://github.com/apache/spark/pull/41714#issuecomment-1604972107

   cc @hvanhovell @cloud-fan 





[GitHub] [spark] siying commented on pull request #41578: [SPARK-44044][SS] Improve Error message for Window functions with streaming

2023-06-23 Thread via GitHub


siying commented on PR #41578:
URL: https://github.com/apache/spark/pull/41578#issuecomment-1604966953

   @MaxGekk I addressed the comments. The CI failure is in the Python lint; I don't know why it happens, but I don't see how it is related.





[GitHub] [spark] amaliujia opened a new pull request, #41714: [SPARK-44160][SQL][CONNECT] Extract shared code from StructType

2023-06-23 Thread via GitHub


amaliujia opened a new pull request, #41714:
URL: https://github.com/apache/spark/pull/41714

   
   
   ### What changes were proposed in this pull request?
   
   StructType has some methods that require the Catalyst parser and Catalyst expressions. We are not planning to move the parser and expressions to the shared module, so we need to split the code to share as much code as possible between the Scala client and Catalyst.
   
   
   ### Why are the changes needed?
   
   To simplify the DataType interface as a whole, so that those types become the shared interface.
   
   ### Does this PR introduce _any_ user-facing change?
   
   This will break binary compatibility on StructType
   
   ### How was this patch tested?
   
   N/A





[GitHub] [spark] dongjoon-hyun closed pull request #41713: [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts`

2023-06-23 Thread via GitHub


dongjoon-hyun closed pull request #41713: [SPARK-44158][K8S] Remove unused 
`spark.kubernetes.executor.lostCheckmaxAttempts`
URL: https://github.com/apache/spark/pull/41713





[GitHub] [spark] dongjoon-hyun commented on pull request #41713: [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts`

2023-06-23 Thread via GitHub


dongjoon-hyun commented on PR #41713:
URL: https://github.com/apache/spark/pull/41713#issuecomment-1604941186

   Thank you so much! Merged to master/3.4/3.3.





[GitHub] [spark] vicennial commented on pull request #41701: [CONNECT][SPARK-44146] Isolate Spark Connect Session jars and classfiles

2023-06-23 Thread via GitHub


vicennial commented on PR #41701:
URL: https://github.com/apache/spark/pull/41701#issuecomment-1604932773

   The PR is ready for review (the failing tests are flaky; I've submitted a rerun request).





[GitHub] [spark] dongjoon-hyun commented on pull request #41713: [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts`

2023-06-23 Thread via GitHub


dongjoon-hyun commented on PR #41713:
URL: https://github.com/apache/spark/pull/41713#issuecomment-1604904486

   Could you review this PR when you have some time, @viirya ?





[GitHub] [spark] jdesjean commented on pull request #41443: [SPARK-43923][CONNECT] Post listenerBus events during ExecutePlanRequest

2023-06-23 Thread via GitHub


jdesjean commented on PR #41443:
URL: https://github.com/apache/spark/pull/41443#issuecomment-1604796450

   > > @beliefer the event API is open by design, it is made for others to use 
it. For example thriftserver uses it. This particular PR will be foundation on 
which we will be building a UI for connect.
   > 
   > Thank you for the explanation. Can existing event types be shared by 
thriftserver and connect?
   
   The Thrift events are flexible and semantically the same as the Connect events. It's technically possible to reuse them; however, there are some differences between the two implementations. I believe it's preferable to create a new set of events so that both projects (Thrift & Connect) can evolve separately.





[GitHub] [spark] dongjoon-hyun opened a new pull request, #41713: [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts`

2023-06-23 Thread via GitHub


dongjoon-hyun opened a new pull request, #41713:
URL: https://github.com/apache/spark/pull/41713

   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] amaliujia commented on a diff in pull request #41559: [SPARK-44030][SQL] Implement DataTypeExpression to offer Unapply for expression

2023-06-23 Thread via GitHub


amaliujia commented on code in PR #41559:
URL: https://github.com/apache/spark/pull/41559#discussion_r1240066813


##
sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeExpression.scala:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.types
+
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.errors.QueryExecutionErrors
+
+abstract class DataTypeExpression(val dataType: DataType) {
+  /**
+   * Enables matching against DataType for expressions:
+   * {{{
+   *   case Cast(child @ BinaryType(), StringType) =>
+   * ...
+   * }}}
+   */
+  private[sql] def unapply(e: Expression): Boolean = e.dataType == dataType
+}
+
+abstract class AtomicTypeExpression(override val dataType: DataType)
+  extends DataTypeExpression(dataType)
+
+abstract class DatetimeTypeExpression(override val dataType: DataType)
+  extends AtomicTypeExpression(dataType)
+
+case object BooleanTypeExpression extends DataTypeExpression(BooleanType)
+case object StringTypeExpression extends DataTypeExpression(StringType)
+case object TimestampTypeExpression extends DataTypeExpression(TimestampType)
+case object DateTypeExpression extends DataTypeExpression(DateType)
+case object ByteTypeExpression extends DataTypeExpression(ByteType)
+case object ShortTypeExpression extends DataTypeExpression(ShortType)
+case object IntegerTypeExpression extends DataTypeExpression(IntegerType)
+case object LongTypeExpression extends DataTypeExpression(LongType)
+case object DoubleTypeExpression extends DataTypeExpression(DoubleType)
+case object FloatTypeExpression extends DataTypeExpression(FloatType)
+
+object NumericTypeExpression {
+  /**
+   * Enables matching against NumericType for expressions:
+   * {{{
+   *   case Cast(child @ NumericType(), StringType) =>
+   * ...
+   * }}}
+   */
+  def unapply(e: Expression): Boolean = {

Review Comment:
   Well, on second thought, I think you are right that the implementation can be as easy as a few lines, though all the tests passed with the current approach.
   
   I changed the implementation to `e.dataType.isInstanceOf[NumericType]`, and the same for `IntegralType`.
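
   For reference, a minimal sketch of the simplified extractors (assuming the abstract `NumericType` and `IntegralType` classes are visible from the calling scope):

   ```scala
   import org.apache.spark.sql.catalyst.expressions.Expression
   import org.apache.spark.sql.types.{IntegralType, NumericType}

   // Delegate directly to the runtime class of the expression's data type
   // instead of enumerating every concrete numeric type.
   object NumericTypeExpression {
     def unapply(e: Expression): Boolean = e.dataType.isInstanceOf[NumericType]
   }

   object IntegralTypeExpression {
     def unapply(e: Expression): Boolean = e.dataType.isInstanceOf[IntegralType]
   }
   ```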






[GitHub] [spark] dongjoon-hyun commented on pull request #41707: [SPARK-44151][BUILD] Upgrade `commons-codec` to 1.16.0

2023-06-23 Thread via GitHub


dongjoon-hyun commented on PR #41707:
URL: https://github.com/apache/spark/pull/41707#issuecomment-1604550160

   Merged to master for Apache Spark 3.5.0. Thank you, @panbingkun and @HyukjinKwon.





[GitHub] [spark] dongjoon-hyun closed pull request #41707: [SPARK-44151][BUILD] Upgrade `commons-codec` to 1.16.0

2023-06-23 Thread via GitHub


dongjoon-hyun closed pull request #41707: [SPARK-44151][BUILD] Upgrade 
`commons-codec` to 1.16.0
URL: https://github.com/apache/spark/pull/41707





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab

2023-06-23 Thread via GitHub


dongjoon-hyun commented on code in PR #41709:
URL: https://github.com/apache/spark/pull/41709#discussion_r1240040307


##
core/src/main/scala/org/apache/spark/util/Utils.scala:
##
@@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with 
SparkClassUtils {
 }.map(threadInfoToThreadStackTrace)
   }
 
+  /** Return a heap dump. Used to capture dumps for the web UI */
+  def getHeapHistogram(): Array[String] = {
+val pid = String.valueOf(ProcessHandle.current().pid())
+val builder = new ProcessBuilder("jmap", "-histo:live", pid)

Review Comment:
   @pan3793. Here, the contract on `jmap` is just that some available `jmap` binary is found; it is not necessarily the same JDK's `jmap`.
   
   However, I'm interested in this case. Could you make a small reproducer with a Docker image?
   > The Spark executor process respects JAVA_HOME to find $JAVA_HOME/bin/java, but the subprocess cannot find jmap even though it knows where JAVA_HOME is.
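
   If the PATH-based contract ever becomes a problem, one possible variant is to prefer `$JAVA_HOME/bin/jmap` and fall back to the PATH (a hypothetical sketch, not what the PR implements):

   ```scala
   import java.io.File
   import scala.io.Source

   object HeapHistogramSketch {
     def getHeapHistogram(): Array[String] = {
       val pid = ProcessHandle.current().pid().toString
       // Prefer the jmap that ships with JAVA_HOME, if it exists and is
       // executable; otherwise rely on whatever jmap the PATH resolves.
       val jmap = sys.env.get("JAVA_HOME")
         .map(home => new File(new File(home, "bin"), "jmap"))
         .filter(_.canExecute)
         .map(_.getAbsolutePath)
         .getOrElse("jmap")
       val process = new ProcessBuilder(jmap, "-histo:live", pid).start()
       val histogram = Source.fromInputStream(process.getInputStream).getLines().toArray
       process.waitFor()
       histogram
     }
   }
   ```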






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41700: [SPARK-44139][SQL] Discard completely pushed down filters in group-based MERGE operations

2023-06-23 Thread via GitHub


dongjoon-hyun commented on code in PR #41700:
URL: https://github.com/apache/spark/pull/41700#discussion_r1240037361


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/GroupBasedRowLevelOperationScanPlanning.scala:
##
@@ -56,29 +62,67 @@ object GroupBasedRowLevelOperationScanPlanning extends 
Rule[LogicalPlan] with Pr
 s"""
|Pushing operators to ${relation.name}
|Pushed filters: $pushedFiltersStr
-   |Filters that were not pushed: ${remainingFilters.mkString(", ")}
+   |Filters evaluated on data source side: 
${evaluatedFilters.mkString(", ")}
+   |Filters evaluated on Spark side: ${postScanFilters.mkString(", ")}
|Output: ${output.mkString(", ")}
  """.stripMargin)
 
-  // replace DataSourceV2Relation with DataSourceV2ScanRelation for the 
row operation table
-  // there may be multiple read relations for UPDATEs that are rewritten 
as UNION
-  rd transform {
+  rd transformDown {
+// simplify the join condition in MERGE operations by discarding 
already evaluated filters
+case j @ Join(
+PhysicalOperation(_, _, r: DataSourceV2Relation), _, _, 
Some(cond), _)
+if rd.operation.command == MERGE && evaluatedFilters.nonEmpty && 
r.table.eq(table) =>
+  j.copy(condition = Some(optimizeMergeJoinCondition(cond, 
evaluatedFilters)))
+
+// replace DataSourceV2Relation with DataSourceV2ScanRelation for the 
row operation table
+// there may be multiple read relations for UPDATEs that are rewritten 
as UNION
 case r: DataSourceV2Relation if r.table eq table =>
   DataSourceV2ScanRelation(r, scan, 
PushDownUtils.toOutputAttrs(scan.readSchema(), r))
   }
   }
 
+  // pushes down the operation condition and returns the following information:
+  // - pushed down filters
+  // - filter expressions that are fully evaluated on the data source side
+  //   (such filters can be discarded and don't have to be evaluated again on 
the Spark side)
+  // - post-scan filter expressions that must be evaluated on the Spark side
+  //   (such filters can overlap with pushed down filters, e.g. Parquet row 
group filtering)
   private def pushFilters(
   cond: Expression,
   tableAttrs: Seq[AttributeReference],
-  scanBuilder: ScanBuilder): (Either[Seq[Filter], Seq[V2Filter]], 
Seq[Expression]) = {
+  scanBuilder: ScanBuilder)
+  : (Either[Seq[Filter], Seq[V2Filter]], Seq[Expression], Seq[Expression]) = {

Review Comment:
   In that case, since it's orthogonal to the main features, let's consider it later separately. For now, we don't need to change it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #41700: [SPARK-44139][SQL] Discard completely pushed down filters in group-based MERGE operations

2023-06-23 Thread via GitHub


aokolnychyi commented on code in PR #41700:
URL: https://github.com/apache/spark/pull/41700#discussion_r1240020952


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/GroupBasedRowLevelOperationScanPlanning.scala:
##
@@ -56,29 +62,67 @@ object GroupBasedRowLevelOperationScanPlanning extends 
Rule[LogicalPlan] with Pr
 s"""
|Pushing operators to ${relation.name}
|Pushed filters: $pushedFiltersStr
-   |Filters that were not pushed: ${remainingFilters.mkString(", ")}
+   |Filters evaluated on data source side: 
${evaluatedFilters.mkString(", ")}
+   |Filters evaluated on Spark side: ${postScanFilters.mkString(", ")}
|Output: ${output.mkString(", ")}
  """.stripMargin)
 
-  // replace DataSourceV2Relation with DataSourceV2ScanRelation for the 
row operation table
-  // there may be multiple read relations for UPDATEs that are rewritten 
as UNION
-  rd transform {
+  rd transformDown {
+// simplify the join condition in MERGE operations by discarding 
already evaluated filters
+case j @ Join(
+PhysicalOperation(_, _, r: DataSourceV2Relation), _, _, 
Some(cond), _)
+if rd.operation.command == MERGE && evaluatedFilters.nonEmpty && 
r.table.eq(table) =>
+  j.copy(condition = Some(optimizeMergeJoinCondition(cond, 
evaluatedFilters)))
+
+// replace DataSourceV2Relation with DataSourceV2ScanRelation for the 
row operation table
+// there may be multiple read relations for UPDATEs that are rewritten 
as UNION
 case r: DataSourceV2Relation if r.table eq table =>
   DataSourceV2ScanRelation(r, scan, 
PushDownUtils.toOutputAttrs(scan.readSchema(), r))
   }
   }
 
+  // pushes down the operation condition and returns the following information:
+  // - pushed down filters
+  // - filter expressions that are fully evaluated on the data source side
+  //   (such filters can be discarded and don't have to be evaluated again on 
the Spark side)
+  // - post-scan filter expressions that must be evaluated on the Spark side
+  //   (such filters can overlap with pushed down filters, e.g. Parquet row 
group filtering)
   private def pushFilters(
   cond: Expression,
   tableAttrs: Seq[AttributeReference],
-  scanBuilder: ScanBuilder): (Either[Seq[Filter], Seq[V2Filter]], 
Seq[Expression]) = {
+  scanBuilder: ScanBuilder)
+  : (Either[Seq[Filter], Seq[V2Filter]], Seq[Expression], Seq[Expression]) = {

Review Comment:
   I was debating something like this:
   
   ```
   type PushedFilterInfo = (Either[Seq[Filter], Seq[V2Filter]], Seq[Expression], Seq[Expression])
   ```
   
   ```
   private def pushFilters(
   cond: Expression,
   tableAttrs: Seq[AttributeReference],
   scanBuilder: ScanBuilder): PushedFilterInfo = {
 ...
   }
   ```
   
   But then decided it wasn't worth it unless you think it makes the code more 
readable.
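
   For comparison, a dedicated result class (a hypothetical alternative sketched here, not something from the PR) would name the three collections instead of relying on tuple positions; the import paths are assumed from the file under review:
   
   ```
   import org.apache.spark.sql.catalyst.expressions.Expression
   import org.apache.spark.sql.connector.expressions.filter.{Predicate => V2Filter}
   import org.apache.spark.sql.sources.Filter
   
   // Hypothetical sketch: name each collection that pushFilters returns.
   case class PushedFilterInfo(
       pushedFilters: Either[Seq[Filter], Seq[V2Filter]], // filters pushed to the source
       evaluatedFilters: Seq[Expression], // fully evaluated by the source; safe to discard
       postScanFilters: Seq[Expression])  // must still be evaluated on the Spark side
   ```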



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on pull request #41700: [SPARK-44139][SQL] Discard completely pushed down filters in group-based MERGE operations

2023-06-23 Thread via GitHub


aokolnychyi commented on PR #41700:
URL: https://github.com/apache/spark/pull/41700#issuecomment-1604524387

   Thank you, @dongjoon-hyun!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gatorsmile commented on a diff in pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM

2023-06-23 Thread via GitHub


gatorsmile commented on code in PR #41711:
URL: https://github.com/apache/spark/pull/41711#discussion_r1239940518


##
dev/error_message_refiner.py:
##
@@ -0,0 +1,219 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Utility for refining error messages based on LLM.
+
+Usage:
+    python error_message_refiner.py <error_class_name> [--gpt_version=<gpt_version>]
+
+Arguments:
+    <error_class_name>           Required.
+                                 The name of the error class to refine the messages for.
+                                 The list of error classes is located in
+                                 `core/src/main/resources/error/error-classes.json`.
+
+Options:
+    --gpt_version=<gpt_version>  Optional.
+                                 The version of Chat GPT to use for refining the error messages.
+                                 If not provided, the default version ("gpt-3.5-turbo") will be used.
+
+Example usage:
+    python error_message_refiner.py CANNOT_DECODE_URL --gpt_version=gpt-4
+
+Description:
+    This script refines error messages using an LLM-based approach.
+    It takes the name of the error class as a required argument and, optionally,
+    allows specifying the version of Chat GPT to use for refining the messages.
+
+Options:
+    --gpt_version: Specifies the version of Chat GPT.
+                   If not provided, the default version ("gpt-3.5-turbo") will be used.
+
+Note:
+- Ensure that the necessary dependencies are installed before running the script.
+- Ensure that a valid API key is entered in `api_key.txt`.
+- The refined error messages will be displayed in the console output.
+- To use the gpt-4 model, you need to join the waitlist. Please refer to
+  https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4 for more details.
+"""
+
+import argparse
+import json
+import openai
+import re
+import subprocess
+import random
+from typing import Tuple, Optional
+from sparktestsupport import SPARK_HOME
+
+PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/core/src/main/resources/error/error-classes.json"
+PATH_TO_API_KEY = f"{SPARK_HOME}/dev/api_key.txt"
+
+# You can obtain an API key from https://platform.openai.com/account/api-keys
+openai.api_key = open(PATH_TO_API_KEY).read().rstrip("\n")
+
+
+def _git_grep_files(search_string: str, exclude: str = None) -> str:

Review Comment:
   How about asking the LLM to generate the comments?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] steven-aerts commented on pull request #41688: [WIP][SPARK-44132][SQL][TESTS] nesting full outer join confuses codegen

2023-06-23 Thread via GitHub


steven-aerts commented on PR #41688:
URL: https://github.com/apache/spark/pull/41688#issuecomment-1604244490

   Eclipsed by #41712.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] steven-aerts closed pull request #41688: [WIP][SPARK-44132][SQL][TESTS] nesting full outer join confuses codegen

2023-06-23 Thread via GitHub


steven-aerts closed pull request #41688: [WIP][SPARK-44132][SQL][TESTS] nesting 
full outer join confuses codegen
URL: https://github.com/apache/spark/pull/41688


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] steven-aerts opened a new pull request, #41712: [SPARK-44132][SQL] join using Stream of column name fails codegen

2023-06-23 Thread via GitHub


steven-aerts opened a new pull request, #41712:
URL: https://github.com/apache/spark/pull/41712

   When nesting multiple joins using column names that are a `Stream`, the lazily evaluated stream can sometimes generate faulty codegen data, resulting in an NPE or a bad access of the data.
   
   
   
   ### What changes were proposed in this pull request?
   By materializing it as an `IndexedSeq` before passing it to the lower layers, we prevent generating wrong code that accesses the wrong data.
   
   
   
   ### Why are the changes needed?
   Otherwise the code will crash; see the two added test cases, which show an NPE and a bad `UnsafeRow` access.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No, only bug fix.
   
   
   
   ### How was this patch tested?
   A reproduction scenario was created and added to the code base.
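
   A minimal sketch of the fix pattern (hypothetical column names; `Stream` stands in for any lazily evaluated `Seq` that triggers the issue):
   
   ```
   // Sketch: force the lazy Stream of join column names into a strict
   // IndexedSeq before handing it to the lower layers, so every later
   // traversal during code generation sees the same, fully evaluated sequence.
   val usingColumns: Seq[String] = Stream("id", "name") // lazily evaluated
   val materialized: IndexedSeq[String] = usingColumns.toIndexedSeq // strict copy
   // materialized can now be traversed repeatedly without re-evaluation surprises.
   ```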
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] 0xdarkman commented on pull request #36374: [SPARK-39006][K8S] Show a directional error message for executor PVC dynamic allocation failure

2023-06-23 Thread via GitHub


0xdarkman commented on PR #36374:
URL: https://github.com/apache/spark/pull/36374#issuecomment-1604226893

   I am running this version 3.4.1 and it looks good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM

2023-06-23 Thread via GitHub


itholic commented on PR #41711:
URL: https://github.com/apache/spark/pull/41711#issuecomment-1604003249

   BTW, I set the category as `SQL` because it works based on the SQL error 
class, but I'm not sure if this is correct.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM

2023-06-23 Thread via GitHub


itholic commented on PR #41711:
URL: https://github.com/apache/spark/pull/41711#issuecomment-1603986988

   cc @gengliangwang @allisonwang-db @MaxGekk @cloud-fan could you take a look 
at this PR when you find some time?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic opened a new pull request, #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM

2023-06-23 Thread via GitHub


itholic opened a new pull request, #41711:
URL: https://github.com/apache/spark/pull/41711

   ### What changes were proposed in this pull request?
   
   This PR proposes adding a utility script that can help increase productivity 
in error message improvement tasks.
   
   When the error class name is passed to the script, detailed information corresponding to the error class (e.g., the error message and the source code of the function where the error actually occurs) is sent to ChatGPT to request an improvement of the error message.
   
   **Usage example:**
   
   ```
   haejoon.lee@spark dev % python error_message_refiner.py 
AMBIGUOUS_LATERAL_COLUMN_ALIAS
   [AMBIGUOUS_LATERAL_COLUMN_ALIAS]
   Before: Lateral column alias <lca> is ambiguous and has <count> matches.
   After: Lateral column alias <lca> is ambiguous and has <count> matches. Please ensure that the referenced lateral column alias is uniquely identified, or provide a more specific alias in your query.
   ```
   
   Developers only need to review the printed Before/After and apply it to the 
actual code if the content is appropriate. (Of course, extra refining should be 
done if necessary)
   
   ### Why are the changes needed?
   
   To enhance productivity in error message improvement tasks.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, this is an effort to improve productivity for developers.
   
   
   ### How was this patch tested?
   
   Manually tested.
   
   https://github.com/apache/spark/pull/41504 is an actual example that 
demonstrates the utilization of this script.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab

2023-06-23 Thread via GitHub


pan3793 commented on code in PR #41709:
URL: https://github.com/apache/spark/pull/41709#discussion_r1239566530


##
core/src/main/scala/org/apache/spark/util/Utils.scala:
##
@@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with 
SparkClassUtils {
 }.map(threadInfoToThreadStackTrace)
   }
 
+  /** Return a heap dump. Used to capture dumps for the web UI */
+  def getHeapHistogram(): Array[String] = {
+val pid = String.valueOf(ProcessHandle.current().pid())
+val builder = new ProcessBuilder("jmap", "-histo:live", pid)

Review Comment:
   I suppose `$JAVA_HOME/bin/jmap` (in the above case, `/opt/openjdk-11/bin/jmap` instead of `/opt/openjdk-8/bin/jmap`) should be used, rather than the one that appears first in `PATH`.
   
   And if only `JAVA_HOME` is set but there is no `jmap` in `PATH`, the invocation will fail, e.g.:
   
   The JDK was installed from a TGZ and unarchived to `/opt/openjdk-11` without being exposed on `PATH` or symlinked into `/usr/bin`; the Spark executor process respects `JAVA_HOME` to find `$JAVA_HOME/bin/java`, but the subprocess cannot find `jmap` even if it knows where `JAVA_HOME` is.
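
   A minimal sketch of the suggested resolution order (an illustration only, not the merged code; assumes Java 9+ for `ProcessHandle`): prefer `$JAVA_HOME/bin/jmap` when it is executable, and only then fall back to `PATH` lookup.
   
   ```
   import java.io.File
   
   // Sketch: resolve jmap next to the running JVM first, then fall back to PATH.
   def jmapCommand(): String =
     sys.env.get("JAVA_HOME")
       .map(home => new File(home, "bin/jmap"))
       .filter(_.canExecute)
       .map(_.getAbsolutePath)
       .getOrElse("jmap") // last resort: whatever PATH resolves
   
   val pid = String.valueOf(ProcessHandle.current().pid())
   val builder = new ProcessBuilder(jmapCommand(), "-histo:live", pid)
   ```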



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zzzzming95 commented on a diff in pull request #41154: [SPARK-43327][CORE][3.3] Trigger `committer.setupJob` before plan execute in `FileFormatWriter#write`

2023-06-23 Thread via GitHub


zzzzming95 commented on code in PR #41154:
URL: https://github.com/apache/spark/pull/41154#discussion_r1239560157


##
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala:
##
@@ -1275,4 +1275,27 @@ class DataFrameReaderWriterSuite extends QueryTest with 
SharedSparkSession with
   }
 }
   }
+
+  test("SPARK-43327: location exists when insertoverwrite fails") {
+withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") {
+  withTable("t", "t1") {
+sql("create table t(c1 int) using parquet")
+sql("create table t1(c2 long) using parquet")
+sql("INSERT OVERWRITE TABLE t1 select 644164")
+
+//  spark.sql("CREATE TABLE IF NOT EXISTS t(amt1 int) using ORC")

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab

2023-06-23 Thread via GitHub


dongjoon-hyun commented on code in PR #41709:
URL: https://github.com/apache/spark/pull/41709#discussion_r1239558578


##
core/src/main/scala/org/apache/spark/util/Utils.scala:
##
@@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with 
SparkClassUtils {
 }.map(threadInfoToThreadStackTrace)
   }
 
+  /** Return a heap dump. Used to capture dumps for the web UI */
+  def getHeapHistogram(): Array[String] = {
+val pid = String.valueOf(ProcessHandle.current().pid())
+val builder = new ProcessBuilder("jmap", "-histo:live", pid)

Review Comment:
   Like you pointed out, there is no problem running `jmap`, is there, @pan3793? IIRC, `jmap` is just a CLI command without any format change, especially if we are talking about JDKs here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] peter-toth commented on pull request #41677: [WIP][SPARK-35564][SQL] Improve subexpression elimination

2023-06-23 Thread via GitHub


peter-toth commented on PR #41677:
URL: https://github.com/apache/spark/pull/41677#issuecomment-1603934838

   The failure in `[Run / Linters, licenses, dependencies and documentation 
generation]` seems unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab

2023-06-23 Thread via GitHub


pan3793 commented on code in PR #41709:
URL: https://github.com/apache/spark/pull/41709#discussion_r1239521669


##
core/src/main/scala/org/apache/spark/util/Utils.scala:
##
@@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with 
SparkClassUtils {
 }.map(threadInfoToThreadStackTrace)
   }
 
+  /** Return a heap dump. Used to capture dumps for the web UI */
+  def getHeapHistogram(): Array[String] = {
+val pid = String.valueOf(ProcessHandle.current().pid())
+val builder = new ProcessBuilder("jmap", "-histo:live", pid)

Review Comment:
   @dongjoon-hyun I mean where to find `jmap`, not environment variable propagation.
   
   Consider the following case:
   
   There is a chance that multiple JDKs are installed on the machine; let's say we have 8 and 11 installed at
   ```
   /opt/openjdk-8
   /opt/openjdk-11
   ```
   and the commands of 8 are added to `PATH`: `PATH=/opt/openjdk-8/bin:$PATH`.
   
   If the Spark executor process runs with `JAVA_HOME=/opt/openjdk-11`, then with such an invocation, even though the sub-process knows `JAVA_HOME`, it still searches for `jmap` in `PATH`, so `/opt/openjdk-8/bin/jmap` will be used.
   
   ```
   new ProcessBuilder("jmap", ...)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab

2023-06-23 Thread via GitHub


dongjoon-hyun commented on PR #41709:
URL: https://github.com/apache/spark/pull/41709#issuecomment-1603874002

   Merged to master for Apache Spark 3.5.0.
   Thank you, @viirya and @pan3793 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab

2023-06-23 Thread via GitHub


dongjoon-hyun closed pull request #41709: [SPARK-44153][CORE][UI] Support `Heap 
Histogram` column in `Executors` tab
URL: https://github.com/apache/spark/pull/41709


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] EnricoMi commented on pull request #40122: [SPARK-42349][PYTHON] Support pandas cogroup with multiple df

2023-06-23 Thread via GitHub


EnricoMi commented on PR #40122:
URL: https://github.com/apache/spark/pull/40122#issuecomment-1603851057

   @HyukjinKwon @cloud-fan @sunchao @zhengruifeng this is so sad.
   
   Can we get some feedback on whether this contribution is welcome, or whether any more effort is just wasted?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #41710: [SPARK-43255][SQL] Replace the error class _LEGACY_ERROR_TEMP_2020 by an internal error

2023-06-23 Thread via GitHub


amaliujia commented on PR #41710:
URL: https://github.com/apache/spark/pull/41710#issuecomment-1603761304

   I did a search, and it looks like `constructorNotFoundError` may fit into internal errors: it is thrown when a constructor is not found during a reflection-related operation, which is also, in the end, hard to trigger through a user-facing API.
   
   So this change is fine to me.
   
   But I'd prefer @MaxGekk to take another look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab

2023-06-23 Thread via GitHub


dongjoon-hyun commented on code in PR #41709:
URL: https://github.com/apache/spark/pull/41709#discussion_r1239373825


##
core/src/main/scala/org/apache/spark/util/Utils.scala:
##
@@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with 
SparkClassUtils {
 }.map(threadInfoToThreadStackTrace)
   }
 
+  /** Return a heap dump. Used to capture dumps for the web UI */
+  def getHeapHistogram(): Array[String] = {
+val pid = String.valueOf(ProcessHandle.current().pid())
+val builder = new ProcessBuilder("jmap", "-histo:live", pid)

Review Comment:
   And, the child process inherits the current process's environment, @pan3793 .
   - https://docs.oracle.com/javase/8/docs/api/java/lang/ProcessBuilder.html
   > an environment, which is a system-dependent mapping from variables to 
values. The initial value is a copy of the environment of the current process 
(see 
[System.getenv()](https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#getenv--)).
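
   A quick way to confirm the inheritance (a sketch; assumes a Unix-like system with `printenv` on `PATH`):
   
   ```
   // The child sees the parent's JAVA_HOME because the environment map is
   // copied from the current process by default.
   val p = new ProcessBuilder("printenv", "JAVA_HOME").inheritIO().start()
   p.waitFor() // prints the parent's JAVA_HOME, or exits non-zero if unset
   ```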



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #41708: [SPARK-44135][PYTHON][CONNECT][DOCS] Document Spark Connect only API in PySpark

2023-06-23 Thread via GitHub


HyukjinKwon closed pull request #41708: [SPARK-44135][PYTHON][CONNECT][DOCS] 
Document Spark Connect only API in PySpark
URL: https://github.com/apache/spark/pull/41708


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #41708: [SPARK-44135][PYTHON][CONNECT][DOCS] Document Spark Connect only API in PySpark

2023-06-23 Thread via GitHub


HyukjinKwon commented on PR #41708:
URL: https://github.com/apache/spark/pull/41708#issuecomment-1603750548

   Merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab

2023-06-23 Thread via GitHub


dongjoon-hyun commented on code in PR #41709:
URL: https://github.com/apache/spark/pull/41709#discussion_r1239373194


##
core/src/main/scala/org/apache/spark/util/Utils.scala:
##
@@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with 
SparkClassUtils {
 }.map(threadInfoToThreadStackTrace)
   }
 
+  /** Return a heap dump. Used to capture dumps for the web UI */
+  def getHeapHistogram(): Array[String] = {
+val pid = String.valueOf(ProcessHandle.current().pid())
+val builder = new ProcessBuilder("jmap", "-histo:live", pid)

Review Comment:
   We are inside `JVM` already.
   ```
   $ docker run -it --rm openjdk:11 bash
   root@d7d08a90ec41:/# jshell
   Jun 23, 2023 6:30:14 AM java.util.prefs.FileSystemPreferences$1 run
   INFO: Created user preferences directory.
   |  Welcome to JShell -- Version 11.0.16
   |  For an introduction type: /help intro
   
   jshell> var p = new ProcessBuilder("jmap", "-histo:live", String.valueOf(ProcessHandle.current().pid())).start()
   p ==> Process[pid=77, exitValue="not exited"]
   
   jshell> var r = new BufferedReader(new InputStreamReader(p.getInputStream()))
   r ==> java.io.BufferedReader@5fa7e7ff
   
   jshell> var line = ""
   line ==> ""
   
   jshell> while (line != null) { line = r.readLine(); System.out.println(line); }
    num     #instances         #bytes  class name (module)
   -------------------------------------------------------
      1:          8938         412040  [B (java.base@11.0.16)
      2:          8705         208920  java.lang.String (java.base@11.0.16)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab

2023-06-23 Thread via GitHub


pan3793 commented on code in PR #41709:
URL: https://github.com/apache/spark/pull/41709#discussion_r1239357126


##
core/src/main/scala/org/apache/spark/util/Utils.scala:
##
@@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with 
SparkClassUtils {
 }.map(threadInfoToThreadStackTrace)
   }
 
+  /** Return a heap dump. Used to capture dumps for the web UI */
+  def getHeapHistogram(): Array[String] = {
+val pid = String.valueOf(ProcessHandle.current().pid())
+val builder = new ProcessBuilder("jmap", "-histo:live", pid)

Review Comment:
   should it respect `JAVA_HOME` then `PATH`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org