[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM
HyukjinKwon commented on code in PR #41711:
URL: https://github.com/apache/spark/pull/41711#discussion_r1240603584

## dev/error_message_refiner.py:

```
@@ -0,0 +1,235 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Utility for refining error messages based on LLM.
+
+Usage:
+    python error_message_refiner.py <error_class_name> [--gpt_version=<version>]
+
+Arguments:
+    <error_class_name>  Required.
+                        The name of the error class to refine the messages for.
+                        The list of error classes is located in
+                        `core/src/main/resources/error/error-classes.json`.
+
+Options:
+    --gpt_version=<version>  Optional.
+                             The version of Chat GPT to use for refining the error messages.
+                             If not provided, the default version("gpt-3.5-turbo") will be used.
+
+Example usage:
+    python error_message_refiner.py CANNOT_DECODE_URL --gpt_version=gpt-4
+
+Description:
+    This script refines error messages using the LLM based approach.
+    It takes the name of the error class as a required argument and, optionally,
+    allows specifying the version of Chat GPT to use for refining the messages.
+
+Options:
+    --gpt_version: Specifies the version of Chat GPT.
+                   If not provided, the default version("gpt-3.5-turbo") will be used.
+
+Note:
+- Ensure that the necessary dependencies are installed before running the script.
+- Ensure that the valid API key is entered in the `api-key.txt`.
+- The refined error messages will be displayed in the console output.
+- To use the gpt-4 model, you need to join the waitlist. Please refer to
+  https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4 for more details.
+"""
+
+import argparse
+import json
+import openai
+import re
+import subprocess
+import random
+from typing import Tuple, Optional
```

Review Comment: add a newline after this.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM
HyukjinKwon commented on code in PR #41711:
URL: https://github.com/apache/spark/pull/41711#discussion_r1240603499

## dev/api_key.txt:

```
@@ -0,0 +1 @@
+# Please REMOVE this comment and enter the API key here. You can obtain an API key from `https://platform.openai.com/account/api-keys`.
```

Review Comment: Can we use an env variable instead?
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM
HyukjinKwon commented on code in PR #41711:
URL: https://github.com/apache/spark/pull/41711#discussion_r1240603443

## dev/error_message_refiner.py:

```
@@ -0,0 +1,235 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Utility for refining error messages based on LLM.
+
+Usage:
+    python error_message_refiner.py <error_class_name> [--gpt_version=<version>]
+
+Arguments:
+    <error_class_name>  Required.
+                        The name of the error class to refine the messages for.
+                        The list of error classes is located in
+                        `core/src/main/resources/error/error-classes.json`.
+
+Options:
+    --gpt_version=<version>  Optional.
+                             The version of Chat GPT to use for refining the error messages.
+                             If not provided, the default version("gpt-3.5-turbo") will be used.
+
+Example usage:
+    python error_message_refiner.py CANNOT_DECODE_URL --gpt_version=gpt-4
+
+Description:
+    This script refines error messages using the LLM based approach.
+    It takes the name of the error class as a required argument and, optionally,
+    allows specifying the version of Chat GPT to use for refining the messages.
+
+Options:
+    --gpt_version: Specifies the version of Chat GPT.
+                   If not provided, the default version("gpt-3.5-turbo") will be used.
+
+Note:
+- Ensure that the necessary dependencies are installed before running the script.
```

Review Comment: What are the dependencies?
[GitHub] [spark] bersprockets commented on a diff in pull request #41712: [SPARK-44132][SQL] join using Stream of column name fails codegen
bersprockets commented on code in PR #41712:
URL: https://github.com/apache/spark/pull/41712#discussion_r1240569151

## sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:

```
@@ -1685,4 +1685,24 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       checkAnswer(sql(query), expected)
     }
   }
+
+  test("SPARK-44132: FULL OUTER JOIN by streamed column name fails with NPE") {
```

Review Comment: Did you manually test your original Java reproduction case to ensure that your fix also fixes that case?
[GitHub] [spark] itholic commented on a diff in pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM
itholic commented on code in PR #41711:
URL: https://github.com/apache/spark/pull/41711#discussion_r1240578537

## dev/error_message_refiner.py:

```
@@ -0,0 +1,219 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Utility for refining error messages based on LLM.
+
+Usage:
+    python error_message_refiner.py <error_class_name> [--gpt_version=<version>]
+
+Arguments:
+    <error_class_name>  Required.
+                        The name of the error class to refine the messages for.
+                        The list of error classes is located in
+                        `core/src/main/resources/error/error-classes.json`.
+
+Options:
+    --gpt_version=<version>  Optional.
+                             The version of Chat GPT to use for refining the error messages.
+                             If not provided, the default version("gpt-3.5-turbo") will be used.
+
+Example usage:
+    python error_message_refiner.py CANNOT_DECODE_URL --gpt_version=gpt-4
+
+Description:
+    This script refines error messages using the LLM based approach.
+    It takes the name of the error class as a required argument and, optionally,
+    allows specifying the version of Chat GPT to use for refining the messages.
+
+Options:
+    --gpt_version: Specifies the version of Chat GPT.
+                   If not provided, the default version("gpt-3.5-turbo") will be used.
+
+Note:
+- Ensure that the necessary dependencies are installed before running the script.
+- Ensure that the valid API key is entered in the `api-key.txt`.
+- The refined error messages will be displayed in the console output.
+- To use the gpt-4 model, you need to join the waitlist. Please refer to
+  https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4 for more details.
+"""
+
+import argparse
+import json
+import openai
+import re
+import subprocess
+import random
+from typing import Tuple, Optional
+from sparktestsupport import SPARK_HOME
+
+PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/core/src/main/resources/error/error-classes.json"
+PATH_TO_API_KEY = f"{SPARK_HOME}/dev/api_key.txt"
+
+# You can obtain an API key from https://platform.openai.com/account/api-keys
+openai.api_key = open(PATH_TO_API_KEY).read().rstrip("\n")
+
+
+def _git_grep_files(search_string: str, exclude: str = None) -> str:
```

Review Comment: Awesome. Just added the comments, thanks!
[GitHub] [spark] szehon-ho commented on a diff in pull request #41614: [SPARK-44060][SQL] Code-gen for build side outer shuffled hash join
szehon-ho commented on code in PR #41614:
URL: https://github.com/apache/spark/pull/41614#discussion_r1240571397

## sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:

```
@@ -578,7 +594,9 @@ case class ShuffledHashJoinExec(
        |
        |  if (!$foundMatch) {
        |    $buildRow = null;
-       |    $consumeFullOuterJoinRow();
+       |    if ($isFullOuterJoin) {
+       |      $consumeFullOuterJoinRow();
+       |    }
```

Review Comment: I ended up pulling this into a separate if block, leaving `buildRow = null` set in both cases when `!foundMatch`. I was thinking it's cleaner to clear the state, and it's more in line with the uniqueKey side. Let me know what you think.
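A minimal Scala sketch of the restructuring the comment describes (variable names are illustrative, not the PR's exact code): the codegen template clears `buildRow` whenever no match is found, and consumes the full-outer row only for full outer joins.

```scala
// Returns the generated (Java) snippet for the build side's no-match handling.
// The parameters are the codegen-time names of the generated variables/methods.
def buildSideNoMatchCode(
    foundMatch: String,
    buildRow: String,
    isFullOuterJoin: String,
    consumeFullOuterJoinRow: String): String = {
  s"""
     |if (!$foundMatch) {
     |  $buildRow = null;
     |}
     |if (!$foundMatch && $isFullOuterJoin) {
     |  $consumeFullOuterJoinRow();
     |}
   """.stripMargin
}
```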
[GitHub] [spark] dongjoon-hyun commented on pull request #36374: [SPARK-39006][K8S] Show a directional error message for executor PVC dynamic allocation failure
dongjoon-hyun commented on PR #36374:
URL: https://github.com/apache/spark/pull/36374#issuecomment-1605232031

Thank you for confirming. Apache Spark 3.4.1 has been released officially.
- https://lists.apache.org/list.html?d...@spark.apache.org
- https://spark.apache.org/downloads.html
[GitHub] [spark] szehon-ho commented on a diff in pull request #41614: [SPARK-44060][SQL] Code-gen for build side outer shuffled hash join
szehon-ho commented on code in PR #41614:
URL: https://github.com/apache/spark/pull/41614#discussion_r1240569908

## sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:

```
@@ -578,7 +594,9 @@ case class ShuffledHashJoinExec(
        |
        |  if (!$foundMatch) {
        |    $buildRow = null;
-       |    $consumeFullOuterJoinRow();
+       |    if ($isFullOuterJoin) {
+       |      $consumeFullOuterJoinRow();
+       |    }
```

Review Comment: I was afraid of side effects if I don't set buildRow to null when it's not a fullOuterJoin. Let me check.
[GitHub] [spark] bersprockets commented on pull request #41712: [SPARK-44132][SQL] join using Stream of column name fails codegen
bersprockets commented on PR #41712:
URL: https://github.com/apache/spark/pull/41712#issuecomment-1605228576

For the title, may I suggest:
```
[SPARK-44132][SQL] Materialize `Stream` of join column names to avoid codegen failure
```
For the description, may I suggest:

### What changes were proposed in this pull request?

Materialize passed join columns as an `IndexedSeq` before passing them to the lower layers.

### Why are the changes needed?

When nesting multiple full outer joins using column names that are a `Stream`, the code generator will generate faulty code, resulting in an NPE or bad `UnsafeRow` access at runtime. See the 2 added test cases.

etc. (the rest is good)
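A short Scala sketch of the failure mode and fix described in the suggested description (variable names are illustrative): a `Stream` evaluates its tail lazily, so transformations applied during planning may not touch every column name before code generation; converting to an `IndexedSeq` forces full materialization up front.

```scala
// Lazily-evaluated column names, as a caller might accidentally pass them:
val usingColumns: Seq[String] = Stream("id", "name")

// The fix: materialize before handing the columns to lower layers,
// so every element is evaluated eagerly during planning.
val materialized: IndexedSeq[String] = usingColumns.toIndexedSeq

// A join such as df1.join(df2, materialized, "full_outer") would then reach
// codegen with a stable, fully-evaluated collection of column names.
```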
[GitHub] [spark] szehon-ho commented on a diff in pull request #41614: [SPARK-44060][SQL] Code-gen for build side outer shuffled hash join
szehon-ho commented on code in PR #41614:
URL: https://github.com/apache/spark/pull/41614#discussion_r1240567843

## sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:

```
@@ -364,7 +366,13 @@ case class ShuffledHashJoinExec(
   override def doProduce(ctx: CodegenContext): String = {
     // Specialize `doProduce` code for full outer join, because full outer join needs to
     // iterate streamed and build side separately.
```

Review Comment: Good catch
[GitHub] [spark] amaliujia commented on pull request #41716: [SPARK-44164][SQL] Extract toAttribute method from StructField to Util class
amaliujia commented on PR #41716:
URL: https://github.com/apache/spark/pull/41716#issuecomment-1605197640

@hvanhovell @cloud-fan
[GitHub] [spark] amaliujia opened a new pull request, #41716: [SPARK-44164][SQL] Extract toAttribute method from StructField to Util class
amaliujia opened a new pull request, #41716:
URL: https://github.com/apache/spark/pull/41716

### What changes were proposed in this pull request?

Extract the toAttribute method from StructField to a Util class.

### Why are the changes needed?

StructField should be moved to `sql/api`, so `toAttribute` should be kept in Catalyst.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.
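For context, a minimal Scala sketch of what such an extraction could look like (the object name `StructFieldUtils` is illustrative, not necessarily the PR's): the Catalyst-specific conversion moves out of `StructField` so the class itself can live in the shared `sql/api` module.

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.types.StructField

// Hypothetical utility object in Catalyst; the conversion mirrors what
// StructField.toAttribute historically did: wrap the field's name, type,
// nullability, and metadata in an AttributeReference with a fresh exprId.
object StructFieldUtils {
  def toAttribute(field: StructField): Attribute =
    AttributeReference(field.name, field.dataType, field.nullable, field.metadata)()
}
```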
[GitHub] [spark] github-actions[bot] closed pull request #38202: [SPARK-40763][K8S] Should expose driver service name to config for user features
github-actions[bot] closed pull request #38202: [SPARK-40763][K8S] Should expose driver service name to config for user features
URL: https://github.com/apache/spark/pull/38202
[GitHub] [spark] github-actions[bot] commented on pull request #38885: [WIP][SPARK-41367][SQL] Enable V2 file tables in read paths in session catalog
github-actions[bot] commented on PR #38885:
URL: https://github.com/apache/spark/pull/38885#issuecomment-1605183251

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #39985: [SPARK-42412][WIP] Initial prototype implementation of PySpark ML via Spark connect
github-actions[bot] closed pull request #39985: [SPARK-42412][WIP] Initial prototype implementation of PySpark ML via Spark connect
URL: https://github.com/apache/spark/pull/39985
[GitHub] [spark] github-actions[bot] closed pull request #40391: [SPARK-42766][YARN] YarnAllocator filter excluded nodes when launching containers
github-actions[bot] closed pull request #40391: [SPARK-42766][YARN] YarnAllocator filter excluded nodes when launching containers
URL: https://github.com/apache/spark/pull/40391
[GitHub] [spark] github-actions[bot] closed pull request #40122: [SPARK-42349][PYTHON] Support pandas cogroup with multiple df
github-actions[bot] closed pull request #40122: [SPARK-42349][PYTHON] Support pandas cogroup with multiple df
URL: https://github.com/apache/spark/pull/40122
[GitHub] [spark] github-actions[bot] closed pull request #40411: [SPARK-42781][DOCS][PYTHON] provide one format for writing to kafka
github-actions[bot] closed pull request #40411: [SPARK-42781][DOCS][PYTHON] provide one format for writing to kafka
URL: https://github.com/apache/spark/pull/40411
[GitHub] [spark] github-actions[bot] closed pull request #40419: [SPARK-42789][SQL] Rewrite multiple GetJsonObjects to a JsonTuple if their json expressions are the same
github-actions[bot] closed pull request #40419: [SPARK-42789][SQL] Rewrite multiple GetJsonObjects to a JsonTuple if their json expressions are the same
URL: https://github.com/apache/spark/pull/40419
[GitHub] [spark] dongjoon-hyun closed pull request #41715: [SPARK-44163][PYTHON] Handle `ModuleNotFoundError` in addition to `ImportError`
dongjoon-hyun closed pull request #41715: [SPARK-44163][PYTHON] Handle `ModuleNotFoundError` in addition to `ImportError`
URL: https://github.com/apache/spark/pull/41715
[GitHub] [spark] dtenedor commented on pull request #41486: [SPARK-43986][SQL] Create error classes for HyperLogLog function call failures
dtenedor commented on PR #41486:
URL: https://github.com/apache/spark/pull/41486#issuecomment-1605060822

Hi @MaxGekk, can we trouble you to help us merge this when you have a moment, please? :)
[GitHub] [spark] dongjoon-hyun opened a new pull request, #41715: [SPARK-44163][PYTHON] Handle `ModuleNotFoundError` in addition to `ImportError`
dongjoon-hyun opened a new pull request, #41715:
URL: https://github.com/apache/spark/pull/41715

…

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] dtenedor commented on a diff in pull request #41486: [SPARK-43986][SQL] Create error classes for HyperLogLog function call failures
dtenedor commented on code in PR #41486:
URL: https://github.com/apache/spark/pull/41486#discussion_r1240463584

## core/src/main/resources/error/error-classes.json:

```
@@ -757,6 +757,21 @@
       "The expression <sqlExpr> cannot be used as a grouping expression because its data type <dataType> is not an orderable data type."
     ]
   },
+  "HLL_INVALID_INPUT_SKETCH_BUFFER" : {
+    "message" : [
+      "Invalid call to <function>; only valid HLL sketch buffers are supported as inputs (such as those produced by the HLL_SKETCH_AGG function)."
```

Review Comment: Done.
[GitHub] [spark] dtenedor commented on a diff in pull request #41486: [SPARK-43986][SQL] Create error classes for HyperLogLog function call failures
dtenedor commented on code in PR #41486:
URL: https://github.com/apache/spark/pull/41486#discussion_r1240463321

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datasketchesExpressions.scala:

```
@@ -98,15 +104,22 @@ case class HllUnion(first: Expression, second: Expression, third: Expression)
   override def dataType: DataType = BinaryType
 
   override def nullSafeEval(value1: Any, value2: Any, value3: Any): Any = {
-    val sketch1 = HllSketch.heapify(Memory.wrap(value1.asInstanceOf[Array[Byte]]))
-    val sketch2 = HllSketch.heapify(Memory.wrap(value2.asInstanceOf[Array[Byte]]))
+    val sketch1 = try {
+      HllSketch.heapify(Memory.wrap(value1.asInstanceOf[Array[Byte]]))
+    } catch {
+      case _: java.lang.Error =>
+        throw QueryExecutionErrors.hllInvalidInputSketchBuffer(prettyName)
+    }
+    val sketch2 = try {
+      HllSketch.heapify(Memory.wrap(value2.asInstanceOf[Array[Byte]]))
+    } catch {
+      case _: java.lang.Error =>
+        throw QueryExecutionErrors.hllInvalidInputSketchBuffer(prettyName)
+    }
     val allowDifferentLgConfigK = value3.asInstanceOf[Boolean]
     if (!allowDifferentLgConfigK && sketch1.getLgConfigK != sketch2.getLgConfigK) {
-      throw new UnsupportedOperationException(
-        "Sketches have different lgConfigK values: " +
-        s"${sketch1.getLgConfigK} and ${sketch2.getLgConfigK}. " +
-        "Set allowDifferentLgConfigK to true to enable unions of " +
-        "different lgConfigK values.")
+      throw QueryExecutionErrors.hllUnionDifferentLgK(
+        sketch1.getLgConfigK, sketch2.getLgConfigK, function = "HLL_UNION")
```

Review Comment: No worries, fixed!
[GitHub] [spark] dtenedor commented on a diff in pull request #41486: [SPARK-43986][SQL] Create error classes for HyperLogLog function call failures
dtenedor commented on code in PR #41486:
URL: https://github.com/apache/spark/pull/41486#discussion_r1240463194

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala:

```
@@ -189,21 +189,20 @@ object HllSketchAgg {
   private val minLgConfigK = 4
   private val maxLgConfigK = 21
 
-  // Replicate Datasketche's HllUtil's checkLgK implementation, as we can't reference it directly
+  // Replicate Datasketches' HllUtil's checkLgK implementation, as we can't reference it directly.
   def checkLgK(lgConfigK: Int): Unit = {
     if (lgConfigK < minLgConfigK || lgConfigK > maxLgConfigK) {
-      throw new SketchesArgumentException(
-        s"Log K must be between $minLgConfigK and $maxLgConfigK, inclusive: " + lgConfigK)
+      throw QueryExecutionErrors.hllInvalidLgK(function = "HLL_SKETCH_AGG",
```

Review Comment: Done.
[GitHub] [spark] mkaravel commented on a diff in pull request #41486: [SPARK-43986][SQL] Create error classes for HyperLogLog function call failures
mkaravel commented on code in PR #41486:
URL: https://github.com/apache/spark/pull/41486#discussion_r1240413693

## core/src/main/resources/error/error-classes.json:

```
@@ -757,6 +757,21 @@
       "The expression <sqlExpr> cannot be used as a grouping expression because its data type <dataType> is not an orderable data type."
     ]
   },
+  "HLL_INVALID_INPUT_SKETCH_BUFFER" : {
+    "message" : [
+      "Invalid call to <function>; only valid HLL sketch buffers are supported as inputs (such as those produced by the HLL_SKETCH_AGG function)."
```

Review Comment:
```suggestion
      "Invalid call to <function>; only valid HLL sketch buffers are supported as inputs (such as those produced by the `hll_sketch_agg` expression)."
```
The functions/expression names are all lowercase (see the `prettyName` string for the expressions).

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala:

```
@@ -189,21 +189,20 @@ object HllSketchAgg {
   private val minLgConfigK = 4
   private val maxLgConfigK = 21
 
-  // Replicate Datasketche's HllUtil's checkLgK implementation, as we can't reference it directly
+  // Replicate Datasketches' HllUtil's checkLgK implementation, as we can't reference it directly.
   def checkLgK(lgConfigK: Int): Unit = {
     if (lgConfigK < minLgConfigK || lgConfigK > maxLgConfigK) {
-      throw new SketchesArgumentException(
-        s"Log K must be between $minLgConfigK and $maxLgConfigK, inclusive: " + lgConfigK)
+      throw QueryExecutionErrors.hllInvalidLgK(function = "HLL_SKETCH_AGG",
```

Review Comment:
```suggestion
      throw QueryExecutionErrors.hllInvalidLgK(function = "hll_sketch_agg",
```
I missed this before (see also my other comment): we want to use the pretty name here, which is all lowercase.

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datasketchesExpressions.scala:

```
@@ -98,15 +104,22 @@ case class HllUnion(first: Expression, second: Expression, third: Expression)
   override def dataType: DataType = BinaryType
 
   override def nullSafeEval(value1: Any, value2: Any, value3: Any): Any = {
-    val sketch1 = HllSketch.heapify(Memory.wrap(value1.asInstanceOf[Array[Byte]]))
-    val sketch2 = HllSketch.heapify(Memory.wrap(value2.asInstanceOf[Array[Byte]]))
+    val sketch1 = try {
+      HllSketch.heapify(Memory.wrap(value1.asInstanceOf[Array[Byte]]))
+    } catch {
+      case _: java.lang.Error =>
+        throw QueryExecutionErrors.hllInvalidInputSketchBuffer(prettyName)
+    }
+    val sketch2 = try {
+      HllSketch.heapify(Memory.wrap(value2.asInstanceOf[Array[Byte]]))
+    } catch {
+      case _: java.lang.Error =>
+        throw QueryExecutionErrors.hllInvalidInputSketchBuffer(prettyName)
+    }
     val allowDifferentLgConfigK = value3.asInstanceOf[Boolean]
     if (!allowDifferentLgConfigK && sketch1.getLgConfigK != sketch2.getLgConfigK) {
-      throw new UnsupportedOperationException(
-        "Sketches have different lgConfigK values: " +
-        s"${sketch1.getLgConfigK} and ${sketch2.getLgConfigK}. " +
-        "Set allowDifferentLgConfigK to true to enable unions of " +
-        "different lgConfigK values.")
+      throw QueryExecutionErrors.hllUnionDifferentLgK(
+        sketch1.getLgConfigK, sketch2.getLgConfigK, function = "HLL_UNION")
```

Review Comment:
```suggestion
        sketch1.getLgConfigK, sketch2.getLgConfigK, function = "hll_union")
```
Again, apologies for missing this in the previous rounds. I think we need to use the pretty name, which is all lowercase.
[GitHub] [spark] szehon-ho commented on pull request #41683: [SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL
szehon-ho commented on PR #41683:
URL: https://github.com/apache/spark/pull/41683#issuecomment-1604982731

Hm, the build failure doesn't seem related:
```
python/pyspark/mllib/clustering.py:781: error: Decorated property not supported [misc]
python/pyspark/mllib/clustering.py:949: error: Decorated property not supported [misc]
Found 199 errors in 24 files (checked 668 source files)
1
Error: Process completed with exit code 1.
```
[GitHub] [spark] amaliujia commented on pull request #41714: [SPARK-44160][SQL][CONNECT] Extract shared code from StructType
amaliujia commented on PR #41714:
URL: https://github.com/apache/spark/pull/41714#issuecomment-1604972107

cc @hvanhovell @cloud-fan
[GitHub] [spark] siying commented on pull request #41578: [SPARK-44044][SS] Improve Error message for Window functions with streaming
siying commented on PR #41578:
URL: https://github.com/apache/spark/pull/41578#issuecomment-1604966953

@MaxGekk I addressed the comments. The CI failure is in the Python lint; I don't know why it happens, but I don't see how it is related.
[GitHub] [spark] amaliujia opened a new pull request, #41714: [SPARK-44160][SQL][CONNECT] Extract shared code from StructType
amaliujia opened a new pull request, #41714:
URL: https://github.com/apache/spark/pull/41714

### What changes were proposed in this pull request?

StructType has some methods that require the Catalyst parser and Catalyst expressions. We are not planning to move the parser and expressions to the shared module, so we need to split the code to share as much code as possible between the Scala client and Catalyst.

### Why are the changes needed?

To simplify the DataType interface as a whole, so that these types become the shared interface.

### Does this PR introduce _any_ user-facing change?

This will break binary compatibility on StructType.

### How was this patch tested?

N/A
[GitHub] [spark] dongjoon-hyun closed pull request #41713: [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts`
dongjoon-hyun closed pull request #41713: [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts`
URL: https://github.com/apache/spark/pull/41713
[GitHub] [spark] dongjoon-hyun commented on pull request #41713: [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts`
dongjoon-hyun commented on PR #41713:
URL: https://github.com/apache/spark/pull/41713#issuecomment-1604941186

Thank you so much! Merged to master/3.4/3.3.
[GitHub] [spark] vicennial commented on pull request #41701: [CONNECT][SPARK-44146] Isolate Spark Connect Session jars and classfiles
vicennial commented on PR #41701:
URL: https://github.com/apache/spark/pull/41701#issuecomment-1604932773

The PR is ready to review (the failing tests are flaky; I've submitted a rerun request).
[GitHub] [spark] dongjoon-hyun commented on pull request #41713: [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts`
dongjoon-hyun commented on PR #41713:
URL: https://github.com/apache/spark/pull/41713#issuecomment-1604904486

Could you review this PR when you have some time, @viirya?
[GitHub] [spark] jdesjean commented on pull request #41443: [SPARK-43923][CONNECT] Post listenerBus events during ExecutePlanRequest
jdesjean commented on PR #41443:
URL: https://github.com/apache/spark/pull/41443#issuecomment-1604796450

> > @beliefer the event API is open by design, it is made for others to use it. For example thriftserver uses it. This particular PR will be foundation on which we will be building a UI for connect.
>
> Thank you for the explanation. Can existing event types be shared by thriftserver and connect?

The Thrift events are flexible and semantically the same as the Connect events. It's technically possible to reuse them; however, there are some differences between the two implementations. I believe it's preferable to create a new set of events so that both projects (Thrift and Connect) can evolve separately.
[GitHub] [spark] dongjoon-hyun opened a new pull request, #41713: [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts`
dongjoon-hyun opened a new pull request, #41713:
URL: https://github.com/apache/spark/pull/41713

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] amaliujia commented on a diff in pull request #41559: [SPARK-44030][SQL] Implement DataTypeExpression to offer Unapply for expression
amaliujia commented on code in PR #41559:
URL: https://github.com/apache/spark/pull/41559#discussion_r1240066813

## sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeExpression.scala:

```
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.types
+
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.errors.QueryExecutionErrors
+
+abstract class DataTypeExpression(val dataType: DataType) {
+  /**
+   * Enables matching against DataType for expressions:
+   * {{{
+   *   case Cast(child @ BinaryType(), StringType) =>
+   *     ...
+   * }}}
+   */
+  private[sql] def unapply(e: Expression): Boolean = e.dataType == dataType
+}
+
+abstract class AtomicTypeExpression(override val dataType: DataType)
+  extends DataTypeExpression(dataType)
+
+abstract class DatetimeTypeExpression(override val dataType: DataType)
+  extends AtomicTypeExpression(dataType)
+
+case object BooleanTypeExpression extends DataTypeExpression(BooleanType)
+case object StringTypeExpression extends DataTypeExpression(StringType)
+case object TimestampTypeExpression extends DataTypeExpression(TimestampType)
+case object DateTypeExpression extends DataTypeExpression(DateType)
+case object ByteTypeExpression extends DataTypeExpression(ByteType)
+case object ShortTypeExpression extends DataTypeExpression(ShortType)
+case object IntegerTypeExpression extends DataTypeExpression(IntegerType)
+case object LongTypeExpression extends DataTypeExpression(LongType)
+case object DoubleTypeExpression extends DataTypeExpression(DoubleType)
+case object FloatTypeExpression extends DataTypeExpression(FloatType)
+
+object NumericTypeExpression {
+  /**
+   * Enables matching against NumericType for expressions:
+   * {{{
+   *   case Cast(child @ NumericType(), StringType) =>
+   *     ...
+   * }}}
+   */
+  def unapply(e: Expression): Boolean = {
```

Review Comment: Well, on second thought, I think you are right that the implementation can be as simple as a few lines (all the tests have passed with the current approach too). I changed the implementation to `e.dataType.isInstanceOf[NumericType]`, and the same for `IntegralType`.
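The simplification described in the comment can be sketched as follows (a minimal sketch placed in the same `org.apache.spark.sql.types` package as the quoted file, not necessarily the PR's final code): the extractor delegates to the runtime type of the expression's `dataType` instead of enumerating each concrete numeric type.

```scala
package org.apache.spark.sql.types

import org.apache.spark.sql.catalyst.expressions.Expression

// Matchers that delegate to the runtime type of the expression's dataType.
object NumericTypeExpression {
  def unapply(e: Expression): Boolean = e.dataType.isInstanceOf[NumericType]
}

object IntegralTypeExpression {
  def unapply(e: Expression): Boolean = e.dataType.isInstanceOf[IntegralType]
}
```

A caller can then write `case e @ NumericTypeExpression() => ...` in a pattern match, exactly as the scaladoc in the quoted file shows.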
[GitHub] [spark] dongjoon-hyun commented on pull request #41707: [SPARK-44151][BUILD] Upgrade `commons-codec` to 1.16.0
dongjoon-hyun commented on PR #41707:
URL: https://github.com/apache/spark/pull/41707#issuecomment-1604550160

Merged to master for Apache Spark 3.5.0. Thank you, @panbingkun and @HyukjinKwon.
[GitHub] [spark] dongjoon-hyun closed pull request #41707: [SPARK-44151][BUILD] Upgrade `commons-codec` to 1.16.0
dongjoon-hyun closed pull request #41707: [SPARK-44151][BUILD] Upgrade `commons-codec` to 1.16.0
URL: https://github.com/apache/spark/pull/41707
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab
dongjoon-hyun commented on code in PR #41709:
URL: https://github.com/apache/spark/pull/41709#discussion_r1240040307

## core/src/main/scala/org/apache/spark/util/Utils.scala:

```
@@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with SparkClassUtils {
     }.map(threadInfoToThreadStackTrace)
   }
 
+  /** Return a heap dump. Used to capture dumps for the web UI */
+  def getHeapHistogram(): Array[String] = {
+    val pid = String.valueOf(ProcessHandle.current().pid())
+    val builder = new ProcessBuilder("jmap", "-histo:live", pid)
```

Review Comment: @pan3793. Here, the contract on `jmap` is just one of the available binaries, not necessarily the same JDK's `jmap`. However, I'm interested in this case. Could you make a small reproducer with a Docker image?

> Spark executor process respects JAVA_HOME to find $JAVA_HOME/bin/java, but the subprocess cannot find jmap even though it knows where JAVA_HOME is.
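One way to address the `JAVA_HOME` concern raised in the quoted question (a sketch, not the PR's code): resolve `jmap` relative to the running JVM's home directory instead of relying on `PATH` lookup. On JDK 9+, the `java.home` system property points at the JDK running the process, and `jmap` ships in its `bin/` directory.

```scala
import java.nio.file.Paths

// Locate jmap next to the JVM running this process rather than via PATH.
val javaHome = System.getProperty("java.home")
val jmap = Paths.get(javaHome, "bin", "jmap").toString
val pid = ProcessHandle.current().pid().toString
val builder = new ProcessBuilder(jmap, "-histo:live", pid)
```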
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41700: [SPARK-44139][SQL] Discard completely pushed down filters in group-based MERGE operations
dongjoon-hyun commented on code in PR #41700:
URL: https://github.com/apache/spark/pull/41700#discussion_r1240037361

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/GroupBasedRowLevelOperationScanPlanning.scala:

```
@@ -56,29 +62,67 @@ object GroupBasedRowLevelOperationScanPlanning extends Rule[LogicalPlan] with Pr
         s"""
            |Pushing operators to ${relation.name}
            |Pushed filters: $pushedFiltersStr
-           |Filters that were not pushed: ${remainingFilters.mkString(", ")}
+           |Filters evaluated on data source side: ${evaluatedFilters.mkString(", ")}
+           |Filters evaluated on Spark side: ${postScanFilters.mkString(", ")}
            |Output: ${output.mkString(", ")}
          """.stripMargin)
 
-      // replace DataSourceV2Relation with DataSourceV2ScanRelation for the row operation table
-      // there may be multiple read relations for UPDATEs that are rewritten as UNION
-      rd transform {
+      rd transformDown {
+        // simplify the join condition in MERGE operations by discarding already evaluated filters
+        case j @ Join(
+            PhysicalOperation(_, _, r: DataSourceV2Relation), _, _, Some(cond), _)
+            if rd.operation.command == MERGE && evaluatedFilters.nonEmpty && r.table.eq(table) =>
+          j.copy(condition = Some(optimizeMergeJoinCondition(cond, evaluatedFilters)))
+
+        // replace DataSourceV2Relation with DataSourceV2ScanRelation for the row operation table
+        // there may be multiple read relations for UPDATEs that are rewritten as UNION
         case r: DataSourceV2Relation if r.table eq table =>
          DataSourceV2ScanRelation(r, scan, PushDownUtils.toOutputAttrs(scan.readSchema(), r))
      }
  }

+  // pushes down the operation condition and returns the following information:
+  // - pushed down filters
+  // - filter expressions that are fully evaluated on the data source side
+  //   (such filters can be discarded and don't have to be evaluated again on the Spark side)
+  // - post-scan filter expressions that must be evaluated on the Spark side
+  //   (such filters can overlap with pushed down filters, e.g. Parquet row group filtering)
   private def pushFilters(
       cond: Expression,
       tableAttrs: Seq[AttributeReference],
-      scanBuilder: ScanBuilder): (Either[Seq[Filter], Seq[V2Filter]], Seq[Expression]) = {
+      scanBuilder: ScanBuilder)
+      : (Either[Seq[Filter], Seq[V2Filter]], Seq[Expression], Seq[Expression]) = {
```

Review Comment: In that case, since it's orthogonal to the main features, let's consider it later separately. For now, we don't need to change it.
[GitHub] [spark] aokolnychyi commented on a diff in pull request #41700: [SPARK-44139][SQL] Discard completely pushed down filters in group-based MERGE operations
aokolnychyi commented on code in PR #41700: URL: https://github.com/apache/spark/pull/41700#discussion_r1240020952 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/GroupBasedRowLevelOperationScanPlanning.scala: ## @@ -56,29 +62,67 @@ object GroupBasedRowLevelOperationScanPlanning extends Rule[LogicalPlan] with Pr s""" |Pushing operators to ${relation.name} |Pushed filters: $pushedFiltersStr - |Filters that were not pushed: ${remainingFilters.mkString(", ")} + |Filters evaluated on data source side: ${evaluatedFilters.mkString(", ")} + |Filters evaluated on Spark side: ${postScanFilters.mkString(", ")} |Output: ${output.mkString(", ")} """.stripMargin) - // replace DataSourceV2Relation with DataSourceV2ScanRelation for the row operation table - // there may be multiple read relations for UPDATEs that are rewritten as UNION - rd transform { + rd transformDown { +// simplify the join condition in MERGE operations by discarding already evaluated filters +case j @ Join( +PhysicalOperation(_, _, r: DataSourceV2Relation), _, _, Some(cond), _) +if rd.operation.command == MERGE && evaluatedFilters.nonEmpty && r.table.eq(table) => + j.copy(condition = Some(optimizeMergeJoinCondition(cond, evaluatedFilters))) + +// replace DataSourceV2Relation with DataSourceV2ScanRelation for the row operation table +// there may be multiple read relations for UPDATEs that are rewritten as UNION case r: DataSourceV2Relation if r.table eq table => DataSourceV2ScanRelation(r, scan, PushDownUtils.toOutputAttrs(scan.readSchema(), r)) } } + // pushes down the operation condition and returns the following information: + // - pushed down filters + // - filter expressions that are fully evaluated on the data source side + // (such filters can be discarded and don't have to be evaluated again on the Spark side) + // - post-scan filter expressions that must be evaluated on the Spark side + // (such filters can overlap with pushed down filters, e.g. Parquet row group filtering) private def pushFilters( cond: Expression, tableAttrs: Seq[AttributeReference], - scanBuilder: ScanBuilder): (Either[Seq[Filter], Seq[V2Filter]], Seq[Expression]) = { + scanBuilder: ScanBuilder) + : (Either[Seq[Filter], Seq[V2Filter]], Seq[Expression], Seq[Expression]) = { Review Comment: I was debating something like this: ``` type PushedFilterInfo = (Either[Seq[Filter], Seq[V2Filter]], Seq[Expression], Seq[Expression]) ``` ``` private def pushFilters( cond: Expression, tableAttrs: Seq[AttributeReference], scanBuilder: ScanBuilder): PushedFilterInfo = { ... } ``` But then decided it wasn't worth it unless you think it makes the code more readable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] aokolnychyi commented on pull request #41700: [SPARK-44139][SQL] Discard completely pushed down filters in group-based MERGE operations
aokolnychyi commented on PR #41700: URL: https://github.com/apache/spark/pull/41700#issuecomment-1604524387 Thank you, @dongjoon-hyun! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on a diff in pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM
gatorsmile commented on code in PR #41711: URL: https://github.com/apache/spark/pull/41711#discussion_r1239940518 ## dev/error_message_refiner.py: ## @@ -0,0 +1,219 @@ +#!/usr/bin/env python3 + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +Utility for refining error messages based on LLM. + +Usage: +python error_message_refiner.py [--gpt_version=] + +Arguments: + Required. +The name of the error class to refine the messages for. +The list of error classes is located in +`core/src/main/resources/error/error-classes.json`. + +Options: +--gpt_version= Optional. +The version of Chat GPT to use for refining the error messages. +If not provided, the default version("gpt-3.5-turbo") will be used. + +Example usage: +python error_message_refiner.py CANNOT_DECODE_URL --gpt_version=gpt-4 + +Description: +This script refines error messages using the LLM based approach. +It takes the name of the error class as a required argument and, optionally, +allows specifying the version of Chat GPT to use for refining the messages. + +Options: +--gpt_version: Specifies the version of Chat GPT. + If not provided, the default version("gpt-3.5-turbo") will be used. + +Note: +- Ensure that the necessary dependencies are installed before running the script. +- Ensure that the valid API key is entered in the `api-key.txt`. +- The refined error messages will be displayed in the console output. +- To use the gpt-4 model, you need to join the waitlist. Please refer to + https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4 for more details. +""" + +import argparse +import json +import openai +import re +import subprocess +import random +from typing import Tuple, Optional +from sparktestsupport import SPARK_HOME + +PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/core/src/main/resources/error/error-classes.json" +PATH_TO_API_KEY = f"{SPARK_HOME}/dev/api_key.txt" + +# You can obtain an API key from https://platform.openai.com/account/api-keys +openai.api_key = open(PATH_TO_API_KEY).read().rstrip("\n") + + +def _git_grep_files(search_string: str, exclude: str = None) -> str: Review Comment: How about asking LLM for generating the comments ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
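For context on the helper under review, here is a rough sketch of what a `git grep` wrapper with this signature might look like. This is an illustrative guess, not the actual implementation from the PR, and the exclusion semantics are an assumption.

```python
# Hypothetical sketch of a helper matching the reviewed signature; the real
# implementation in the PR may differ.
import subprocess
from typing import Optional


def _git_grep_files(search_string: str, exclude: Optional[str] = None) -> str:
    # List files whose contents contain `search_string`.
    result = subprocess.run(
        ["git", "grep", "-l", search_string],
        capture_output=True,
        text=True,
    )
    files = result.stdout.splitlines()
    # Optionally drop files whose path contains the `exclude` fragment
    # (assumed semantics).
    if exclude is not None:
        files = [f for f in files if exclude not in f]
    return "\n".join(files)
```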
[GitHub] [spark] steven-aerts commented on pull request #41688: [WIP][SPARK-44132][SQL][TESTS] nesting full outer join confuses codegen
steven-aerts commented on PR #41688: URL: https://github.com/apache/spark/pull/41688#issuecomment-1604244490 Eclipsed by #41712. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] steven-aerts closed pull request #41688: [WIP][SPARK-44132][SQL][TESTS] nesting full outer join confuses codegen
steven-aerts closed pull request #41688: [WIP][SPARK-44132][SQL][TESTS] nesting full outer join confuses codegen URL: https://github.com/apache/spark/pull/41688 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] steven-aerts opened a new pull request, #41712: [SPARK-44132][SQL] join using Stream of column name fails codegen
steven-aerts opened a new pull request, #41712: URL: https://github.com/apache/spark/pull/41712 When nesting multiple joins using column names that are a `Stream`, the lazily evaluated stream can sometimes generate faulty codegen data, resulting in an NPE or bad access of the data. ### What changes were proposed in this pull request? By materializing the column names as an `IndexedSeq` before passing them to the lower layers, we prevent generating the wrong code which accesses the wrong data. ### Why are the changes needed? Otherwise the code will crash; see the 2 added test cases, which show an NPE and a bad `UnsafeRow` access. ### Does this PR introduce _any_ user-facing change? No, only a bug fix. ### How was this patch tested? A reproduction scenario was created and added to the code base. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
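A minimal sketch of the failure mode and the proposed fix, assuming three DataFrames `df1`/`df2`/`df3` that share an `id` column; the exact reproduction in the PR's test cases may differ.

```scala
// Sketch only: column names supplied as a lazily evaluated Stream.
val cols: Seq[String] = Stream("id")

// Nesting joins over the lazy Stream is what could trigger the faulty codegen.
val joined = df1.join(df2, cols).join(df3, cols)

// The fix materializes the names eagerly before they reach the lower layers;
// conceptually, from the caller's side, it is equivalent to:
val safe = df1.join(df2, cols.toIndexedSeq).join(df3, cols.toIndexedSeq)
```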
[GitHub] [spark] 0xdarkman commented on pull request #36374: [SPARK-39006][K8S] Show a directional error message for executor PVC dynamic allocation failure
0xdarkman commented on PR #36374: URL: https://github.com/apache/spark/pull/36374#issuecomment-1604226893 I am running version 3.4.1 and it looks good. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM
itholic commented on PR #41711: URL: https://github.com/apache/spark/pull/41711#issuecomment-1604003249 BTW, I set the category as `SQL` because it works based on the SQL error class, but I'm not sure if this is correct. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM
itholic commented on PR #41711: URL: https://github.com/apache/spark/pull/41711#issuecomment-1603986988 cc @gengliangwang @allisonwang-db @MaxGekk @cloud-fan could you take a look at this PR when you find some time? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic opened a new pull request, #41711: [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM
itholic opened a new pull request, #41711: URL: https://github.com/apache/spark/pull/41711 ### What changes were proposed in this pull request? This PR proposes adding a utility script that can help increase productivity in error message improvement tasks. When passing the error class name to the script, detailed information(e.g. error message, source code of the function where the error actually occurs) corresponding to the error class is sent to ChatGPT to request improvement of the error message. **Usage example:** ``` haejoon.lee@spark dev % python error_message_refiner.py AMBIGUOUS_LATERAL_COLUMN_ALIAS [AMBIGUOUS_LATERAL_COLUMN_ALIAS] Before: Lateral column alias is ambiguous and has matches. After: Lateral column alias is ambiguous and has matches. Please ensure that the referenced lateral column alias is uniquely identified, or provide a more specific alias in your query. ``` Developers only need to review the printed Before/After and apply it to the actual code if the content is appropriate. (Of course, extra refining should be done if necessary) ### Why are the changes needed? To enhance productivity in error message improvement tasks. ### Does this PR introduce _any_ user-facing change? No, this is an effort to improve productivity for developers. ### How was this patch tested? Manually tested. https://github.com/apache/spark/pull/41504 is an actual example that demonstrates the utilization of this script. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
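As a rough illustration of the approach (not the script's actual code), the refinement step might boil down to a single chat-completion call; the prompt wording and function name below are assumptions, using the pre-1.0 `openai` API that the script imports.

```python
# Illustrative sketch of the refinement call; the prompt and helper name
# are hypothetical, not taken from the PR.
import openai


def refine_message(error_class: str, message: str,
                   gpt_version: str = "gpt-3.5-turbo") -> str:
    prompt = (
        f"Improve the following Spark error message for error class "
        f"{error_class} so it is clear and actionable:\n{message}"
    )
    response = openai.ChatCompletion.create(
        model=gpt_version,
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```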
[GitHub] [spark] pan3793 commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab
pan3793 commented on code in PR #41709: URL: https://github.com/apache/spark/pull/41709#discussion_r1239566530 ## core/src/main/scala/org/apache/spark/util/Utils.scala: ## @@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with SparkClassUtils { }.map(threadInfoToThreadStackTrace) } + /** Return a heap dump. Used to capture dumps for the web UI */ + def getHeapHistogram(): Array[String] = { +val pid = String.valueOf(ProcessHandle.current().pid()) +val builder = new ProcessBuilder("jmap", "-histo:live", pid) Review Comment: I suppose `$JAVA_HOME/bin/jmap` (in the above case, `/opt/openjdk-11/bin/jmap` instead of `/opt/openjdk-8/bin/jmap`) should be used, instead of the one first present in `PATH`. And if only `JAVA_HOME` is set but there is no `jmap` in `PATH`, the invocation will fail. e.g. the JDK was installed from a TGZ and unarchived to `/opt/openjdk-11` without being exposed in `PATH` or symlinked into `/usr/bin`; the Spark executor process respects `JAVA_HOME` to find `$JAVA_HOME/bin/java`, but the subprocess cannot find `jmap` even though it knows where `JAVA_HOME` is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
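A sketch of the resolution order being suggested here, i.e. prefer `$JAVA_HOME/bin/jmap` and only fall back to a bare `PATH` lookup; this is an assumed fix, not the code from the PR.

```scala
// Sketch: resolve jmap relative to JAVA_HOME when possible, otherwise
// fall back to whatever the PATH lookup finds.
import java.io.File

def jmapCommand(): String =
  sys.env.get("JAVA_HOME")
    .map(home => new File(home, "bin/jmap"))
    .filter(_.canExecute)
    .map(_.getAbsolutePath)
    .getOrElse("jmap") // fall back to PATH resolution

val pid = ProcessHandle.current().pid().toString
val builder = new ProcessBuilder(jmapCommand(), "-histo:live", pid)
```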
[GitHub] [spark] zzzzming95 commented on a diff in pull request #41154: [SPARK-43327][CORE][3.3] Trigger `committer.setupJob` before plan execute in `FileFormatWriter#write`
zzzzming95 commented on code in PR #41154: URL: https://github.com/apache/spark/pull/41154#discussion_r1239560157 ## sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala: ## @@ -1275,4 +1275,27 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSparkSession with } } } + + test("SPARK-43327: location exists when insertoverwrite fails") { +withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") { + withTable("t", "t1") { +sql("create table t(c1 int) using parquet") +sql("create table t1(c2 long) using parquet") +sql("INSERT OVERWRITE TABLE t1 select 644164") + +// spark.sql("CREATE TABLE IF NOT EXISTS t(amt1 int) using ORC") Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab
dongjoon-hyun commented on code in PR #41709: URL: https://github.com/apache/spark/pull/41709#discussion_r1239558578 ## core/src/main/scala/org/apache/spark/util/Utils.scala: ## @@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with SparkClassUtils { }.map(threadInfoToThreadStackTrace) } + /** Return a heap dump. Used to capture dumps for the web UI */ + def getHeapHistogram(): Array[String] = { +val pid = String.valueOf(ProcessHandle.current().pid()) +val builder = new ProcessBuilder("jmap", "-histo:live", pid) Review Comment: Like you pointed out, there is no problem running `jmap`, is there, @pan3793? IIRC, `jmap` is just a CLI command without any format change, especially if we are talking about JDKs here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on pull request #41677: [WIP][SPARK-35564][SQL] Improve subexpression elimination
peter-toth commented on PR #41677: URL: https://github.com/apache/spark/pull/41677#issuecomment-1603934838 The failure in `[Run / Linters, licenses, dependencies and documentation generation]` seems unrelated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab
pan3793 commented on code in PR #41709: URL: https://github.com/apache/spark/pull/41709#discussion_r1239521669 ## core/src/main/scala/org/apache/spark/util/Utils.scala: ## @@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with SparkClassUtils { }.map(threadInfoToThreadStackTrace) } + /** Return a heap dump. Used to capture dumps for the web UI */ + def getHeapHistogram(): Array[String] = { +val pid = String.valueOf(ProcessHandle.current().pid()) +val builder = new ProcessBuilder("jmap", "-histo:live", pid) Review Comment: @dongjoon-hyun I mean where to find `jmap`, not environment variable propagation. Consider the following case: multiple JDKs are installed on the machine, let's say we have 8 and 11 installed at ``` /opt/openjdk-8 /opt/openjdk-11 ``` and the JDK 8 commands are added to `PATH`, i.e. `PATH=/opt/openjdk-8/bin:$PATH`. If the Spark executor process runs with `JAVA_HOME=/opt/openjdk-11`, then with an invocation like ``` new ProcessBuilder("jmap", ...) ``` the sub-process knows `JAVA_HOME` but still resolves `jmap` from `PATH`, so `/opt/openjdk-8/bin/jmap` will be used. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab
dongjoon-hyun commented on PR #41709: URL: https://github.com/apache/spark/pull/41709#issuecomment-1603874002 Merged to master for Apache Spark 3.5.0. Thank you, @viirya and @pan3793 . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab
dongjoon-hyun closed pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab URL: https://github.com/apache/spark/pull/41709 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] EnricoMi commented on pull request #40122: [SPARK-42349][PYTHON] Support pandas cogroup with multiple df
EnricoMi commented on PR #40122: URL: https://github.com/apache/spark/pull/40122#issuecomment-1603851057 @HyukjinKwon @cloud-fan @sunchao @zhengruifeng this is so sad. Can we get some feedback on whether this contribution is welcome, or whether any more effort is just wasted? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #41710: [SPARK-43255][SQL] Replace the error class _LEGACY_ERROR_TEMP_2020 by an internal error
amaliujia commented on PR #41710: URL: https://github.com/apache/spark/pull/41710#issuecomment-1603761304 I did a search, and it looks like `constructorNotFoundError` may fit the notion of an internal error: it is thrown when a constructor is not found during a reflection-related operation, which is ultimately hard to trigger through a user-facing API. So this change is fine to me, but I'd prefer @MaxGekk to take another look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
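For reference, the pattern under discussion would look roughly like the following; `SparkException.internalError` is the usual helper for this in recent Spark, but the call site and message here are assumptions, not the PR's exact diff.

```scala
// Sketch (assumed shape): surface a missing-constructor failure as an
// internal error instead of the legacy error class.
import org.apache.spark.SparkException

def constructorNotFoundError(cls: String): SparkException =
  SparkException.internalError(
    s"Couldn't find a valid constructor on class `$cls`.")
```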
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab
dongjoon-hyun commented on code in PR #41709: URL: https://github.com/apache/spark/pull/41709#discussion_r1239373825 ## core/src/main/scala/org/apache/spark/util/Utils.scala: ## @@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with SparkClassUtils { }.map(threadInfoToThreadStackTrace) } + /** Return a heap dump. Used to capture dumps for the web UI */ + def getHeapHistogram(): Array[String] = { +val pid = String.valueOf(ProcessHandle.current().pid()) +val builder = new ProcessBuilder("jmap", "-histo:live", pid) Review Comment: And, the child process inherits the current process's environment, @pan3793 . - https://docs.oracle.com/javase/8/docs/api/java/lang/ProcessBuilder.html > an environment, which is a system-dependent mapping from variables to values. The initial value is a copy of the environment of the current process (see [System.getenv()](https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#getenv--)). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
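A quick way to see this inheritance in action on a POSIX system (a throwaway sketch, not project code):

```scala
// Sketch: the child process sees the parent's environment by default.
val pb = new ProcessBuilder("printenv", "JAVA_HOME")
pb.inheritIO()       // send the child's stdout to our console
pb.start().waitFor() // prints the parent's JAVA_HOME, since the env is copied
```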
[GitHub] [spark] HyukjinKwon closed pull request #41708: [SPARK-44135][PYTHON][CONNECT][DOCS] Document Spark Connect only API in PySpark
HyukjinKwon closed pull request #41708: [SPARK-44135][PYTHON][CONNECT][DOCS] Document Spark Connect only API in PySpark URL: https://github.com/apache/spark/pull/41708 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #41708: [SPARK-44135][PYTHON][CONNECT][DOCS] Document Spark Connect only API in PySpark
HyukjinKwon commented on PR #41708: URL: https://github.com/apache/spark/pull/41708#issuecomment-1603750548 Merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab
dongjoon-hyun commented on code in PR #41709: URL: https://github.com/apache/spark/pull/41709#discussion_r1239373194 ## core/src/main/scala/org/apache/spark/util/Utils.scala: ## @@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with SparkClassUtils { }.map(threadInfoToThreadStackTrace) } + /** Return a heap dump. Used to capture dumps for the web UI */ + def getHeapHistogram(): Array[String] = { +val pid = String.valueOf(ProcessHandle.current().pid()) +val builder = new ProcessBuilder("jmap", "-histo:live", pid) Review Comment: We are inside `JVM` already. ``` $ docker run -it --rm openjdk:11 bash root@d7d08a90ec41:/# jshell Jun 23, 2023 6:30:14 AM java.util.prefs.FileSystemPreferences$1 run INFO: Created user preferences directory. | Welcome to JShell -- Version 11.0.16 | For an introduction type: /help intro jshell> var p = new ProcessBuilder("jmap", "-histo:live", String.valueOf(ProcessHandle.current().pid())).start() p ==> Process[pid=77, exitValue="not exited"] jshell> var r = new BufferedReader(new InputStreamReader(p.getInputStream())) r ==> java.io.BufferedReader@5fa7e7ff jshell> var line = "" line ==> "" jshell> while (line != null) { line = r.readLine(); System.out.println(line); } num #instances #bytes class name (module) --- 1: 8938 412040 [B (java.base@11.0.16) 2: 8705 208920 java.lang.String (java.base@11.0.16) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on a diff in pull request #41709: [SPARK-44153][CORE][UI] Support `Heap Histogram` column in `Executors` tab
pan3793 commented on code in PR #41709: URL: https://github.com/apache/spark/pull/41709#discussion_r1239357126 ## core/src/main/scala/org/apache/spark/util/Utils.scala: ## @@ -2287,6 +2287,22 @@ private[spark] object Utils extends Logging with SparkClassUtils { }.map(threadInfoToThreadStackTrace) } + /** Return a heap dump. Used to capture dumps for the web UI */ + def getHeapHistogram(): Array[String] = { +val pid = String.valueOf(ProcessHandle.current().pid()) +val builder = new ProcessBuilder("jmap", "-histo:live", pid) Review Comment: should it respect `JAVA_HOME` first, and only then fall back to `PATH`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org