Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on code in PR #43814: URL: https://github.com/apache/spark/pull/43814#discussion_r1394824967 ## docs/running-on-kubernetes.md: ## @@ -1203,17 +1203,17 @@ See the [configuration page](configuration.html) for information on Spark config 3.0.0 - memoryOverheadFactor + spark.kubernetes.memoryOverheadFactor 0.1 -This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs. +This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when spark.kubernetes.local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs. This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with "Memory Overhead Exceeded" errors. This preempts this error with a higher default. This will be overridden by the value set by spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor explicitly. Review Comment: ~This is `Spark Standalone` documentation, @bjornjorgensen ~ ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
bjornjorgensen commented on code in PR #43814: URL: https://github.com/apache/spark/pull/43814#discussion_r1394823713 ## docs/running-on-kubernetes.md: ## @@ -1203,17 +1203,17 @@ See the [configuration page](configuration.html) for information on Spark config 3.0.0 - memoryOverheadFactor + spark.kubernetes.memoryOverheadFactor 0.1 -This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs. +This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when spark.kubernetes.local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs. This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with "Memory Overhead Exceeded" errors. This preempts this error with a higher default. This will be overridden by the value set by spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor explicitly. Review Comment: should this spark.executor.memoryOverheadFactor be spark.kubernetes.executor.memoryOverheadFactor? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
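For readers following this thread, the interaction described in the documentation text above can be illustrated with a small sketch. This is only an illustration of the documented precedence (an explicitly set `spark.executor.memoryOverheadFactor` or `spark.driver.memoryOverheadFactor` overrides `spark.kubernetes.memoryOverheadFactor`); the application name and values below are made up.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only. Per the documentation text quoted above,
// spark.executor.memoryOverheadFactor overrides the Kubernetes-specific
// spark.kubernetes.memoryOverheadFactor when both are set explicitly.
val spark = SparkSession.builder()
  .appName("overhead-factor-example")
  .config("spark.kubernetes.memoryOverheadFactor", "0.1") // legacy K8s-specific factor
  .config("spark.executor.memoryOverheadFactor", "0.4")   // takes precedence for executors
  .getOrCreate()
```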
Re: [PR] [SPARK-45941][PS] Upgrade `pandas` to version 2.1.3 [spark]
dongjoon-hyun closed pull request #43822: [SPARK-45941][PS] Upgrade `pandas` to version 2.1.3 URL: https://github.com/apache/spark/pull/43822 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45941][PS] Upgrade `pandas` to version 2.1.3 [spark]
dongjoon-hyun commented on PR #43822: URL: https://github.com/apache/spark/pull/43822#issuecomment-1813276250 Ya, it looks like that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45941][PS] Upgrade `pandas` to version 2.1.3 [spark]
bjornjorgensen commented on PR #43822: URL: https://github.com/apache/spark/pull/43822#issuecomment-1813274288 https://github.com/bjornjorgensen/spark/actions/runs/6881177899 I pushed this 3 hours ago and waited until all the tests passed before I opened this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45941][PS] Upgrade `pandas` to version 2.1.3 [spark]
dongjoon-hyun commented on PR #43822: URL: https://github.com/apache/spark/pull/43822#issuecomment-1813270978 BTW, your image looks like an old one (3 hours ago). This PR was created 18 minutes ago, wasn't it? https://github.com/apache/spark/assets/9700541/a736d2d5-0885-4d91-8beb-211d97674f55 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45941][PS] Upgrade `pandas` to version 2.1.3 [spark]
dongjoon-hyun commented on PR #43822: URL: https://github.com/apache/spark/pull/43822#issuecomment-1813269062 You can provide the link here, @bjornjorgensen ~ That would be enough if the CI is running successfully on your side. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45941][PS] Upgrade `pandas` to version 2.1.3 [spark]
bjornjorgensen commented on PR #43822: URL: https://github.com/apache/spark/pull/43822#issuecomment-1813263774 It seems to be happy ![image](https://github.com/apache/spark/assets/47577197/2b9a1588-2337-402b-84c0-4337078d8b20) Is there anything I can do with this, then? ![image](https://github.com/apache/spark/assets/47577197/cd05d991-c212-4508-9e93-59b708397b47) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on PR #43814: URL: https://github.com/apache/spark/pull/43814#issuecomment-1813263099 Could you review this PR, @bjornjorgensen ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
tgravescs commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1384058605 ## core/src/main/scala/org/apache/spark/scheduler/ExecutorResourcesAmounts.scala: ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable.HashMap + +import org.apache.spark.SparkException +import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT +import org.apache.spark.resource.ResourceProfile + +/** + * Class to hold information about a series of resources belonging to an executor. + * A resource could be a GPU, FPGA, etc. And it is used as a temporary + * class to calculate the resources amounts when offering resources to + * the tasks in the [[TaskSchedulerImpl]] + * + * One example is GPUs, where the addresses would be the indices of the GPUs + * + * @param resources The executor available resources and amount. eg, + * Map("gpu" -> Map("0" -> 0.2*RESOURCE_TOTAL_AMOUNT, + * "1" -> 1.0*RESOURCE_TOTAL_AMOUNT), + * "fpga" -> Map("a" -> 0.3*RESOURCE_TOTAL_AMOUNT, + *"b" -> 0.9*RESOURCE_TOTAL_AMOUNT) + * ) + */ +private[spark] class ExecutorResourcesAmounts( +private val resources: Map[String, Map[String, Long]]) extends Serializable { + + /** + * convert the resources to be mutable HashMap + */ + private val internalResources: Map[String, HashMap[String, Long]] = { +resources.map { case (rName, addressAmounts) => + rName -> HashMap(addressAmounts.toSeq: _*) +} + } + + /** + * The total address count of each resource. Eg, + * Map("gpu" -> Map("0" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "1" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "2" -> 0.5 * RESOURCE_TOTAL_AMOUNT), + * "fpga" -> Map("a" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "b" -> 0.5 * RESOURCE_TOTAL_AMOUNT)) + * the resourceAmount will be Map("gpu" -> 3, "fpga" -> 2) + */ + lazy val resourceAmount: Map[String, Int] = internalResources.map { case (rName, addressMap) => +rName -> addressMap.size + } + + /** + * For testing purpose. convert internal resources back to the "fraction" resources. 
+ */ + private[spark] def availableResources: Map[String, Map[String, Double]] = { +internalResources.map { case (rName, addressMap) => + rName -> addressMap.map { case (address, amount) => +address -> amount.toDouble / RESOURCE_TOTAL_AMOUNT + }.toMap +} + } + + /** + * Acquire the resource and update the resource + * @param assignedResource the assigned resource information + */ + def acquire(assignedResource: Map[String, Map[String, Long]]): Unit = { +assignedResource.foreach { case (rName, taskResAmounts) => + val availableResourceAmounts = internalResources.getOrElse(rName, +throw new SparkException(s"Try to acquire an address from $rName that doesn't exist")) + taskResAmounts.foreach { case (address, amount) => +val prevInternalTotalAmount = availableResourceAmounts.getOrElse(address, + throw new SparkException(s"Try to acquire an address that doesn't exist. $rName " + +s"address $address doesn't exist.")) + +val left = prevInternalTotalAmount - amount +if (left < 0) { + throw new SparkException(s"The total amount ${left.toDouble / RESOURCE_TOTAL_AMOUNT} " + +s"after acquiring $rName address $address should be >= 0") +} +internalResources(rName)(address) = left + } +} + } + + /** + * Release the assigned resources to the resource pool + * @param assignedResource resource to be released + */ + def release(assignedResource: Map[String, Map[String, Long]]): Unit = { +assignedResource.foreach { case (rName, taskResAmounts) => + val availableResourceAmounts = internalResources.getOrElse(rName, +throw new SparkException(s"Try to release an address from $rName that doesn't exist")) + taskResAmounts.foreach { case (address, amount) => +val prevInternalTotalAmount =
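The diff above is truncated by the mailing list, but the core idea, tracking fractional per-address resource amounts as scaled `Long` values, can be sketched independently. This is a standalone illustration, not the PR's code: the scale constant below is an assumption for the example; the real constant is `RESOURCE_TOTAL_AMOUNT` in `ResourceAmountUtils` and its actual value is not shown in this thread.

```scala
// Standalone sketch of the fixed-point accounting used by ExecutorResourcesAmounts.
// The scale factor is assumed for illustration only, not Spark's actual constant.
object FractionalResourceSketch {
  val RESOURCE_TOTAL_AMOUNT: Long = 10000000000000000L

  def toInternal(fraction: Double): Long = (fraction * RESOURCE_TOTAL_AMOUNT).toLong
  def toFraction(amount: Long): Double = amount.toDouble / RESOURCE_TOTAL_AMOUNT

  def main(args: Array[String]): Unit = {
    var gpuAddr0 = toInternal(1.0)   // one whole GPU address is available
    val taskShare = toInternal(0.25) // a task requests a quarter of that address

    gpuAddr0 -= taskShare            // acquire: the remainder must stay >= 0
    println(toFraction(gpuAddr0))    // 0.75

    gpuAddr0 += taskShare            // release: must not exceed the whole address
    println(toFraction(gpuAddr0))    // 1.0
  }
}
```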
Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]
dongjoon-hyun commented on PR #41072: URL: https://github.com/apache/spark/pull/41072#issuecomment-1813257544 Could you fix the compilation of your PRs, @thepinetree ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45941][PS] Upgrade `pandas` to version 2.1.3 [spark]
dongjoon-hyun commented on PR #43822: URL: https://github.com/apache/spark/pull/43822#issuecomment-1813254752 Thanks. Could you make CI happy, @bjornjorgensen ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45941][PS] Update `pandas` to version 2.1.3 [spark]
bjornjorgensen opened a new pull request, #43822: URL: https://github.com/apache/spark/pull/43822 ### What changes were proposed in this pull request? Update pandas from 2.1.2 to 2.1.3 ### Why are the changes needed? Fixed infinite recursion from operations that return a new object on some DataFrame subclasses ([GH 55763](https://github.com/pandas-dev/pandas/issues/55763)) and Fix [read_parquet()](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html#pandas.read_parquet) and [read_feather()](https://pandas.pydata.org/docs/reference/api/pandas.read_feather.html#pandas.read_feather) for [CVE-2023-47248](https://www.cve.org/CVERecord?id=CVE-2023-47248) ([GH 55894](https://github.com/pandas-dev/pandas/issues/55894)) [Release notes for 2.1.3](https://pandas.pydata.org/docs/whatsnew/v2.1.3.html) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]
abellina commented on PR #43627: URL: https://github.com/apache/spark/pull/43627#issuecomment-1813246792 Thanks @mridulm, yes, the commits make sense; they bring back the late initialization in the driver. I tested the change. The main difference from your patch @mridulm is that I still had to get the shuffle manager class names using the method we added to the `ShuffleManager` object here https://github.com/apache/spark/pull/43627/files#diff-42a673b8fa5f2b999371dc97a5de7ebd2c2ec19447353d39efb7e8ebc012fe32R592, because the `shuffleManager` is not set yet at this point. @tgravescs fyi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
tgravescs commented on PR #43494: URL: https://github.com/apache/spark/pull/43494#issuecomment-1813235365 > 23/11/13 09:40:22 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (10.51.70.102, executor 0, partition 1, PROCESS_LOCAL, 7823 bytes) taskResourceAssignments Map(gpu -> Map(0 -> 2000)) > Should I change this behavior to align with the original one? Yeah, that log message isn't very readable. For now let's just change it back to something similar to the previous format, e.g. "resource" -> [Addresses], i.e. (gpu -> [0]). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
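A minimal sketch of the log formatting being requested above, rendering the assignments map as `resource -> [addresses]` instead of the nested map. The helper name is made up and this is not code from the PR.

```scala
// Hypothetical helper, illustrative only: renders
// Map("gpu" -> Map("0" -> 2000L)) as "gpu -> [0]" for log messages.
def formatTaskResourceAssignments(assignments: Map[String, Map[String, Long]]): String =
  assignments.map { case (rName, addressAmounts) =>
    s"$rName -> [${addressAmounts.keys.toSeq.sorted.mkString(",")}]"
  }.mkString(", ")

// formatTaskResourceAssignments(Map("gpu" -> Map("0" -> 2000L))) returns "gpu -> [0]"
```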
Re: [PR] [SPARK-45927][PYTHON] Update path handling in Python data source [spark]
allisonwang-db commented on code in PR #43809: URL: https://github.com/apache/spark/pull/43809#discussion_r1394689877 ## sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: ## @@ -246,7 +246,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { val builder = sparkSession.sharedState.dataSourceManager.lookupDataSource(source) // Unless the legacy path option behavior is enabled, the extraOptions here // should not include "path" or "paths" as keys. -val plan = builder(sparkSession, source, paths, userSpecifiedSchema, extraOptions) +// Add path to the options field. Note currently it only supports a single path. +val optionsWithPath = if (paths.isEmpty) { + extraOptions +} else if (paths.length == 1) { +extraOptions + ("path" -> paths.head) +} else { + throw QueryCompilationErrors.multiplePathsUnsupportedError(source, paths) Review Comment: Yes we only apply this check to Python data sources (it's in `loadUserDefinedDataSource). The behavior is indeed different from the v1 source, but I find it more user-friendly to raise an explicit error than silently ignoring multiple paths. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
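As a usage-level illustration of the behavior discussed above (a single path forwarded to the Python data source as the "path" option, multiple paths rejected with an explicit error), here is a hedged sketch; "my_python_source" is a placeholder name, not a source defined in this PR.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("path-option-example").getOrCreate()

// A single path is forwarded to the Python data source as the "path" option.
val df = spark.read.format("my_python_source").load("/tmp/input")

// With the check above, passing several paths would raise an explicit
// multiple-paths-unsupported error rather than silently dropping paths:
// spark.read.format("my_python_source").load("/tmp/a", "/tmp/b")
```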
Re: [PR] [SPARK-45927][PYTHON] Update path handling in Python data source [spark]
allisonwang-db commented on code in PR #43809: URL: https://github.com/apache/spark/pull/43809#discussion_r1394685914 ## python/pyspark/sql/datasource.py: ## @@ -45,30 +45,19 @@ class DataSource(ABC): """ @final -def __init__( -self, -paths: List[str], -userSpecifiedSchema: Optional[StructType], Review Comment: This field is actually not used. Both the `reader` and `writer` functions take in the `schema` parameter, and we can pass in the actual schema there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]
dtenedor commented on code in PR #43682: URL: https://github.com/apache/spark/pull/43682#discussion_r1394630877 ## python/pyspark/sql/tests/test_udtf.py: ## @@ -2482,6 +2533,7 @@ def tearDownClass(cls): super(UDTFTests, cls).tearDownClass() +''' Review Comment: My mistake on this, reverted. ## python/pyspark/sql/tests/test_udtf.py: ## @@ -2775,7 +2827,7 @@ def tearDownClass(cls): cls.spark.conf.unset("spark.sql.execution.pythonUDTF.arrow.enabled") finally: super(UDTFArrowTests, cls).tearDownClass() - +''' Review Comment: My mistake on this, reverted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43393][SQL][3.5] Address sequence expression overflow bug. [spark]
dongjoon-hyun commented on PR #43820: URL: https://github.com/apache/spark/pull/43820#issuecomment-1813034968 Could you fix the compilation? ``` [error] /home/runner/work/spark/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala:787:27: not found: value toSQLId [error] "functionName" -> toSQLId("sequence"), [error] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43393][SQL][3.3] Address sequence expression overflow bug. [spark]
dongjoon-hyun commented on PR #43821: URL: https://github.com/apache/spark/pull/43821#issuecomment-1813032065 Is GitHub Action triggered on this PR, @thepinetree ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]
dongjoon-hyun commented on PR #41072: URL: https://github.com/apache/spark/pull/41072#issuecomment-1813024170 Thank you so much! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]
thepinetree commented on PR #41072: URL: https://github.com/apache/spark/pull/41072#issuecomment-1813023418 @dongjoon-hyun @cloud-fan Backport PRs: * 3.3: https://github.com/apache/spark/pull/43821 * 3.4: https://github.com/apache/spark/pull/43819 * 3.5: https://github.com/apache/spark/pull/43820 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-43393][SQL][3.3] Address sequence expression overflow bug. [spark]
thepinetree opened a new pull request, #43821: URL: https://github.com/apache/spark/pull/43821 ### What changes were proposed in this pull request? Spark has a (long-standing) overflow bug in the `sequence` expression. Consider the following operations: ``` spark.sql("CREATE TABLE foo (l LONG);") spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});") spark.sql("SELECT sequence(0, l) FROM foo;").collect() ``` The result of these operations will be: ``` Array[org.apache.spark.sql.Row] = Array([WrappedArray()]) ``` an unintended consequence of overflow. The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1` which uses a length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary looks like: ``` 0111 - -- 0111 / 0001 -- 0111 + 0001 -- 1000 ``` The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes as the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The following cast to `toInt` uses this representation and [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457) resulting in an empty length of `0`. Other overflows are similarly problematic. This PR addresses the issue by checking numeric operations in the length computation for overflow. ### Why are the changes needed? There is a correctness bug from overflow in the `sequence` expression. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests added in `CollectionExpressionsSuite.scala`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
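The overflow described in the PR body above can be reproduced outside Spark with plain `Long` arithmetic. This is only a worked illustration of the bug and of the checked-arithmetic idea behind the fix, not the PR's actual code.

```scala
// Worked illustration of the sequence length computation overflowing (not Spark code).
val start = 0L
val stop  = Long.MaxValue
val step  = 1L

val len = (stop - start) / step + 1
println(len)        // -9223372036854775808, i.e. Long.MinValue after wrap-around
println(len.toInt)  // 0: the upper bits are truncated, so the sequence comes back empty

// Checked arithmetic fails loudly instead of wrapping silently:
// java.lang.Math.addExact((stop - start) / step, 1L) throws ArithmeticException here.
```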
[PR] [SPARK-43393][SQL][3.5] Address sequence expression overflow bug. [spark]
thepinetree opened a new pull request, #43820: URL: https://github.com/apache/spark/pull/43820 ### What changes were proposed in this pull request? Spark has a (long-standing) overflow bug in the `sequence` expression. Consider the following operations: ``` spark.sql("CREATE TABLE foo (l LONG);") spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});") spark.sql("SELECT sequence(0, l) FROM foo;").collect() ``` The result of these operations will be: ``` Array[org.apache.spark.sql.Row] = Array([WrappedArray()]) ``` an unintended consequence of overflow. The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1` which uses a length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary looks like: ``` 0111 - -- 0111 / 0001 -- 0111 + 0001 -- 1000 ``` The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes as the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The following cast to `toInt` uses this representation and [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457) resulting in an empty length of `0`. Other overflows are similarly problematic. This PR addresses the issue by checking numeric operations in the length computation for overflow. ### Why are the changes needed? There is a correctness bug from overflow in the `sequence` expression. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests added in `CollectionExpressionsSuite.scala`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-43393][SQL][3.4] Address sequence expression overflow bug. [spark]
thepinetree opened a new pull request, #43819: URL: https://github.com/apache/spark/pull/43819 ### What changes were proposed in this pull request? Spark has a (long-standing) overflow bug in the `sequence` expression. Consider the following operations: ``` spark.sql("CREATE TABLE foo (l LONG);") spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});") spark.sql("SELECT sequence(0, l) FROM foo;").collect() ``` The result of these operations will be: ``` Array[org.apache.spark.sql.Row] = Array([WrappedArray()]) ``` an unintended consequence of overflow. The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1` which uses a length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary looks like: ``` 0111 - -- 0111 / 0001 -- 0111 + 0001 -- 1000 ``` The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes as the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The following cast to `toInt` uses this representation and [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457) resulting in an empty length of `0`. Other overflows are similarly problematic. This PR addresses the issue by checking numeric operations in the length computation for overflow. ### Why are the changes needed? There is a correctness bug from overflow in the `sequence` expression. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests added in `CollectionExpressionsSuite.scala`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on code in PR #43814: URL: https://github.com/apache/spark/pull/43814#discussion_r1394491810 ## docs/running-on-kubernetes.md: ## @@ -590,41 +590,41 @@ Some of these include: See the [configuration page](configuration.html) for information on Spark configurations. The following configurations are specific to Spark on Kubernetes. - Spark Properties + Spark Properties (`spark.kubernetes.` prefix is omitted.) Property NameDefaultMeaningSince Version - spark.kubernetes.context Review Comment: That's true. Let me spin-off `kubernetes` part from this PR, @yaooqinn . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]
dongjoon-hyun commented on PR #41072: URL: https://github.com/apache/spark/pull/41072#issuecomment-1812937385 Oh this seems to break branch-3.5. - https://github.com/apache/spark/actions/runs/6873765275 Let me revert this from branch-3.5. Given the situation, we can start backport from branch-3.5 to branch-3.3 as three separate PRs, @thepinetree . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][DOCS] Regenerate `docs/sql-error-conditions.md` and add `42KDF` to `SQLSTATE table` in `error/README.md` [spark]
dongjoon-hyun closed pull request #43817: [SPARK-45562][DOCS] Regenerate `docs/sql-error-conditions.md` and add `42KDF` to `SQLSTATE table` in `error/README.md` URL: https://github.com/apache/spark/pull/43817 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]
tgravescs commented on PR #43746: URL: https://github.com/apache/spark/pull/43746#issuecomment-1812838085 > > Preemption on yarn shouldn't be going against the number of failed executors. If it is then something has changed and we should fix that. > Yes, you are right What do you mean by this? Are you saying the Spark on YARN handling of preempted containers is not working properly, meaning that if a container is preempted it should not show up as an executor failure? Are you seeing those preempted containers show up as failed? Or are you saying that, yes, Spark on YARN doesn't mark preempted containers as failed? > What does 'this feature' point to? Sorry, I misunderstood your environment here; I thought you were running on k8s but it looks like you are running on YARN. By feature I mean the spark.yarn.max.executor.failures/spark.executor.maxNumFailures config and its functionality. So unless yarn preemption handling is broken (please answer the question above), you gave one very specific use case where a user added a bad JAR; in that use case it seems like you just don't want spark.executor.maxNumFailures enabled at all. You said you don't want the app to fail so admins can come fix things up and not have it affect other users. If that is the case then Spark should allow users to turn spark.executor.maxNumFailures off, or I assume you could do the same thing by setting it to int.maxvalue. As implemented this seems very arbitrary and I would think hard for a normal user to set and use this feature. You have it as a ratio, which normally I'm in favor of, but that really only works if you have max executors set, so it is really just a hardcoded number. That number seems arbitrary as it just depends on whether you get lucky and happen to have that many executors before some user pushes a bad jar. I don't understand why this isn't the same as the minimum number of executors, as that seems more in line: saying you need some minimum number for this application to run and, by the way, it's ok to keep running even when launching new executors is failing. If there are other issues with Spark Connect and add jars, maybe that is a different conversation about isolation (https://issues.apache.org/jira/browse/SPARK-44146). Or maybe it needs to better prevent users from adding jars with the same name. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
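For context on the "turn it off" alternative mentioned above, a hedged sketch of effectively disabling the executor-failure limit by setting it very high. The config name is the one discussed in the thread; whether it should be set in code or via spark-submit --conf, and whether this is the right knob at all, is exactly what the thread is debating.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: effectively disable the executor-failure limit by raising it to
// Int.MaxValue (often passed via spark-submit --conf instead of in code).
val spark = SparkSession.builder()
  .appName("max-failures-example")
  .config("spark.executor.maxNumFailures", Int.MaxValue.toString)
  .getOrCreate()
```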
Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]
tgravescs commented on code in PR #43746: URL: https://github.com/apache/spark/pull/43746#discussion_r1394411937 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -2087,6 +2087,17 @@ package object config { .doubleConf .createOptional + private[spark] val SCHEDULER_MIN_RESOURCES_TO_SURVIVE_RATIO = +ConfigBuilder("spark.scheduler.minResourcesToSurviveRatio") Review Comment: I misread the title, sorry ignore this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45533][CORE] Use j.l.r.Cleaner instead of finalize for RocksDBIterator/LevelDBIterator [spark]
mridulm commented on code in PR #43502: URL: https://github.com/apache/spark/pull/43502#discussion_r1394409363 ## common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDBIterator.java: ## @@ -182,23 +193,34 @@ public boolean skip(long n) { @Override public synchronized void close() throws IOException { -db.notifyIteratorClosed(this); +db.notifyIteratorClosed(it); if (!closed) { - it.close(); - closed = true; - next = null; + try { +it.close(); + } finally { +closed = true; +next = null; +cancelResourceClean(); Review Comment: This was perhaps discussed earlier - please point me to it if I am missing it. Why not simply call ` this.cleanable.clean();` in `close` (in place of notify + it.close) ? ## common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java: ## @@ -322,26 +323,15 @@ public void close() throws IOException { } } - /** - * Closes the given iterator if the DB is still open. Trying to close a JNI LevelDB handle - * with a closed DB can cause JVM crashes, so this ensures that situation does not happen. - */ - void closeIterator(LevelDBIterator it) throws IOException { -notifyIteratorClosed(it); -synchronized (this._db) { - DB _db = this._db.get(); - if (_db != null) { -it.close(); - } -} - } Review Comment: resource cleaner has a reference to both `dbIterator` and `levelDB` - not sure what I am missing ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
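Since this sub-thread is about replacing finalize with java.lang.ref.Cleaner, here is a generic sketch of the pattern under discussion. It is not Spark's LevelDBIterator code and the class names are illustrative; the points it shows are that the registered cleanup action must not hold a reference back to the object being tracked, and that an explicit close() can simply run the same cleanable.

```scala
import java.lang.ref.Cleaner

// Generic Cleaner pattern sketch (illustrative names, not Spark classes).
object CleanerSketch {
  private val cleaner = Cleaner.create()

  // The cleanup action lives in its own class so it cannot accidentally
  // capture the Handle it guards; it only holds the underlying resource.
  private final class CloseAction(resource: AutoCloseable) extends Runnable {
    override def run(): Unit = resource.close()
  }

  final class Handle(resource: AutoCloseable) extends AutoCloseable {
    private val cleanable = cleaner.register(this, new CloseAction(resource))

    // Explicit close and GC-triggered cleanup share the same action;
    // Cleaner guarantees it runs at most once.
    override def close(): Unit = cleanable.clean()
  }
}
```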
Re: [PR] [SPARK-45533][CORE] Use j.l.r.Cleaner instead of finalize for RocksDBIterator/LevelDBIterator [spark]
zhaomin1423 commented on code in PR #43502: URL: https://github.com/apache/spark/pull/43502#discussion_r1394398171 ## common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java: ## @@ -322,26 +323,15 @@ public void close() throws IOException { } } - /** - * Closes the given iterator if the DB is still open. Trying to close a JNI LevelDB handle - * with a closed DB can cause JVM crashes, so this ensures that situation does not happen. - */ - void closeIterator(LevelDBIterator it) throws IOException { -notifyIteratorClosed(it); -synchronized (this._db) { - DB _db = this._db.get(); - if (_db != null) { -it.close(); - } -} - } Review Comment: The main reason is ResourceCleaner can't hold references to the LevelDBIterator, so change it. If you have a better suggestion, please let me know and I will be happy to change it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45533][CORE] Use j.l.r.Cleaner instead of finalize for RocksDBIterator/LevelDBIterator [spark]
mridulm commented on code in PR #43502: URL: https://github.com/apache/spark/pull/43502#discussion_r1394375372 ## common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java: ## @@ -322,26 +323,15 @@ public void close() throws IOException { } } - /** - * Closes the given iterator if the DB is still open. Trying to close a JNI LevelDB handle - * with a closed DB can cause JVM crashes, so this ensures that situation does not happen. - */ - void closeIterator(LevelDBIterator it) throws IOException { -notifyIteratorClosed(it); -synchronized (this._db) { - DB _db = this._db.get(); - if (_db != null) { -it.close(); - } -} - } Review Comment: The logic is consistent - and would remain the same with and without this move - I was wondering why the change ? Whether this was simply refactoring or there is any other reason for it. (Note: I am not advocating for reverting, just want to understand why this was done). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core` module in `module.py` [spark]
LuciferYang commented on code in PR #43818: URL: https://github.com/apache/spark/pull/43818#discussion_r1394365136 ## dev/sparktestsupport/modules.py: ## @@ -178,7 +178,7 @@ def __hash__(self): core = Module( name="core", -dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher], +dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher, utils], Review Comment: cc @zhengruifeng @HyukjinKwon , `utils` module is also a direct dependency of unsafe and network-common. Do we need to modify the dependencies of `unsafe` and `network-common`? Or is the current change sufficient? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45938][INFRA] Add `utils` to the dependency list of the `core` module in `module.py` [spark]
LuciferYang opened a new pull request, #43818: URL: https://github.com/apache/spark/pull/43818 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45747][SS] Use prefix key information in state metadata to handle reading state for session window aggregation [spark]
HeartSaVioR commented on PR #43788: URL: https://github.com/apache/spark/pull/43788#issuecomment-1812633610 @chaoqin-li1123 Could you please rebase your change onto the latest master branch? The merge script is getting confused into thinking I'm the main author due to my commits listed here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45533][CORE] Use j.l.r.Cleaner instead of finalize for RocksDBIterator/LevelDBIterator [spark]
zhaomin1423 commented on code in PR #43502: URL: https://github.com/apache/spark/pull/43502#discussion_r1394277068 ## common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java: ## @@ -322,26 +323,15 @@ public void close() throws IOException { } } - /** - * Closes the given iterator if the DB is still open. Trying to close a JNI LevelDB handle - * with a closed DB can cause JVM crashes, so this ensures that situation does not happen. - */ - void closeIterator(LevelDBIterator it) throws IOException { -notifyIteratorClosed(it); -synchronized (this._db) { - DB _db = this._db.get(); - if (_db != null) { -it.close(); - } -} - } Review Comment: It was originally in finalize method, now it has been moved to ResourceCleaner.run. I think this logic is consistent -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45747][SS] Use prefix key information in state metadata to handle reading state for session window aggregation [spark]
HeartSaVioR commented on PR #43788: URL: https://github.com/apache/spark/pull/43788#issuecomment-1812628737 Thanks! Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][DOCS] Regenerate `docs/sql-error-conditions.md` and add `42KDF` to `SQLSTATE table` in `error/README.md` [spark]
sandip-db commented on PR #43817: URL: https://github.com/apache/spark/pull/43817#issuecomment-1812621901 LGTM. @LuciferYang Thanks for the quick fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]
LuciferYang commented on PR #43796: URL: https://github.com/apache/spark/pull/43796#issuecomment-1812528892 wait https://github.com/apache/spark/pull/43817 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45562][CORE][DOCS] Regenerate `docs/sql-error-conditions.md` and update `SQLSTATE table` in `error/README.md` [spark]
LuciferYang opened a new pull request, #43817: URL: https://github.com/apache/spark/pull/43817 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][CORE][DOCS] Regenerate `docs/sql-error-conditions.md` and add `42KDF` to `SQLSTATE table` in `error/README.md` [spark]
LuciferYang commented on PR #43817: URL: https://github.com/apache/spark/pull/43817#issuecomment-1812524632 cc @HyukjinKwon @beliefer @sandip-db -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][SQL][FOLLOW-UP] XML: Fix SQLSTATE for missing rowTag error [spark]
LuciferYang commented on code in PR #43804: URL: https://github.com/apache/spark/pull/43804#discussion_r1394181483 ## docs/sql-error-conditions.md: ## @@ -2375,9 +2375,3 @@ The operation `` requires a ``. But `` is a The `` requires `` parameters but the actual number is ``. For more details see [WRONG_NUM_ARGS](sql-error-conditions-wrong-num-args-error-class.html) - -### XML_ROW_TAG_MISSING - -[SQLSTATE: 42000](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation) Review Comment: Give a fix: https://github.com/apache/spark/pull/43817 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][SQL][FOLLOW-UP] XML: Fix SQLSTATE for missing rowTag error [spark]
LuciferYang commented on code in PR #43804: URL: https://github.com/apache/spark/pull/43804#discussion_r1394156014 ## docs/sql-error-conditions.md: ## @@ -2375,9 +2375,3 @@ The operation `` requires a ``. But `` is a The `` requires `` parameters but the actual number is ``. For more details see [WRONG_NUM_ARGS](sql-error-conditions-wrong-num-args-error-class.html) - -### XML_ROW_TAG_MISSING - -[SQLSTATE: 42000](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation) Review Comment: Yes, this needs to be fixed. Additionally, the ` SQLSTATE table start ` in the `error/README.md` file also needs to be fixed, as there is no definition for the `42KDF` status. @sandip-db , could you quickly fix this issue? Or should we revert this pr first @HyukjinKwon ? GA keeps failing because of this issue. https://github.com/apache/spark/actions/runs/6874376893/job/18695941494 ``` [info] - SQLSTATE invariants *** FAILED *** (30 milliseconds) [info] fx.apply(s) was false 42KDF (SparkThrowableSuite.scala:74) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.SparkThrowableSuite.$anonfun$checkCondition$1(SparkThrowableSuite.scala:74) [info] at scala.collection.immutable.List.foreach(List.scala:333) [info] at org.apache.spark.SparkThrowableSuite.checkCondition(SparkThrowableSuite.scala:73) [info] at org.apache.spark.SparkThrowableSuite.$anonfun$new$6(SparkThrowableSuite.scala:138) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127) [info] - Error classes match with document *** FAILED *** (71 milliseconds) [info] "...gs-error-class.html)[ [info] [info] ### XML_ROW_TAG_MISSING [info] [info] [SQLSTATE: 42KDF](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation) [info] [info] `` option is required for reading files in XML format.]" did not equal "...gs-error-class.html)[]" The error class document is not up to date. Please regenerate it. (SparkThrowableSuite.scala:346) [info] Analysis: [info] "...gs-error-class.html)[ [info] [info] ### XML_ROW_TAG_MISSING [info] [info] [SQLSTATE: 42KDF](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation) [info] [info] `` option is required for reading files in XML format.]" -> "...gs-error-class.html)[]" ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]
cloud-fan closed pull request #43781: [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first URL: https://github.com/apache/spark/pull/43781 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]
cloud-fan commented on PR #43781: URL: https://github.com/apache/spark/pull/43781#issuecomment-1812465509 The failure is unrelated ``` Extension error: Could not import extension sphinx_copybutton (exception: No module named 'sphinx_copybutton') make: *** [Makefile:35: html] Error 2 ``` I'm merging it to master, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45924][SQL] Fixing the canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with SubqueryBroadcastExec [spark]
beliefer commented on PR #43806: URL: https://github.com/apache/spark/pull/43806#issuecomment-1812439348 > Ideally that should have happened, but what I see is one stage containing the subquery adaptive exec while the incoming exchange contains the subquery exec. Also, this is just one of the issues. The main PR will be the one which requires additional functions to be implemented by the DataSourceV2 impl. Please refer to the ticket which depends on this PR. Maybe this itself can be fixed by ensuring the incoming exchange contains the right exec; then tinkering with canonicalize may not be needed. But in that case the buildPlan should be canonicalized. That is impossible. `SubqueryAdaptiveBroadcastExec` will be eliminated along with `DynamicPruningExpression`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45934][DOCS] Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout [spark]
yaooqinn commented on code in PR #43814: URL: https://github.com/apache/spark/pull/43814#discussion_r1394086066 ## docs/running-on-kubernetes.md: ## @@ -590,41 +590,41 @@ Some of these include: See the [configuration page](configuration.html) for information on Spark configurations. The following configurations are specific to Spark on Kubernetes. - Spark Properties + Spark Properties (`spark.kubernetes.` prefix is omitted.) Property NameDefaultMeaningSince Version - spark.kubernetes.context Review Comment: This might break the doc search? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]
LuciferYang commented on code in PR #43796: URL: https://github.com/apache/spark/pull/43796#discussion_r1394074332 ## common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RemoteBlockPushResolverSuite.java: ## @@ -1612,19 +1612,8 @@ private void verifyMetrics( assertEquals(expectedIgnoredBlocksBytes, ignoredBlockBytes.getCount(), "ignored block bytes"); } - private static class PushBlock { -private final int shuffleId; -private final int shuffleMergeId; -private final int mapIndex; -private final int reduceId; -private final ByteBuffer buffer; -PushBlock(int shuffleId, int shuffleMergeId, int mapIndex, int reduceId, ByteBuffer buffer) { - this.shuffleId = shuffleId; - this.shuffleMergeId = shuffleMergeId; - this.mapIndex = mapIndex; - this.reduceId = reduceId; - this.buffer = buffer; -} + private record PushBlock( + int shuffleId, int shuffleMergeId, int mapIndex, int reduceId, ByteBuffer buffer) { Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]
LuciferYang commented on code in PR #43796: URL: https://github.com/apache/spark/pull/43796#discussion_r1394064308 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleServiceMetrics.java: ## @@ -153,24 +153,6 @@ private static ShuffleServiceMetricsInfo getShuffleServiceMetricsInfoForGenericV valueName + " value of " + baseName); } - private static class ShuffleServiceMetricsInfo implements MetricsInfo { Review Comment: Yes, Modifier `static` is redundant for inner records -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
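To illustrate the point (a generic sketch, not the Spark class): a record nested inside another class is implicitly `static`, so the explicit modifier adds nothing.

```java
class Outer {
  // The two declarations below are equivalent: nested records are
  // implicitly static, so the explicit modifier is redundant.
  record Info(String name, String description) { }
  // static record Info(String name, String description) { }
}
```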
Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]
LuciferYang commented on code in PR #43796: URL: https://github.com/apache/spark/pull/43796#discussion_r1394067701 ## common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RemoteBlockPushResolverSuite.java: ## @@ -1612,19 +1612,8 @@ private void verifyMetrics( assertEquals(expectedIgnoredBlocksBytes, ignoredBlockBytes.getCount(), "ignored block bytes"); } - private static class PushBlock { -private final int shuffleId; -private final int shuffleMergeId; -private final int mapIndex; -private final int reduceId; -private final ByteBuffer buffer; -PushBlock(int shuffleId, int shuffleMergeId, int mapIndex, int reduceId, ByteBuffer buffer) { - this.shuffleId = shuffleId; - this.shuffleMergeId = shuffleMergeId; - this.mapIndex = mapIndex; - this.reduceId = reduceId; - this.buffer = buffer; -} + private record PushBlock( + int shuffleId, int shuffleMergeId, int mapIndex, int reduceId, ByteBuffer buffer) { Review Comment: Ok, let me revert this one. The previous commit did not revert it due to this in test case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]
LuciferYang commented on code in PR #43796: URL: https://github.com/apache/spark/pull/43796#discussion_r1394064308 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleServiceMetrics.java: ## @@ -153,24 +153,6 @@ private static ShuffleServiceMetricsInfo getShuffleServiceMetricsInfoForGenericV valueName + " value of " + baseName); } - private static class ShuffleServiceMetricsInfo implements MetricsInfo { Review Comment: Yes, Modifier 'static' is redundant for inner records -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45851][CONNECT][SCALA] Support multiple policies in scala client [spark]
cdkrot commented on PR #43757: URL: https://github.com/apache/spark/pull/43757#issuecomment-1812291366 cc @hvanhovell let's merge when tests pass :). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]
dongjoon-hyun commented on code in PR #43796: URL: https://github.com/apache/spark/pull/43796#discussion_r1394022682 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleServiceMetrics.java: ## @@ -153,24 +153,6 @@ private static ShuffleServiceMetricsInfo getShuffleServiceMetricsInfoForGenericV valueName + " value of " + baseName); } - private static class ShuffleServiceMetricsInfo implements MetricsInfo { Review Comment: `static class` is equivalent to `record`? Specifically, `static` part. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]
dongjoon-hyun commented on code in PR #43796: URL: https://github.com/apache/spark/pull/43796#discussion_r1394018312 ## common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RemoteBlockPushResolverSuite.java: ## @@ -1612,19 +1612,8 @@ private void verifyMetrics( assertEquals(expectedIgnoredBlocksBytes, ignoredBlockBytes.getCount(), "ignored block bytes"); } - private static class PushBlock { -private final int shuffleId; -private final int shuffleMergeId; -private final int mapIndex; -private final int reduceId; -private final ByteBuffer buffer; -PushBlock(int shuffleId, int shuffleMergeId, int mapIndex, int reduceId, ByteBuffer buffer) { - this.shuffleId = shuffleId; - this.shuffleMergeId = shuffleMergeId; - this.mapIndex = mapIndex; - this.reduceId = reduceId; - this.buffer = buffer; -} + private record PushBlock( + int shuffleId, int shuffleMergeId, int mapIndex, int reduceId, ByteBuffer buffer) { Review Comment: These fields should be `private`, shouldn't these? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45934][DOCS] Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout [spark]
dongjoon-hyun commented on PR #43814: URL: https://github.com/apache/spark/pull/43814#issuecomment-1812232936 Could you review this when you have some time, @yaooqinn ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45915][SQL] Treat decimal(x, 0) the same as IntegralType in `PromoteStrings` [spark]
dongjoon-hyun commented on PR #43812: URL: https://github.com/apache/spark/pull/43812#issuecomment-1812231907 Merged to master for Apache Spark 4.0.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45915][SQL] Treat decimal(x, 0) the same as IntegralType in `PromoteStrings` [spark]
dongjoon-hyun closed pull request #43812: [SPARK-45915][SQL] Treat decimal(x, 0) the same as IntegralType in `PromoteStrings` URL: https://github.com/apache/spark/pull/43812 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error [spark]
panbingkun opened a new pull request, #43815: URL: https://github.com/apache/spark/pull/43815 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45851][CONNECT][SCALA] Support multiple policies in scala client [spark]
cdkrot commented on code in PR #43757: URL: https://github.com/apache/spark/pull/43757#discussion_r1394006876 ## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/RetryPolicy.scala: ## @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connect.client + +import scala.concurrent.duration.{Duration, FiniteDuration} +import scala.util.Random + +import io.grpc.{Status, StatusRuntimeException} + +/** + * [[RetryPolicy]] configure the retry mechanism in [[GrpcRetryHandler]] + * + * @param maxRetries + * Maximum number of retries. + * @param initialBackoff + * Start value of the exponential backoff (ms). + * @param maxBackoff + * Maximal value of the exponential backoff (ms). + * @param backoffMultiplier + * Multiplicative base of the exponential backoff. + * @param canRetry + * Function that determines whether a retry is to be performed in the event of an error. + */ +case class RetryPolicy( + maxRetries: Option[Int] = None, + initialBackoff: FiniteDuration = FiniteDuration(1000, "ms"), + maxBackoff: Option[FiniteDuration] = None, + backoffMultiplier: Double = 1.0, + jitter: FiniteDuration = FiniteDuration(0, "s"), + minJitterThreshold: FiniteDuration = FiniteDuration(0, "s"), + canRetry: Throwable => Boolean, + name: String) { + + def getName: String = name + + def toState: RetryPolicy.RetryPolicyState = new RetryPolicy.RetryPolicyState(this) +} + +object RetryPolicy { + def defaultPolicy(): RetryPolicy = RetryPolicy( +name = "DefaultPolicy", +// Please synchronize changes here with Python side: +// pyspark/sql/connect/client/core.py +// +// Note: these constants are selected so that the maximum tolerated wait is guaranteed +// to be at least 10 minutes +maxRetries = Some(15), +initialBackoff = FiniteDuration(50, "ms"), +maxBackoff = Some(FiniteDuration(1, "min")), +backoffMultiplier = 4.0, +jitter = FiniteDuration(500, "ms"), +minJitterThreshold = FiniteDuration(2, "s"), +canRetry = defaultPolicyRetryException) + + // list of policies to be used by this client + def defaultPolicies(): Seq[RetryPolicy] = List(defaultPolicy()) + + // represents a state of the specific policy + // (how many retries have happened and how much to wait until next one) + private class RetryPolicyState(val policy: RetryPolicy) { +private var numberAttempts = 0 +private var nextWait: Duration = policy.initialBackoff + +// return waiting time until next attempt, or None if has exceeded max retries +def nextAttempt(): Option[Duration] = { + if (policy.maxRetries.isDefined && numberAttempts >= policy.maxRetries.get) { +return None + } + + numberAttempts += 1 + + var currentWait = nextWait + nextWait = nextWait * policy.backoffMultiplier + if (policy.maxBackoff.isDefined) { +nextWait = nextWait min 
policy.maxBackoff.get + } + + if (currentWait >= policy.minJitterThreshold) { +currentWait += Random.nextDouble() * policy.jitter + } + + Some(currentWait) +} + +def canRetry(throwable: Throwable): Boolean = policy.canRetry(throwable) + +def getName: String = policy.getName + } + + /** + * Default canRetry in [[RetryPolicy]]. + * + * @param e + * The exception to check. + * @return + * true if the exception is a [[StatusRuntimeException]] with code UNAVAILABLE. + */ + private[client] def defaultPolicyRetryException(e: Throwable): Boolean = { +e match { + case _: RetryPolicy.RetryException => true + case e: StatusRuntimeException => +val statusCode: Status.Code = e.getStatus.getCode + +if (statusCode == Status.Code.INTERNAL) { + val msg: String = e.toString + + // This error happens if another RPC preempts this RPC. + if (msg.contains("INVALID_CURSOR.DISCONNECTED")) { +return true + } +} + +if (statusCode == Status.Code.UNAVAILABLE) { + return true +} +false + case _ => false +} + } + + /** + * An exception that can be thrown upstream when inside retry and which will be always
Re: [PR] [SPARK-45851][CONNECT][SCALA] Support multiple policies in scala client [spark]
cdkrot commented on code in PR #43757: URL: https://github.com/apache/spark/pull/43757#discussion_r1394006876 ## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/RetryPolicy.scala: ## @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connect.client + +import scala.concurrent.duration.{Duration, FiniteDuration} +import scala.util.Random + +import io.grpc.{Status, StatusRuntimeException} + +/** + * [[RetryPolicy]] configure the retry mechanism in [[GrpcRetryHandler]] + * + * @param maxRetries + * Maximum number of retries. + * @param initialBackoff + * Start value of the exponential backoff (ms). + * @param maxBackoff + * Maximal value of the exponential backoff (ms). + * @param backoffMultiplier + * Multiplicative base of the exponential backoff. + * @param canRetry + * Function that determines whether a retry is to be performed in the event of an error. + */ +case class RetryPolicy( + maxRetries: Option[Int] = None, + initialBackoff: FiniteDuration = FiniteDuration(1000, "ms"), + maxBackoff: Option[FiniteDuration] = None, + backoffMultiplier: Double = 1.0, + jitter: FiniteDuration = FiniteDuration(0, "s"), + minJitterThreshold: FiniteDuration = FiniteDuration(0, "s"), + canRetry: Throwable => Boolean, + name: String) { + + def getName: String = name + + def toState: RetryPolicy.RetryPolicyState = new RetryPolicy.RetryPolicyState(this) +} + +object RetryPolicy { + def defaultPolicy(): RetryPolicy = RetryPolicy( +name = "DefaultPolicy", +// Please synchronize changes here with Python side: +// pyspark/sql/connect/client/core.py +// +// Note: these constants are selected so that the maximum tolerated wait is guaranteed +// to be at least 10 minutes +maxRetries = Some(15), +initialBackoff = FiniteDuration(50, "ms"), +maxBackoff = Some(FiniteDuration(1, "min")), +backoffMultiplier = 4.0, +jitter = FiniteDuration(500, "ms"), +minJitterThreshold = FiniteDuration(2, "s"), +canRetry = defaultPolicyRetryException) + + // list of policies to be used by this client + def defaultPolicies(): Seq[RetryPolicy] = List(defaultPolicy()) + + // represents a state of the specific policy + // (how many retries have happened and how much to wait until next one) + private class RetryPolicyState(val policy: RetryPolicy) { +private var numberAttempts = 0 +private var nextWait: Duration = policy.initialBackoff + +// return waiting time until next attempt, or None if has exceeded max retries +def nextAttempt(): Option[Duration] = { + if (policy.maxRetries.isDefined && numberAttempts >= policy.maxRetries.get) { +return None + } + + numberAttempts += 1 + + var currentWait = nextWait + nextWait = nextWait * policy.backoffMultiplier + if (policy.maxBackoff.isDefined) { +nextWait = nextWait min 
policy.maxBackoff.get + } + + if (currentWait >= policy.minJitterThreshold) { +currentWait += Random.nextDouble() * policy.jitter + } + + Some(currentWait) +} + +def canRetry(throwable: Throwable): Boolean = policy.canRetry(throwable) + +def getName: String = policy.getName + } + + /** + * Default canRetry in [[RetryPolicy]]. + * + * @param e + * The exception to check. + * @return + * true if the exception is a [[StatusRuntimeException]] with code UNAVAILABLE. + */ + private[client] def defaultPolicyRetryException(e: Throwable): Boolean = { +e match { + case _: RetryPolicy.RetryException => true + case e: StatusRuntimeException => +val statusCode: Status.Code = e.getStatus.getCode + +if (statusCode == Status.Code.INTERNAL) { + val msg: String = e.toString + + // This error happens if another RPC preempts this RPC. + if (msg.contains("INVALID_CURSOR.DISCONNECTED")) { +return true + } +} + +if (statusCode == Status.Code.UNAVAILABLE) { + return true +} +false + case _ => false +} + } + + /** + * An exception that can be thrown upstream when inside retry and which will be always
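The diff above is truncated in this digest, but the gist of `nextAttempt()` is an exponential backoff capped at `maxBackoff`, with jitter added past a threshold. The snippet below is not part of the PR; it only reproduces the backoff arithmetic from the diff to show the schedule the default constants produce (jitter omitted).

```scala
import scala.concurrent.duration._

// Reproduces only the backoff arithmetic from the diff above, for intuition.
def backoffs(initial: FiniteDuration, multiplier: Double,
    max: FiniteDuration, retries: Int): Seq[Duration] =
  Iterator.iterate(initial: Duration)(w => (w * multiplier) min max).take(retries).toSeq

// Defaults in this PR: 50 ms start, x4 growth, 1 min cap, 15 retries.
// Waits: 50ms, 200ms, 800ms, 3.2s, 12.8s, 51.2s, then 1 min each --
// summing to a bit over 10 minutes, matching the comment in the code.
backoffs(50.millis, 4.0, 1.minute, 15).foreach(println)
```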
[PR] [SPARK-45934][DOCS] Fix `spark-standalone.md` and `running-on-kubernetes.md` table layout [spark]
dongjoon-hyun opened a new pull request, #43814: URL: https://github.com/apache/spark/pull/43814 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45851][CONNECT][SCALA] Support multiple policies in scala client [spark]
cdkrot commented on code in PR #43757: URL: https://github.com/apache/spark/pull/43757#discussion_r1394001717 ## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/RetryPolicy.scala: ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connect.client + +import scala.concurrent.duration.{Duration, FiniteDuration} +import scala.util.Random + +import io.grpc.{Status, StatusRuntimeException} + +/** + * [[RetryPolicy]] configure the retry mechanism in [[GrpcRetryHandler]] + * + * @param maxRetries + * Maximum number of retries. + * @param initialBackoff + * Start value of the exponential backoff (ms). + * @param maxBackoff + * Maximal value of the exponential backoff (ms). + * @param backoffMultiplier + * Multiplicative base of the exponential backoff. + * @param canRetry + * Function that determines whether a retry is to be performed in the event of an error. + */ +case class RetryPolicy( + maxRetries: Option[Int] = None, + initialBackoff: FiniteDuration = FiniteDuration(1000, "ms"), + maxBackoff: Option[FiniteDuration] = None, + backoffMultiplier: Double = 1.0, + jitter: FiniteDuration = FiniteDuration(0, "s"), + minJitterThreshold: FiniteDuration = FiniteDuration(0, "s"), + canRetry: Throwable => Boolean, + name: String) { + + def getName: String = name + + def toState: RetryPolicy.RetryPolicyState = new RetryPolicy.RetryPolicyState(this) +} + +object RetryPolicy { + def defaultPolicy(): RetryPolicy = RetryPolicy( +name = "DefaultPolicy", +// Please synchronize changes here with Python side: +// pyspark/sql/connect/client/core.py Review Comment: Might make sense :). I had a reasoning that Prs should be linearizable and did the change there and let it stay there then -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]
LuciferYang commented on PR #43796: URL: https://github.com/apache/spark/pull/43796#issuecomment-1812139559 > > @dongjoon-hyun I want to clarify the issue. We don't want to use `record` here because `field` in the original class doesn't provide an Accessor, but since `record` automatically generates an Accessor method, it exposes the information of `field`. So, in similar scenarios, we shouldn't use `record`, right? > > Yes, exactly. That's the point. Is it ok now? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
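A tiny illustration of the point being confirmed here (generic names, not Spark code): converting a class whose field has no getter into a record implicitly adds a public accessor, widening the API.

```java
// Before: the field is private and has no getter, so callers cannot read it.
class HiddenField {
  private final int value;

  HiddenField(int value) {
    this.value = value;
  }
}

// After: a record automatically generates a public value() accessor,
// which is why record is avoided when the field should stay hidden.
record ExposedField(int value) { }
```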
Re: [PR] [SPARK-44496][SQL][FOLLOW-UP] CalendarIntervalType is also orderable [spark]
cloud-fan commented on PR #43805: URL: https://github.com/apache/spark/pull/43805#issuecomment-1812104254 So we just need to update the doc -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR] Fix some typo [spark]
panbingkun commented on code in PR #43724: URL: https://github.com/apache/spark/pull/43724#discussion_r1393903025 ## docs/cloud-integration.md: ## @@ -121,12 +121,12 @@ for talking to cloud infrastructures, in which case this module may not be neede Spark jobs must authenticate with the object stores to access data within them. 1. When Spark is running in a cloud infrastructure, the credentials are usually automatically set up. -1. `spark-submit` is able to read the `AWS_ENDPOINT_URL`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` +2. `spark-submit` is able to read the `AWS_ENDPOINT_URL`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45904][SQL][CONNECT] Mode function should supports sort with order direction [spark]
cloud-fan commented on code in PR #43786: URL: https://github.com/apache/spark/pull/43786#discussion_r1393889797 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala: ## @@ -842,19 +842,20 @@ object functions { * @group agg_funcs * @since 3.4.0 */ - def mode(e: Column): Column = mode(e, deterministic = false) + def mode(e: Column): Column = Column.fn("mode", e) /** * Aggregate function: returns the most frequent value in a group. * - * When multiple values have the same greatest frequency then either any of values is returned - * if deterministic is false or is not defined, or the lowest value is returned if deterministic - * is true. + * If there are multiple values with the greatest frequency only one value will be returned. The + * value will be chosen based on optional sort direction. Use ascending order to get smallest + * value or descending order to get largest value from multiple values with the same frequency. + * If this clause is not specified the exact chosen value is not determined. * * @group agg_funcs * @since 4.0.0 */ - def mode(e: Column, deterministic: Boolean): Column = Column.fn("mode", e, lit(deterministic)) + def mode(e: Column, isSortAsc: Boolean): Column = Column.fn("mode", e, lit(isSortAsc)) Review Comment: If we do want to support specifying ordering, we should support ordering columns. I don't agree with this partial API change. Let's finish the SQL side first, then we can think about what the Scala API should look like. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
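For orientation only, this is how the overload proposed in the diff above would be used; the signature is disputed in the review comment and may change, and the snippet assumes an existing SparkSession named `spark`.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(1, 1, 2, 2, 3).toDF("v")

// Existing behaviour: any one of the tied most-frequent values.
df.select(mode($"v")).show()

// Overload proposed in this PR's diff: ascending picks the smallest tied value.
df.select(mode($"v", true)).show()
```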
Re: [PR] [SPARK-45915][SQL] Treat decimal(x, 0) the same as IntegralType in `PromoteStrings` [spark]
wangyum commented on code in PR #43812: URL: https://github.com/apache/spark/pull/43812#discussion_r1393881574 ## sql/core/src/test/resources/sql-tests/analyzer-results/typeCoercion/native/binaryComparison.sql.out: ## @@ -1330,7 +1330,7 @@ Project [NOT (cast(cast(null as string) as bigint) = cast(1 as bigint)) AS (NOT -- !query SELECT cast(1 as decimal(10, 0)) = '1' FROM t -- !query analysis -Project [(cast(cast(1 as decimal(10,0)) as double) = cast(1 as double)) AS (CAST(1 AS DECIMAL(10,0)) = 1)#x] +Project [(cast(1 as decimal(10,0)) = cast(1 as decimal(10,0))) AS (CAST(1 AS DECIMAL(10,0)) = 1)#x] Review Comment: Yes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
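To see the behaviour change described by the analyzer output above (assuming an existing SparkSession named `spark`): a scale-0 decimal compared to a string literal now stays in the decimal domain instead of both sides being promoted to double.

```scala
// Before the PR the analyzed plan casts both sides to DOUBLE; with the PR the
// string side is cast to DECIMAL(10,0), as shown in the .sql.out diff above.
spark.sql("SELECT CAST(1 AS DECIMAL(10, 0)) = '1'").explain(true)
```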
Re: [PR] [SPARK-45924][SQL] Fixing the canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with SubqueryBroadcastExec [spark]
ahshahid commented on PR #43806: URL: https://github.com/apache/spark/pull/43806#issuecomment-1812050963 > AFAIK, the `SubqueryAdaptiveBroadcastExec` is only used for dynamic partition pruning. `SubqueryAdaptiveBroadcastExec` will be replaced with `SubqueryBroadcastExec`, and the latter must reuse the broadcast exchange. Ideally that should have happened, but what I see is one stage containing the subquery adaptive exec while the incoming exchange contains the subquery exec. Also, this is just one of the issues. The main PR will be the one which requires additional functions to be implemented by the DataSourceV2 impl. Please refer to the ticket which depends on this PR. Maybe this itself can be fixed by ensuring the incoming exchange contains the right exec; then tinkering with canonicalize may not be needed. But in that case the buildPlan should be canonicalized. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][DOCS] Clarify collect_list -> ArrayType [spark]
landlord-matt commented on PR #43787: URL: https://github.com/apache/spark/pull/43787#issuecomment-1812049206 @HyukjinKwon: I now also did a similar thing for collect_set. Any comments on the proposal? I thought you would appreciate this suggestion :'( -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45924][SQL] Fixing the canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with SubqueryBroadcastExec [spark]
beliefer commented on code in PR #43806: URL: https://github.com/apache/spark/pull/43806#discussion_r1393856527 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryAdaptiveBroadcastExec.scala: ## @@ -46,7 +46,8 @@ case class SubqueryAdaptiveBroadcastExec( protected override def doCanonicalize(): SparkPlan = { val keys = buildKeys.map(k => QueryPlan.normalizeExpressions(k, child.output)) -copy(name = "dpp", buildKeys = keys, child = child.canonicalized) +SubqueryBroadcastExec(name = "dpp", index = index, buildKeys = keys, Review Comment: The implementation seems to go against the design. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45924][SQL] Fixing the canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with SubqueryBroadcastExec [spark]
beliefer commented on PR #43806: URL: https://github.com/apache/spark/pull/43806#issuecomment-1812033293 AFAIK, the `SubqueryAdaptiveBroadcastExec` is only used for dynamic partition pruning. `SubqueryAdaptiveBroadcastExec` will be replaced with `SubqueryBroadcastExec`, and the latter must reuse the broadcast exchange. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45764][PYTHON][DOCS] Make code block copyable [spark]
HyukjinKwon commented on PR #43799: URL: https://github.com/apache/spark/pull/43799#issuecomment-1812018431 @panbingkun would you mind creating a backporting PR? Actually yeah I think it's an important improvement in docs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]
dongjoon-hyun commented on PR #43796: URL: https://github.com/apache/spark/pull/43796#issuecomment-1812013863 > @dongjoon-hyun I want to clarify the issue. We don't want to use `record` here because `field` in the original class doesn't provide an Accessor, but since `record` automatically generates an Accessor method, it exposes the information of `field`. So, in similar scenarios, we shouldn't use `record`, right? Yes, exactly. That's the point. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][SQL][FOLLOW-UP] XML: Fix SQLSTATE for missing rowTag error [spark]
beliefer commented on code in PR #43804: URL: https://github.com/apache/spark/pull/43804#discussion_r1393832007 ## docs/sql-error-conditions.md: ## @@ -2375,9 +2375,3 @@ The operation `` requires a ``. But `` is a The `` requires `` parameters but the actual number is ``. For more details see [WRONG_NUM_ARGS](sql-error-conditions-wrong-num-args-error-class.html) - -### XML_ROW_TAG_MISSING - -[SQLSTATE: 42000](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation) Review Comment: We don't need to update this one? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45926][SQL] Implementing equals and hashCode which takes into account pushed runtime filters , in InMemoryTable related scans [spark]
beliefer commented on PR #43808: URL: https://github.com/apache/spark/pull/43808#issuecomment-1812006973 It seems this PR is unrelated to runtime filters. I guess what you mean is DS V2 filter pushdown. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45915][SQL] Treat decimal(x, 0) the same as IntegralType in `PromoteStrings` [spark]
dongjoon-hyun commented on code in PR #43812: URL: https://github.com/apache/spark/pull/43812#discussion_r1393823245 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala: ## @@ -934,8 +934,8 @@ object TypeCoercion extends TypeCoercionBase { // There is no proper decimal type we can pick, // using double type is the best we can do. // See SPARK-22469 for details. -case (n: DecimalType, s: StringType) => Some(DoubleType) -case (s: StringType, n: DecimalType) => Some(DoubleType) +case (DecimalType.Fixed(_, s), _: StringType) if s > 0 => Some(DoubleType) +case (_: StringType, DecimalType.Fixed(_, s)) if s > 0 => Some(DoubleType) Review Comment: Could you add some description about `s == 0` case after the line 936, please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45898][SQL] Support groupingSets operation in dataframe api [spark]
JacobZheng0927 opened a new pull request, #43813: URL: https://github.com/apache/spark/pull/43813 ### What changes were proposed in this pull request? Add a `groupingSets` method to the Dataset API. `select col1, col2, col3, sum(col4) FROM t GROUP BY col1, col2, col3 GROUPING SETS ((col1, col2), ())` This SQL can be equivalently replaced with the following code: `df.groupingSets(Seq(Seq("col1", "col2"), Seq()), "col1", "col2", "col3").sum("col4")` ### Why are the changes needed? Currently grouping sets can only be used in Spark SQL. This feature is not available when developing with the Dataset API. ### Does this PR introduce _any_ user-facing change? Yes. This PR introduces the use of groupingSets in the Dataset API. ### How was this patch tested? Tests added in `DataFrameAggregateSuite.scala`. ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
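A sketch of the proposed usage, built from the example in the PR description above. The exact signature is whatever the PR finally implements; the snippet assumes an existing SparkSession named `spark`.

```scala
import spark.implicits._

val df = Seq(
  ("a", "x", "p", 1),
  ("a", "y", "p", 2),
  ("b", "x", "q", 3)
).toDF("col1", "col2", "col3", "col4")

// Equivalent to:
//   SELECT col1, col2, col3, sum(col4) FROM t
//   GROUP BY col1, col2, col3 GROUPING SETS ((col1, col2), ())
df.groupingSets(Seq(Seq("col1", "col2"), Seq()), "col1", "col2", "col3")
  .sum("col4")
  .show()
```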
Re: [PR] [SPARK-45915][SQL] Treat decimal(x, 0) the same as IntegralType in `PromoteStrings` [spark]
dongjoon-hyun commented on code in PR #43812: URL: https://github.com/apache/spark/pull/43812#discussion_r1393820692 ## sql/core/src/test/resources/sql-tests/analyzer-results/typeCoercion/native/binaryComparison.sql.out: ## @@ -1330,7 +1330,7 @@ Project [NOT (cast(cast(null as string) as bigint) = cast(1 as bigint)) AS (NOT -- !query SELECT cast(1 as decimal(10, 0)) = '1' FROM t -- !query analysis -Project [(cast(cast(1 as decimal(10,0)) as double) = cast(1 as double)) AS (CAST(1 AS DECIMAL(10,0)) = 1)#x] +Project [(cast(1 as decimal(10,0)) = cast(1 as decimal(10,0))) AS (CAST(1 AS DECIMAL(10,0)) = 1)#x] Review Comment: So, is this the main motivation? `cast(1 as double)` -> `cast(1 as decimal(10,0))`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44496][SQL][FOLLOW-UP] CalendarIntervalType is also orderable [spark]
cloud-fan commented on PR #43805: URL: https://github.com/apache/spark/pull/43805#issuecomment-1811991530 it's not really comparable as we can't compare `30 days` and `1 month`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]
cloud-fan commented on PR #41072: URL: https://github.com/apache/spark/pull/41072#issuecomment-1811989510 SGTM. @thepinetree can you help to create backport PRs? thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45764][PYTHON][DOCS] Make code block copyable [spark]
zhengruifeng commented on PR #43799: URL: https://github.com/apache/spark/pull/43799#issuecomment-1811986470 shall we backport this to other branches? so that new maintenance releases will benefit from this fix -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45925][SQL] Making SubqueryBroadcastExec equivalent to SubqueryAdaptiveBroadcastExec [spark]
ahshahid commented on PR #43807: URL: https://github.com/apache/spark/pull/43807#issuecomment-1811975176 I think if I get a clean run on PR [SPARK-45924](https://github.com/apache/spark/pull/43806), then this PR can be closed without merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45924][SQL] Fixing the canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with SubqueryBroadcastExec [spark]
ahshahid commented on PR #43806: URL: https://github.com/apache/spark/pull/43806#issuecomment-1811971501 I have reworked the PR to just canonicalize the SubqueryAdaptiveBroadcastExec as SubqueryBroadcastExec. This also fixes the exchange reuse issue and seems to be a less impactful change. It also means that PR [SPARK-45925](https://github.com/apache/spark/pull/43807) can be closed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org